
Minerva: A Programmable Memory Test Benchmark for Language Models

M Xia, V Ruehle, S Rajmohan, R Shokri
Microsoft Research
arXiv, February 2025
Memory Benchmark

📝 Paper Summary

Memory recall · Context utilization benchmarks · Long-context evaluation
The Minerva Memory Test framework automatically generates atomic and composite tests to evaluate specific memory capabilities of LLMs beyond simple search, revealing that strong retrieval does not imply effective reasoning or state tracking.
Core Problem
Existing long-context benchmarks primarily focus on simple retrieval (e.g., 'needle-in-a-haystack'), failing to capture complex capabilities like state tracking, editing, or comparing information distributed across context.
Why it matters:
  • Static benchmarks are prone to overfitting and becoming obsolete as models improve.
  • Performing real-world tasks requires multiple memory capabilities (synthesis, association, temporal tracking), not just search.
  • Current evaluations mask specific weaknesses; a model might pass a retrieval test but fail to update or compare the retrieved information.
Concrete Example: In a 'Stateful Processing' task where an assistant must track a numerical value modified by a sequence of operations (e.g., 'add 5', 'subtract 2') scattered through a long context, models like Phi-3-medium fail almost completely despite performing well on basic search.
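To make the example above concrete, here is a minimal sketch of how such a stateful-processing test could be generated. The function name `make_stateful_test`, the filler sentence, and the prompt wording are all hypothetical and are not taken from the Minerva code; the sketch only illustrates the idea of scattering operations through a long context and computing the ground truth alongside.

```python
import random

# Hypothetical distractor text; a real test would likely use varied filler.
FILLER = "The weather report mentioned nothing unusual today."

def make_stateful_test(num_steps, filler_per_step=3, seed=0):
    """Build a context with add/subtract operations scattered among filler.

    Returns (context, expected_final_value). All names and prompt wording
    here are illustrative assumptions, not the benchmark's actual format.
    """
    rng = random.Random(seed)          # seeded so each test is reproducible
    value = rng.randint(0, 20)
    lines = [f"The counter starts at {value}."]
    for _ in range(num_steps):
        # Pad the context so the operations are spread far apart.
        lines.extend([FILLER] * filler_per_step)
        amount = rng.randint(1, 9)
        if rng.random() < 0.5:
            lines.append(f"Add {amount} to the counter.")
            value += amount
        else:
            lines.append(f"Subtract {amount} from the counter.")
            value -= amount
    lines.append("What is the final value of the counter?")
    return "\n".join(lines), value
```

Raising `num_steps` or `filler_per_step` lengthens the context and the chain of updates the model must track, which is one plausible way to adjust difficulty.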
Key Novelty
Programmable Framework for Atomic and Composite Memory Tests
  • Decomposes memory usage into 'atomic' capabilities (Search, Recall & Edit, Match & Compare, Compute on Sets, Stateful Processing) to isolate specific model strengths and weaknesses.
  • Introduces 'composite' tests that combine these atoms to simulate real-world complexity, such as tracking information flow across distinct memory compartments (e.g., Theory of Mind scenarios).
  • Uses parametric programs to generate randomized, dynamic test cases, preventing overfitting and allowing adjustable difficulty (e.g., context length, complexity).
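The idea of a parametric test program can be sketched as a seeded generator whose arguments control difficulty. The example below is an assumption-laden illustration in the spirit of a 'Match & Compare' or 'Compute on Sets' test; the function `make_common_items_test`, its parameters, and the prompt text are hypothetical and do not come from the paper.

```python
import random

def make_common_items_test(list_size, overlap, seed):
    """Generate two shuffled lists that share exactly `overlap` numbers.

    Hypothetical sketch: `list_size` and `overlap` act as difficulty knobs,
    and `seed` yields a fresh but reproducible instance. Requires
    0 <= overlap <= list_size.
    """
    rng = random.Random(seed)
    # Draw distinct numbers so the intersection is exactly the shared part.
    pool = rng.sample(range(1000, 10000), 2 * list_size - overlap)
    shared = pool[:overlap]
    only_a = pool[overlap:list_size]
    only_b = pool[list_size:]
    list_a = shared + only_a
    list_b = shared + only_b
    rng.shuffle(list_a)
    rng.shuffle(list_b)
    prompt = (
        f"List A: {list_a}\n"
        f"List B: {list_b}\n"
        "Which numbers appear in both lists?"
    )
    return {"prompt": prompt, "answer": set(shared),
            "list_a": list_a, "list_b": list_b}
```

Because every instance is drawn from a seed rather than stored as a fixed file, a model cannot pass by memorizing published test cases, and longer lists or smaller overlaps can be requested as models improve.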
Evaluation Highlights
  • GPT-4o achieves near-perfect performance on integer state tracking (accuracy ~1.0 within 200 steps), while open-source models like Mistral and Phi-3 fail almost completely (accuracy near 0.0).
  • All models suffer significant performance drops on composite tasks; e.g., on 'Theory of Mind', even GPT-4o falls to ~0.45 accuracy, well below its strong performance on the atomic tests.
  • Models exhibit high variance on 'Recall & Edit' tasks; for example, functional updates (e.g., 'subtract 1 from every number') cause steep accuracy declines compared to simple snapshot recall.
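As an illustration of the functional-update case mentioned above, the following sketch builds a 'subtract 1 from every number' test and scores a response by exact match. The names `make_functional_update_test` and `exact_match` are hypothetical, and the scoring rule is an assumption; the paper may use a different prompt format or metric.

```python
import random
import re

def make_functional_update_test(length, seed=0):
    """Return (prompt, original_numbers, expected_numbers).

    Hypothetical sketch of a functional-update test; not the paper's format.
    """
    rng = random.Random(seed)
    numbers = [rng.randint(1, 99) for _ in range(length)]
    prompt = (
        f"Here is a list: {numbers}\n"
        "Subtract 1 from every number and repeat the list."
    )
    expected = [n - 1 for n in numbers]
    return prompt, numbers, expected

def exact_match(model_output, expected):
    """Score 1.0 if the integers found in the output equal `expected`, in order."""
    found = [int(x) for x in re.findall(r"-?\d+", model_output)]
    return float(found == expected)
```

Under this assumed scorer, a response that merely echoes the original list (plain snapshot recall) gets no credit, which separates recall from the edit step.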
Breakthrough Assessment
8/10
Provides a much-needed, granular taxonomy of memory capabilities beyond simple retrieval. The programmable nature and focus on state/logic within context offer a robust tool for diagnosing LLM limitations.