Minerva: A Programmable Memory Test Benchmark for Language Models

📝 Paper Summary

Memory recall, Context utilization benchmarks, Long-context evaluation

The Minerva Memory Test framework automatically generates atomic and composite tests to evaluate specific memory capabilities of LLMs beyond simple search, revealing that strong retrieval does not imply effective reasoning or state tracking.

Core Problem

Existing long-context benchmarks primarily focus on simple retrieval (e.g., 'needle-in-a-haystack'), failing to capture complex capabilities like state tracking, editing, or comparing information distributed across context.

Why it matters:

Static benchmarks are prone to overfitting and becoming obsolete as models improve.
Performing real-world tasks requires multiple memory capabilities (synthesis, association, temporal tracking), not just search.
Current evaluations mask specific weaknesses; a model might pass a retrieval test but fail to update or compare the retrieved information.

Concrete Example: In a 'Stateful Processing' task where an assistant must track a numerical value modified by a sequence of operations (e.g., 'add 5', 'subtract 2') scattered through a long context, models like Phi-3-medium fail almost completely despite performing well on basic search.

Key Novelty

Programmable Framework for Atomic and Composite Memory Tests

Decomposes memory usage into 'atomic' capabilities (Search, Recall & Edit, Match & Compare, Compute on Sets, Stateful Processing) to isolate specific model strengths and weaknesses.
Introduces 'composite' tests that combine these atoms to simulate real-world complexity, such as tracking information flow across distinct memory compartments (e.g., Theory of Mind scenarios).
Uses parametric programs to generate randomized, dynamic test cases, preventing overfitting and allowing adjustable difficulty (e.g., context length, complexity).
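The idea of a parametric test program can be sketched in a few lines of Python. This is an illustrative sketch only, not the framework's actual code: the function name `make_quantity_tracking_test`, its parameters, and the sentence templates are all hypothetical. It shows how a seed and a difficulty parameter (number of operations) produce a fresh test instance with a programmatically known answer.

```python
import random

def make_quantity_tracking_test(num_steps: int, seed: int, start: int = 10):
    """Hypothetical generator: returns (context, question, answer).

    num_steps controls difficulty; seed makes each instance reproducible
    while still differing from any fixed, published dataset.
    """
    rng = random.Random(seed)
    value = start
    lines = [f"The counter starts at {start}."]
    for _ in range(num_steps):
        amount = rng.randint(1, 9)
        if rng.random() < 0.5:
            value += amount
            lines.append(f"Add {amount} to the counter.")
        else:
            value -= amount
            lines.append(f"Subtract {amount} from the counter.")
    context = "\n".join(lines)
    question = "What is the final value of the counter?"
    # The ground truth is computed by the generator itself.
    return context, question, str(value)

context, question, answer = make_quantity_tracking_test(num_steps=5, seed=0)
print(context)
print(question, "->", answer)
```

Because the ground truth comes from the same program that writes the context, changing `seed` or `num_steps` yields new instances without any manual labeling.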

Evaluation Highlights

GPT-4o achieves near-perfect performance on integer state tracking (accuracy ~1.0 within 200 steps), while open-source models like Mistral and Phi-3 fail almost completely (accuracy near 0.0).
All models suffer significant performance drops on composite tasks; e.g., on 'Theory of Mind', even GPT-4o drops to ~0.45 accuracy compared to its strong atomic performance.
Models exhibit high variance on 'Recall and Edit' tasks; for example, functional updates (e.g., 'subtract 1 from every number') cause steep declines compared to simple snapshot recall.
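The contrast between snapshot recall and a functional update can be made concrete with a small sketch. The prompt wording, the function name `make_recall_edit_tests`, and the number range are assumptions for illustration; the paper's actual templates are in its Appendix.

```python
import random

def make_recall_edit_tests(length: int, seed: int):
    """Hypothetical pair of tests built from the same random list.

    Returns two (prompt, expected_output) tuples:
    - snapshot recall: reproduce the list unchanged
    - functional update: apply a rule to every element
    """
    rng = random.Random(seed)
    numbers = [rng.randint(1, 99) for _ in range(length)]
    listing = " ".join(str(n) for n in numbers)

    snapshot = (f"Repeat the following list exactly: {listing}", listing)

    updated = " ".join(str(n - 1) for n in numbers)
    functional = (
        f"Subtract 1 from every number in the following list: {listing}",
        updated,
    )
    return snapshot, functional

snapshot, functional = make_recall_edit_tests(length=5, seed=1)
print(snapshot[0])
print(functional[0])
```

Both tests share the same context, so any gap in scores can be attributed to the edit operation rather than to the content being recalled.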

Breakthrough Assessment

8/10

Provides a much-needed, granular taxonomy of memory capabilities beyond simple retrieval. The programmable nature and focus on state/logic within context offer a robust tool for diagnosing LLM limitations.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of Large Language Models' ability to utilize input context (memory) to perform specific retrieval, reasoning, and synthesis tasks.

Inputs: A generated context $C$ containing information (text, lists, sets, operations) and an instruction/query $Q$.

Outputs: A response $A$ that demonstrates the correct extraction, manipulation, or tracking of information from $C$.

Pipeline Flow

Test Configuration (User selects test type, difficulty parameters)
Data Generation (Parametric program generates randomized context and queries)
Model Inference (LLM processes context and generates answer)
Evaluation (Automated scoring using Exact Match, ROUGE-L, or Jaccard)
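The four stages above can be sketched end to end as follows. This is a minimal sketch under stated assumptions: the word-presence task, the prompt format, and the names `generate_sample`, `run_benchmark`, and `oracle_model` are hypothetical, and the model is any callable from prompt string to answer string (a stub stands in for a real LLM call).

```python
import random
from typing import Callable, Tuple

def generate_sample(rng: random.Random, context_words: int) -> Tuple[str, str]:
    """Data generation: an illustrative word-presence search test."""
    vocab = [f"w{i}" for i in range(1000)]
    words = rng.sample(vocab, context_words)  # needs context_words <= 1000
    if rng.random() < 0.5:
        query, answer = rng.choice(words), "yes"
    else:
        query, answer = "w1000", "no"  # never in vocab, so always absent
    prompt = (
        f"Context: {' '.join(words)}\n"
        f"Is the word '{query}' in the context? Answer yes or no."
    )
    return prompt, answer

def exact_match(prediction: str, reference: str) -> float:
    """Evaluation: case-insensitive exact match."""
    return float(prediction.strip().lower() == reference.strip().lower())

def run_benchmark(model: Callable[[str], str], num_samples: int = 20,
                  context_words: int = 50, seed: int = 0) -> float:
    """Test configuration -> generation -> inference -> scoring."""
    rng = random.Random(seed)
    scores = []
    for _ in range(num_samples):
        prompt, answer = generate_sample(rng, context_words)
        scores.append(exact_match(model(prompt), answer))
    return sum(scores) / len(scores)

def oracle_model(prompt: str) -> str:
    """Stub standing in for an LLM: solves the task by parsing the prompt."""
    context_line, question_line = prompt.split("\n")
    words = context_line[len("Context: "):].split()
    query = question_line.split("'")[1]
    return "yes" if query in words else "no"

print(run_benchmark(oracle_model))
```

Swapping `oracle_model` for a function that calls a real model leaves the rest of the loop unchanged, which is the main appeal of keeping the stages separate.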

System Modules

Test Generator

Create randomized test samples based on specific templates (Atomic or Composite)

Model or implementation: Parametric Python Scripts

Evaluated Model

Process the generated context and instruction to produce an answer

Model or implementation: Various LLMs (e.g., GPT-4o, LLaMA-3.1-8b)

Scorer

Compare model output to ground truth

Model or implementation: Deterministic Logic

Novel Architectural Elements

Programmable benchmark architecture: Tests are defined as code templates rather than static datasets, allowing dynamic generation of 'fresh' instances.
Composite test logic: Specifically designed templates that enforce dependencies between different memory segments (e.g., distinct compartments for different entities).

Modeling

Base Model: Evaluation targets: GPT-4-turbo, GPT-4o, GPT-4o-mini, Cohere-command-rplus, Mistral-7b-instruct-v02, Phi-3-small-128k, Phi-3-medium-128k, LLaMA-3.1-8b-instruct, Gemma-2-9b

Compute: Not reported in the paper

Comparison to Prior Work

vs. NIAH: Introduces state tracking, editing, and set operations rather than just static retrieval.
vs. Static Benchmarks: Uses parametric generation to avoid overfitting and allow arbitrary context scaling.
vs. RULER: Focuses specifically on isolating 'atomic' memory operations (like 'spot the difference' or 'functional update') rather than on general NLP tasks.
Novel contribution: First extensive framework explicitly categorizing and testing 'memory usage' capabilities (retrieval vs. reasoning vs. state tracking) individually and in composition.

Limitations

Evaluation primarily fixed at 4k context length for main comparison (though scalable).
Focuses on synthetic/structured data rather than naturalistic noisy documents.
Composite tasks are limited to specific scenarios (Data Blocks, Theory of Mind) and may not cover all real-world complexities.
Reliance on exact match/rigid metrics for some tasks might penalize valid but differently formatted answers.

Reproducibility

Code: https://github.com/microsoft/minerva_memory_test

Code and data are available at https://github.com/microsoft/minerva_memory_test. The paper's Appendix provides detailed templates and logic for the tests. Evaluation hyperparameters (context length 4k, temperature 0) are specified.

📊 Experiments & Results

Evaluation Setup

Controlled generation of synthetic tasks evaluating memory capabilities.

Benchmarks:

Minerva Memory Test (Search) (Information Retrieval) [New]
Minerva Memory Test (Recall and Edit) (Content Transfer and Synthesis) [New]
Minerva Memory Test (Match and Compare) (Processing and Basic Reasoning) [New]
Minerva Memory Test (Stateful Processing) (Complex State Tracking) [New]

Metrics:

Exact Match Accuracy
ROUGE-L
Jaccard Similarity
Statistical methodology: Not explicitly reported in the paper
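For reference, the three metrics can be sketched as plain Python functions. The tokenization (whitespace split), the normalization in exact match, and the use of the F1 form of ROUGE-L are assumptions here; the summary does not state which variants the paper uses.

```python
from typing import Iterable, Sequence

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings agree after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall over tokens."""
    pred, ref = prediction.split(), reference.split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def jaccard(predicted: Iterable, reference: Iterable) -> float:
    """|A intersect B| / |A union B|; two empty sets count as identical."""
    a, b = set(predicted), set(reference)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(exact_match("42", " 42 "))
print(rouge_l_f1("a b c d", "a c"))
print(jaccard({"apple", "fig"}, {"fig", "grape"}))
```

Exact match suits single-value answers such as a final counter value, ROUGE-L tolerates small deviations in reproduced sequences, and Jaccard ignores ordering, which fits set-valued answers.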

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Search tasks show that while most models handle simple retrieval well, subsequence search (finding multi-word phrases) remains challenging.
String Search (Subsequence)	Accuracy	0.55	0.90	+0.35
Recall and Edit tasks reveal that functional updates (applying a rule to change data) are much harder than simple recall.
Functional Update	ROUGE-L	1.00	0.80	-0.20
Stateful Processing tasks show the most dramatic separation between proprietary SOTA models and open weights.
Quantity State Tracking	Accuracy	0.00	1.00	+1.00
Set State Tracking	Jaccard	Not applicable	0.68	-
Composite tests cause performance to collapse across all models compared to atomic baselines.
Theory of Mind	Accuracy	0.68	0.45	-0.23

Experiment Figures

Radar chart comparing 9 models across 6 memory capability categories.

Performance curves for Stateful Processing (Quantity and Set) as the number of steps increases.

Main Takeaways

High performance on NIAH (retrieval) does not predict performance on state tracking or editing; open-source models excel at search but fail at manipulation.
Models struggle with negative constraints and comparisons (e.g., finding what is *not* present or identifying odd groups out).
Composite tasks involving boundaries between memory compartments are significantly harder than their atomic components, highlighting a 'flattening' issue where models struggle to distinguish context sources.
Increasing context length exacerbates failures in reasoning tasks (like Functional Updates) much faster than in retrieval tasks.

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Large Language Models (LLMs) and context windows
Understanding of 'Needle In A Haystack' (NIAH) benchmarks
Basic concepts of state tracking and set operations

Key Terms

NIAH: Needle In A Haystack—a standard benchmark testing if a model can retrieve a specific piece of information buried in a large context.

Atomic Tests: Tests designed to evaluate individual memory capabilities in isolation, such as searching, recalling, or matching.

Composite Tests: Tests that combine multiple atomic capabilities to simulate complex, real-world scenarios involving boundaries and interactions between memory segments.

Stateful Processing: Tasks requiring the model to track the changing state of an entity (e.g., a number or a set) through a sequence of operations defined in the context.
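A set-valued variant of stateful processing might be generated as in the sketch below. The item vocabulary, the sentence templates, and the name `make_set_tracking_test` are hypothetical choices for illustration, not the benchmark's own.

```python
import random

def make_set_tracking_test(num_steps: int, seed: int):
    """Hypothetical generator: returns (context, question, final_items)."""
    rng = random.Random(seed)
    universe = ["apple", "banana", "cherry", "date", "fig", "grape"]
    state = set()
    lines = ["The basket starts empty."]
    for _ in range(num_steps):
        removable = sorted(state)  # sorted so the choice is reproducible
        if removable and rng.random() < 0.4:
            item = rng.choice(removable)
            state.remove(item)
            lines.append(f"Remove {item} from the basket.")
        else:
            item = rng.choice(universe)
            state.add(item)  # adding an item already present is a no-op
            lines.append(f"Add {item} to the basket.")
    question = "Which items are in the basket at the end?"
    return "\n".join(lines), question, sorted(state)

context, question, final_items = make_set_tracking_test(num_steps=8, seed=2)
print(context)
print(question, "->", final_items)
```

The model's listed items can then be compared with `final_items` using a set-overlap score such as Jaccard similarity, so partially correct answers receive partial credit.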

Theory of Mind: In this specific benchmark, a composite test requiring the model to track knowledge states of different entities and information flow between them.
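One way to picture such a test is a Sally-Anne-style story in which each agent's belief is a separate memory compartment that only updates when that agent witnesses an event. The sketch below is an assumption-laden illustration, not the paper's template: the agents, places, wording, and the function `make_tom_test` are all hypothetical.

```python
import random

def make_tom_test(num_rounds: int, seed: int):
    """Hypothetical generator: returns (context, question, answer).

    Each round, one agent enters or leaves, then the key is moved.
    Only agents in the room update their belief about the key.
    """
    rng = random.Random(seed)
    agents = ["Alice", "Bob", "Carol"]
    places = ["box", "drawer", "shelf"]
    location = rng.choice(places)
    in_room = set(agents)
    beliefs = {agent: location for agent in agents}  # one compartment per agent
    lines = [
        f"{agents[0]}, {agents[1]}, and {agents[2]} are in the room. "
        f"The key is in the {location}."
    ]
    for _ in range(num_rounds):
        agent = rng.choice(agents)
        if agent in in_room:
            in_room.remove(agent)
            lines.append(f"{agent} leaves the room.")
        else:
            in_room.add(agent)
            lines.append(f"{agent} enters the room.")
        location = rng.choice([p for p in places if p != location])
        lines.append(f"The key is moved to the {location}.")
        for witness in in_room:
            beliefs[witness] = location
    target = rng.choice(agents)
    question = f"Where does {target} think the key is?"
    return "\n".join(lines), question, beliefs[target]

context, question, answer = make_tom_test(num_rounds=3, seed=3)
print(context)
print(question, "->", answer)
```

Answering correctly requires keeping each agent's view separate from the true state of the world, which is the kind of boundary between memory segments that composite tests are meant to probe.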

ROUGE-L: Recall-Oriented Understudy for Gisting Evaluation, Longest Common Subsequence variant. A metric that scores the overlap between generated and reference text by the length of their longest common subsequence.

Jaccard Similarity: A statistic used for comparing the similarity and diversity of sample sets (size of intersection divided by size of union).