Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory

📝 Paper Summary

Memory organization, Self-evolving, Agentic reasoning

Dynamic Cheatsheet improves LLM performance on recurring tasks by maintaining a self-curated, evolving textual memory of strategies and code snippets without modifying model parameters.

Core Problem

LLMs typically process queries in isolation, resetting their context for each new problem, which causes them to repeatedly re-derive solutions or repeat the same mistakes instead of learning from experience.

Why it matters:

Models fail to carry over successful strategies (like efficient code scripts) to subsequent similar problems, leading to stagnant performance
Naive approaches like appending full conversation history result in context ballooning and noise, distracting the model rather than helping it
Current methods to fix this require expensive fine-tuning or static retrieval, which lack the flexibility to adapt on the fly to new test distributions

Concrete Example: In the 'Game of 24' puzzle, GPT-4o initially reaches only 10% accuracy by manually guessing arithmetic combinations. Once it discovers a Python brute-force script, DC stores the snippet in memory. For all subsequent queries, the model retrieves and executes the code, raising accuracy to 99%, a level that isolated inference never reaches.
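
To make the example concrete, here is a minimal sketch of the kind of brute-force solver such a memory entry could contain. It is not the script the model actually produced; the function name `solve_24` and the choice of exact `Fraction` arithmetic are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import permutations, product
import operator

# The four arithmetic operators allowed in the puzzle.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def solve_24(numbers, target=24):
    """Return one expression using each number exactly once that equals target, or None."""
    # Fractions keep division exact, so cases like 8 / (3 - 8 / 3) are not lost to rounding.
    for a, b, c, d in permutations([Fraction(n) for n in numbers]):
        for s1, s2, s3 in product(OPS, repeat=3):
            f1, f2, f3 = OPS[s1], OPS[s2], OPS[s3]
            # The five ways to parenthesize four operands.
            shapes = [
                (lambda: f3(f2(f1(a, b), c), d), f"(({a} {s1} {b}) {s2} {c}) {s3} {d}"),
                (lambda: f3(f1(a, f2(b, c)), d), f"({a} {s1} ({b} {s2} {c})) {s3} {d}"),
                (lambda: f2(f1(a, b), f3(c, d)), f"({a} {s1} {b}) {s2} ({c} {s3} {d})"),
                (lambda: f1(a, f3(f2(b, c), d)), f"{a} {s1} (({b} {s2} {c}) {s3} {d})"),
                (lambda: f1(a, f2(b, f3(c, d))), f"{a} {s1} ({b} {s2} ({c} {s3} {d}))"),
            ]
            for evaluate, text in shapes:
                try:
                    if evaluate() == target:
                        return text
                except ZeroDivisionError:
                    continue  # skip expressions that divide by zero
    return None


if __name__ == "__main__":
    print(solve_24([4, 7, 8, 8]))
    print(solve_24([1, 1, 1, 1]))  # no solution exists, so this prints None
```

Because the search space is tiny (24 orderings, 64 operator choices, 5 groupings), exhaustive search is reliable where manual guessing is not, which is consistent with the jump in accuracy described above.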

Key Novelty

Dynamic Cheatsheet (DC)

Treats the LLM's context as a mutable 'cheatsheet' that is explicitly curated after every interaction
Introduces a 'Curator' step that summarizes successful strategies and removes failed ones, preventing context overflow while retaining high-value heuristics
Enables 'Test-Time Learning' where a frozen model effectively gets smarter over a sequence of tasks by refining its external memory buffer

Architecture

The workflow of the Dynamic Cheatsheet (DC-Cu) framework showing the interaction between the Generator and the Curator.

Evaluation Highlights

Accuracy on Game of 24 (GPT-4o) increased from 10% to 99% by retaining a Python solver script
Claude 3.5 Sonnet's accuracy on AIME 2024 math exams more than doubled (23% to 50%) by accumulating algebraic insights
Achieved a +9% improvement on GPQA-Diamond (science QA) with Claude 3.5 Sonnet by recalling domain-specific formulas and facts

Breakthrough Assessment

8/10

Demonstrates large gains on reasoning tasks (for example, 10% to 99% on Game of 24) without parameter updates. The shift from 'stateless inference' to 'stateful, curated memory' is a significant practical advance for deploying agents.

⚙️ Technical Details

Problem Definition

Setting: Online test-time learning over a sequence of inputs

Inputs: A sequence of queries x_1, x_2, ..., x_n sampled from distribution D_test

Outputs: A sequence of answers y_1, y_2, ..., y_n generated sequentially, conditioning on evolving memory M
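
One way to write this setting as update rules is shown below, for the variant that curates memory after each answer. The symbols Gen and Cur (for the Generator and Curator) and the initial memory M_1 are notation assumed here for illustration and may differ from the paper's.

```latex
\begin{aligned}
y_i     &= \mathrm{Gen}(x_i, M_i), \\
M_{i+1} &= \mathrm{Cur}(M_i, x_i, y_i), \qquad i = 1, \dots, n,
\end{aligned}
```

Here M_1 is the starting memory (possibly empty). No model parameters appear on the left-hand side of either rule: only the textual memory M changes from one query to the next.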

Pipeline Flow

Retrieval (Optional, DC-RS only): Fetch relevant past examples
Curation (Pre-Generation, DC-RS only): Update memory with retrieved context
Generation: Produce answer using current memory
Curation (Post-Generation, DC-Cu): Update memory with new insight
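
The generate-then-curate ordering of the DC-Cu variant can be sketched as a simple loop. This is an illustrative skeleton, not the released implementation: `run_dc_cu`, `toy_generate`, and `toy_curate` are hypothetical names, and the toy functions stand in for LLM calls.

```python
from typing import Callable, List, Tuple

Generator = Callable[[str, str], str]      # (query, memory) -> answer
Curator = Callable[[str, str, str], str]   # (memory, query, answer) -> updated memory


def run_dc_cu(queries: List[str], generate: Generator, curate: Curator,
              memory: str = "") -> Tuple[List[str], str]:
    """Answer queries in order, rewriting the memory after each answer."""
    answers = []
    for query in queries:
        answer = generate(query, memory)        # Generation: use current memory
        memory = curate(memory, query, answer)  # Post-generation curation
        answers.append(answer)
    return answers, memory


# Toy stand-ins for the two LLM calls (assumptions for demonstration only).
def toy_generate(query: str, memory: str) -> str:
    return "solved with script" if "brute-force script" in memory else "manual guess"


def toy_curate(memory: str, query: str, answer: str) -> str:
    note = "Tip: reuse the brute-force script."
    # Keep the memory compact: add the note once rather than appending every turn.
    return memory if note in memory else (memory + "\n" + note).strip()


if __name__ == "__main__":
    answers, final_memory = run_dc_cu(["q1", "q2", "q3"], toy_generate, toy_curate)
    print(answers)
    print(final_memory)
```

The first query is answered with an empty memory; every later query benefits from what the curator kept, which is the behavior the pipeline is designed to produce.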

System Modules

Retriever

Identify top-k most similar past input-output pairs to the current query (used in DC-RS variant)

Model or implementation: text-embedding-3-small (OpenAI)
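
A minimal sketch of top-k retrieval over stored input-output pairs follows. It assumes embeddings have already been computed (for instance by the embedding model named above) and uses cosine similarity as the similarity measure, which is an assumption rather than something stated in this summary. The names `cosine` and `top_k_similar` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# One history entry: (embedding of the past input, past input text, past output text).
Entry = Tuple[Sequence[float], str, str]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(query_vec: Sequence[float], history: List[Entry],
                  k: int = 3) -> List[Tuple[str, str]]:
    """Return the k past (input, output) pairs most similar to the query."""
    ranked = sorted(history, key=lambda entry: cosine(query_vec, entry[0]), reverse=True)
    return [(past_input, past_output) for _, past_input, past_output in ranked[:k]]


if __name__ == "__main__":
    history = [
        ([1.0, 0.0], "q1", "a1"),
        ([0.0, 1.0], "q2", "a2"),
        ([0.9, 0.1], "q3", "a3"),
    ]
    print(top_k_similar([1.0, 0.0], history, k=2))
```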

Curator

Synthesize new observations into the memory, refine existing entries, and remove obsolete heuristics to keep memory compact

Model or implementation: Same as Generator (e.g., GPT-4o or Claude 3.5 Sonnet)

Generator

Generate the solution to the current problem conditioning on the curated memory

Model or implementation: GPT-4o or Claude 3.5 Sonnet

Novel Architectural Elements

Explicit 'Curator' module in the inference loop that rewrites the prompt context (memory) dynamically
Two-variant workflow: DC-Cu (Cumulative update after answer) vs DC-RS (Retrieval and Synthesis before answer)

Modeling

Base Model: Claude 3.5 Sonnet and GPT-4o (main experiments); GPT-4o-mini and Claude 3.5 Haiku (ablations)

Comparison to Prior Work

vs. Dynamic Evaluation: DC updates a textual buffer, not weights, making it compatible with black-box APIs
vs. RAG: DC's memory is self-generated and evolves online, rather than being a static external database
vs. Full-History Appending (Context Window): DC actively prunes and synthesizes information to prevent context ballooning and distraction
vs. Reflexion [not cited in paper]: Reflexion focuses on correcting a single problem via self-reflection loops; DC focuses on carrying over learned strategies to *future* problems in a sequence

Limitations

Relies on the model's ability to self-verify; if the model hallucinates a wrong 'successful' strategy, it poisons the memory
Smaller models (e.g., GPT-4o-mini) struggle to effectively curate memory, showing limited or no gains
Performance depends on the task sequence; requires recurring patterns or transferable heuristics to be effective

Reproducibility

Code: http://github.com/suzgunmirac/dynamic-cheatsheet

The paper uses closed-source API models (GPT-4o, Claude 3.5 Sonnet), so exact reproduction depends on those APIs remaining available and unchanged. All data splits (AIME 2024/2025, Game of 24) are standard or released.

📊 Experiments & Results

Evaluation Setup

Sequential problem solving where the model processes questions one by one and builds memory

Benchmarks:

AIME 2024 / 2025 (Challenging Mathematics: Algebra, Combinatorics)
Game of 24 (Algorithmic / Arithmetic Search)
GPQA-Diamond (Graduate-Level Science QA)
Math Equation Balancer (Arithmetic operator placement) [New]

Metrics:

Accuracy (Functionally Correct for Math/Code tasks)
Soft Match (for Multiple Choice QA)
Statistical methodology: Not explicitly reported in the paper
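
As an illustration of what a "functionally correct" check could look like for Game of 24, the sketch below accepts any expression that uses each given number exactly once and evaluates to 24, regardless of how it is written. This is one interpretation of the metric, not the paper's evaluation code; `is_functionally_correct` is a hypothetical name.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction

# Only the four basic binary operators are accepted.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _evaluate(node):
    """Return (exact value, list of integer literals used) for a parsed expression."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if (isinstance(node, ast.Constant) and isinstance(node.value, int)
            and not isinstance(node.value, bool)):
        return Fraction(node.value), [node.value]
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left, left_nums = _evaluate(node.left)
        right, right_nums = _evaluate(node.right)
        return _OPS[type(node.op)](left, right), left_nums + right_nums
    raise ValueError("unsupported syntax")


def is_functionally_correct(expr: str, numbers, target: int = 24) -> bool:
    """True if expr uses exactly the given numbers and evaluates to target."""
    try:
        value, used = _evaluate(ast.parse(expr, mode="eval"))
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False
    return value == target and Counter(used) == Counter(numbers)


if __name__ == "__main__":
    print(is_functionally_correct("(7 - 8 / 8) * 4", [4, 7, 8, 8]))
    print(is_functionally_correct("8 * 3", [4, 7, 8, 8]))
```

Parsing with `ast` instead of calling `eval` means that anything other than integer literals and the four operators is rejected rather than executed.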

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ
Math and Algorithmic Reasoning: DC enables models to discover and reuse code-based strategies, leading to near-perfect scores on algorithmic tasks and massive gains on competition math.
Game of 24	Accuracy	10	99	+89
AIME 2024	Accuracy	23	50	+27
AIME 2025	Accuracy	Not explicitly reported in the paper	Not explicitly reported in the paper	+30
Math Equation Balancer	Accuracy	45	100	+55
Knowledge-Intensive Tasks: DC improves performance by recalling specific domain knowledge and formulas.
GPQA-Diamond	Accuracy	Not explicitly reported in the paper	Not explicitly reported in the paper	+9
MMLU-Pro (Eng/Physics)	Accuracy	Not explicitly reported in the paper	Not explicitly reported in the paper	+8
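
The Δ column for the rows with reported scores can be checked with simple arithmetic. The short sketch below recomputes the absolute gain and the relative ratio from the baseline and final accuracies listed in the table; the helper name `gains` is hypothetical, and rows without reported scores are left out.

```python
# (baseline accuracy, accuracy with DC), copied from the rows above that report both.
reported = {
    "Game of 24": (10, 99),
    "AIME 2024": (23, 50),
    "Math Equation Balancer": (45, 100),
}


def gains(baseline, score):
    """Return (absolute gain in points, ratio of final score to baseline)."""
    return score - baseline, score / baseline


if __name__ == "__main__":
    for name, (baseline, score) in reported.items():
        delta, ratio = gains(baseline, score)
        print(f"{name}: +{delta} points, {ratio:.2f}x baseline")
```

For AIME 2024 the ratio is above 2, which matches the "more than doubled" description in the evaluation highlights.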

Experiment Figures

Bar chart comparing Baselines, DC-Empty, and DC variants (DC-Cu, DC-RS) across AIME and Game of 24 benchmarks.

Main Takeaways

Test-time learning is highly effective for 'insight-based' tasks (like Game of 24) where a single strategy (using Python) solves the whole benchmark.
DC is superior to naive full-history appending because it actively synthesizes and prunes information, preventing noise accumulation.
The method is 'rich-get-richer': capable models (Claude 3.5 Sonnet, GPT-4o) improve significantly, while smaller models (GPT-4o-mini) fail to generate the initial good solutions needed to populate the memory.
Improvements are observed across both reasoning-heavy (AIME) and knowledge-heavy (GPQA) domains.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and In-Context Learning (ICL)
Familiarity with Retrieval-Augmented Generation (RAG)
Basic knowledge of online learning concepts

Key Terms

DC: Dynamic Cheatsheet—the proposed framework for maintaining an evolving textual memory during inference

Test-time learning: Improving model performance on a test set during inference, without updating model weights via backpropagation

AIME: American Invitational Mathematics Examination—a challenging high-school math competition benchmark

GPQA-Diamond: A graduate-level science QA benchmark (Google-Proof Q&A) heavily validated by experts

MMLU-Pro: A more difficult version of the Massive Multitask Language Understanding benchmark

Zero-shot CoT: Zero-shot Chain-of-Thought—prompting a model to 'think step by step' without providing examples

Context ballooning: The issue where the amount of text in the model's prompt grows indefinitely, exceeding limits or adding noise