← Back to Paper List

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu, Yu Wang, Julian McAuley
arXiv.org (2025)
Memory Agent Benchmark RAG QA

📝 Paper Summary

Memory Mechanisms in LLM Agents Agentic Benchmarks
MemoryAgentBench evaluates LLM agents on four core memory competencies—retrieval, learning, understanding, and forgetting—by feeding information incrementally to simulate realistic multi-turn interactions.
Core Problem
Existing benchmarks rely on static, one-shot long contexts that fail to test how agents incrementally accumulate, update, and retrieve information over time.
Why it matters:
  • Real-world agents process context incrementally (stream of interactions) rather than as a single static block
  • Standard long-context evaluations do not assess specific memory skills like selective forgetting (updating old facts) or test-time learning
  • Current benchmarks like LongBench or LooGLE focus on context window capacity, not the abstraction and consolidation processes required for effective agent memory
Concrete Example: In the FactConsolidation task, an agent receives a fact (e.g., a tool's country of origin) and later receives a contradictory update. Static benchmarks might feed both at once or ask for the latest without testing the update mechanism, whereas this benchmark feeds the update sequentially, requiring the agent to overwrite the old memory to answer correctly.
Key Novelty
MemoryAgentBench Framework
  • Identifies four distinct memory competencies: Accurate Retrieval (AR), Test-Time Learning (TTL), Long-Range Understanding (LRU), and Selective Forgetting (SF)
  • Transforms static long-context datasets into incremental multi-turn streams where agents must absorb chunks sequentially before answering questions
  • Introduces two new datasets: EventQA for temporal reasoning in narratives and FactConsolidation for evaluating memory updates and conflict resolution
Architecture
Architecture Figure Figure 1 (implied)
Conceptual diagram of the four memory competencies (Accurate Retrieval, Test-Time Learning, Long-Range Understanding, Selective Forgetting)
Breakthrough Assessment
7/10
Establishes a necessary distinction between 'long context' and 'agent memory,' providing a structured framework and new datasets (EventQA, FactConsolidation) to evaluate the latter, though performance results are not in the provided text.
×