
Evaluating Memory Structure in LLM Agents

Alina Shutova, Alexandra Olenina, Ivan Vinogradov, Anton Sinitsin
Yandex Research
arXiv (2026)
Memory Agent Benchmark RAG

📝 Paper Summary

Memory organization Agentic RAG pipeline
StructMemEval is a benchmark demonstrating that while LLM agents can organize memory into complex structures like trees and ledgers, they typically fail to do so unless explicitly prompted with structural hints.
Core Problem
Existing long-term memory benchmarks focus on simple factual recall, which basic retrieval can solve, so they do not test whether agents can organize knowledge into the hierarchies or other structures a task requires.
Why it matters:
  • Complex tasks like genealogy or financial accounting require maintaining specific data structures (trees, ledgers) rather than just retrieving isolated past messages
  • Simple retrieval systems (RAG) often pull messages out of context and so fail to track state changes (e.g., location updates) over long histories
  • Current evaluations do not distinguish between an agent's ability to recall a fact vs. its ability to maintain a coherent knowledge base over time
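To make the contrast with isolated message retrieval concrete, here is a minimal sketch of the two structures named above, a tree for genealogy and a ledger for accounting. The class names, methods, and sample people are hypothetical illustrations, not part of StructMemEval or any memory agent's API.

```python
from collections import defaultdict


class FamilyTree:
    """Hypothetical tree memory: stores parent links, answers ancestor queries."""

    def __init__(self):
        self.parents = defaultdict(set)  # child -> set of known parents

    def add_parent(self, child, parent):
        self.parents[child].add(parent)

    def ancestors(self, person):
        # Walk parent links upward; this multi-hop answer is not stored
        # in any single message.
        seen = set()
        stack = [person]
        while stack:
            current = stack.pop()
            for parent in self.parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


class Ledger:
    """Hypothetical ledger memory: keeps running balances in cents."""

    def __init__(self):
        self.balances = defaultdict(int)  # account -> balance in cents

    def record(self, payer, payee, cents):
        self.balances[payer] -= cents
        self.balances[payee] += cents


tree = FamilyTree()
tree.add_parent("Dana", "Erin")
tree.add_parent("Erin", "Frank")
dana_ancestors = tree.ancestors("Dana")

ledger = Ledger()
ledger.record("Ana", "Ben", 500)
ledger.record("Ben", "Ana", 200)
```

In both cases the answer (Dana's ancestors, Ana's net balance) depends on combining several past messages, which is why retrieving any one of them in isolation is not enough.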
Concrete Example: In a state-tracking scenario, a user interacts with neighbors, moves to a new city, and then interacts with new people. Asked about 'neighbors', a simple retrieval system returns old, irrelevant interactions from the first city because it ignores the state change (the move).
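The failure described in this example can be sketched in a few lines. The event log, city names, and both functions below are hypothetical stand-ins: `flat_retrieval` is a toy substitute for a real retriever, and `StateTrackingMemory` is one possible structure, not the benchmark's reference solution.

```python
from collections import defaultdict

# Hypothetical event log: (kind, value) pairs standing in for chat messages.
events = [
    ("move", "Springfield"),
    ("neighbor", "Alice"),
    ("neighbor", "Carol"),
    ("move", "Riverton"),
    ("neighbor", "Bob"),
]


def flat_retrieval(events, kind):
    """Toy retriever: return every past item on the query topic, ignoring order."""
    return [value for k, value in events if k == kind]


class StateTrackingMemory:
    """Groups neighbors by city and remembers which city is current."""

    def __init__(self):
        self.current_city = None
        self.neighbors_by_city = defaultdict(list)

    def update(self, kind, value):
        if kind == "move":
            self.current_city = value  # state change: later neighbors belong here
        elif kind == "neighbor":
            self.neighbors_by_city[self.current_city].append(value)

    def current_neighbors(self):
        return list(self.neighbors_by_city[self.current_city])


memory = StateTrackingMemory()
for kind, value in events:
    memory.update(kind, value)

retrieved = flat_retrieval(events, "neighbor")  # mixes both cities
tracked = memory.current_neighbors()            # only the current city
```

The retriever returns neighbors from both cities because every match is topically relevant, while the structured memory answers from the current state only.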
Key Novelty
StructMemEval: A Structure-Centric Memory Benchmark
  • Evaluates agents on tasks requiring specific memory organizations (Trees, State Tracking, Counting) rather than just unstructured retrieval
  • Introduces 'memory organization hints' (explicit prompts describing how to structure data) to distinguish an agent's inability to use memory tools from its failure to recognize the need for structure
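As a rough sketch of how the hinted and unhinted conditions might differ, the snippet below builds a system prompt with and without a hint. The instruction text, the hint wording, and `build_system_prompt` are all assumptions made for illustration; the benchmark's actual prompts are not reproduced here.

```python
# Illustrative wording only; not taken from the benchmark.
BASE_INSTRUCTIONS = (
    "You are an assistant with a persistent memory. "
    "Store what you need to answer later questions."
)
TREE_HINT = (
    "Organize memory as a family tree: one entry per person, "
    "listing their parents and children."
)


def build_system_prompt(hint=None):
    """Return the base instructions, optionally followed by an organization hint."""
    parts = [BASE_INSTRUCTIONS]
    if hint:
        parts.append("Memory organization hint: " + hint)
    return "\n\n".join(parts)


prompt_without_hint = build_system_prompt()
prompt_with_hint = build_system_prompt(TREE_HINT)
```

Running the same agent and the same memory tools under both prompts isolates one variable: if accuracy rises only with the hint, the agent could use the tools but did not recognize on its own that the task needed that structure.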
Evaluation Highlights
  • Memory agents (Mem0, Mem-agent) significantly outperform simple retrieval-augmented LLMs on structural tasks when provided with organization hints
  • Retrieval-augmented baselines fail on state-tracking tasks because they retrieve semantically relevant but factually outdated information (e.g., pre-move neighbors)
  • A significant performance gap exists between memory agents run with hints and the same agents run without them, indicating that modern LLMs struggle to identify the necessary memory structures on their own
Breakthrough Assessment
7/10
Identifies a crucial blind spot in current agent evaluation (structure vs. recall) and provides a diagnostic tool (hints) to isolate failure modes. However, the contribution is primarily diagnostic; it does not propose a new model or method.