
Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers

Pengfei Du
arXiv (2026)
Memory Agent RAG Benchmark

📝 Paper Summary

Agentic AI Long-term Memory
This survey formalizes agent memory as a write-manage-read loop and unifies diverse approaches into a three-dimensional taxonomy covering temporal scope, storage substrate, and control policy.
Core Problem
Stateless LLM agents cannot persist information across long horizons, leading them to repeat mistakes, rediscover known facts, and fail at tasks requiring continuity.
Why it matters:
  • Without memory, agents cannot learn from experience, causing them to retry actions that previously failed (e.g., crashing a build twice)
  • Standard context windows cannot hold multi-session history or the knowledge an agent accumulates over weeks of operation
  • Current memory designs are fragmented, lacking a unified framework to compare heuristics, vector stores, and learned policies
Concrete Example: A debugging assistant without memory works on a codebase for a week. On Monday, it discovers the directory layout. On Friday, it crashes the build. The next Monday, it has forgotten the layout (wasting time rediscovering it) and retries the exact fix that crashed the build on Friday.
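The scenario above can be sketched in a few lines of Python. This is a minimal illustration, not anything described in the survey: the `PersistentNotes` class, the JSON file layout, and the example fact and fix strings are all hypothetical. It shows only that notes written in one session can stop a later session from rediscovering the layout or retrying a fix that already failed.

```python
import json
import tempfile
from pathlib import Path


class PersistentNotes:
    """Hypothetical cross-session notes file; all names are illustrative."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Load notes left by earlier sessions, or start empty.
        if path.exists():
            self.notes = json.loads(path.read_text())
        else:
            self.notes = {"facts": [], "failed_fixes": []}

    def remember_fact(self, fact: str) -> None:
        self.notes["facts"].append(fact)
        self._save()

    def remember_failure(self, fix: str) -> None:
        self.notes["failed_fixes"].append(fix)
        self._save()

    def already_failed(self, fix: str) -> bool:
        return fix in self.notes["failed_fixes"]

    def _save(self) -> None:
        self.path.write_text(json.dumps(self.notes))


notes_path = Path(tempfile.mkdtemp()) / "agent_notes.json"

# Week 1: the Monday and Friday sessions write what they learned.
week1 = PersistentNotes(notes_path)
week1.remember_fact("source code lives under src/")
week1.remember_failure("upgrade compiler flag")

# Week 2: a fresh session reloads the same file instead of starting blank.
week2 = PersistentNotes(notes_path)
if week2.already_failed("upgrade compiler flag"):
    print("Skipping a fix that crashed the build last week")
```

A stateless agent corresponds to constructing `week2` with a new, empty path: both the layout fact and the failed fix would be gone.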
Key Novelty
Unified 3D Taxonomy of Agent Memory
  • Formalizes memory not as storage but as a 'write-manage-read' loop within a POMDP (Partially Observable Markov Decision Process) cycle
  • Classifies systems by Temporal Scope (working, episodic, semantic, procedural), Substrate (text, vector, structured, executable), and Control Policy (heuristic, prompted, learned)
  • Identifies the 'transition policy' (how episodic records consolidate into semantic rules) as the critical architectural decision
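As a rough illustration of the three points above, the sketch below encodes the taxonomy axes as enums and implements a toy write-manage-read loop. The class and method names, the substring-match read, and the "promote after two identical observations" rule are assumptions made for this example, not the survey's formalization. The promotion rule stands in for the transition policy that moves episodic records into semantic memory.

```python
from dataclasses import dataclass, field
from enum import Enum


class TemporalScope(Enum):
    WORKING = "working"
    EPISODIC = "episodic"
    SEMANTIC = "semantic"
    PROCEDURAL = "procedural"


class Substrate(Enum):
    TEXT = "text"
    VECTOR = "vector"
    STRUCTURED = "structured"
    EXECUTABLE = "executable"


class ControlPolicy(Enum):
    HEURISTIC = "heuristic"
    PROMPTED = "prompted"
    LEARNED = "learned"


@dataclass
class MemoryStore:
    """Hypothetical store: text substrate, heuristic control policy."""
    substrate: Substrate = Substrate.TEXT
    policy: ControlPolicy = ControlPolicy.HEURISTIC
    episodic: list[str] = field(default_factory=list)
    semantic: set[str] = field(default_factory=set)
    promote_after: int = 2  # assumed threshold, chosen for illustration

    def write(self, observation: str) -> None:
        # Write: append the raw observation as an episodic record.
        self.episodic.append(observation)

    def manage(self) -> None:
        # Manage: assumed transition policy that promotes any observation
        # seen at least `promote_after` times to semantic memory.
        for obs in set(self.episodic):
            if self.episodic.count(obs) >= self.promote_after:
                self.semantic.add(obs)
        self.episodic = [o for o in self.episodic if o not in self.semantic]

    def read(self, query: str) -> list[str]:
        # Read: case-insensitive substring match, semantic entries first.
        q = query.lower()
        hits = [m for m in sorted(self.semantic) if q in m.lower()]
        hits += [m for m in self.episodic if q in m.lower()]
        return hits


store = MemoryStore()
for obs in ["tests live in tests/", "build failed after patch A", "tests live in tests/"]:
    store.write(obs)
    store.manage()

print(store.read("tests"))  # the repeated observation, now in semantic memory
```

Swapping the `manage` heuristic for an LLM prompt or a trained policy would move the same store along the control-policy axis without changing the loop itself.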
Evaluation Highlights
  • Voyager agents with procedural memory (skill library) achieve 15.3x faster tech-tree milestone completion compared to agents without it
  • Reflexion agents using verbal self-critique memory achieve 91% pass@1 on HumanEval, compared to 80% for the GPT-4 baseline
  • MemoryArena benchmark shows active memory agents achieve ~80% task completion on interdependent tasks, while long-context baselines drop to ~45%
Breakthrough Assessment
9/10
A definitive survey that brings order to a chaotic field. The formalization of the memory loop and the 3D taxonomy provide a crucial theoretical grounding for future agent research.