
AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, Yuandong Tian, Jishen Zhao
University of California, San Diego, Meta AI
arXiv (2026)
Memory Agent Benchmark RAG

📝 Paper Summary

Agentic Memory, Long-Horizon Evaluation, Memory Retrieval
AMA-Bench evaluates agent memory on machine-generated, causally dependent trajectories. It shows that standard RAG fails because of lossy compression, which motivates AMA-Agent, a new graph-based agent that preserves causality.
Core Problem
Existing memory benchmarks focus on dialogue-centric human-agent interactions, ignoring the machine-generated, symbol-heavy, and causally constrained nature of real-world autonomous agent tasks.
Why it matters:
  • Real-world agents operate in environments like databases and code executors where data is machine-generated (JSON, SQL) rather than natural language
  • Current benchmarks lack causality; in agent tasks, actions constrain future states, but dialogue benchmarks often follow unconstrained linguistic flows
  • Dialogue-centric benchmarks contain redundant 'chit-chat', whereas agent trajectories are dense and objective, making lossy compression techniques harmful
Concrete Example: In a TextWorld game, an agent might pick up a key in step 5 that is needed in step 50. Standard similarity-based RAG might fail to retrieve the 'pickup' action because the query 'open door' doesn't lexically overlap with the raw log of the pickup event, or compression might summarize away the specific key type.
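The retrieval failure in this example can be illustrated with a minimal sketch. It assumes a toy lexical scorer (Jaccard token overlap) standing in for similarity-based retrieval, and a hypothetical three-step trajectory log; the names `tokens`, `lexical_score`, `retrieve`, and the log contents are illustrative and do not come from the paper.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens of a log line or query."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_score(query, doc):
    """Jaccard overlap between query tokens and document tokens."""
    q, d = tokens(query), tokens(doc)
    return len(q & d) / len(q | d) if q | d else 0.0

# Hypothetical trajectory log; step numbers mirror the example above.
trajectory = {
    5: '{"action": "take", "object": "brass_key", "location": "kitchen"}',
    20: '{"action": "go", "direction": "north", "location": "hallway"}',
    50: '{"action": "open", "object": "door", "result": "locked"}',
}

def retrieve(query, log, k=1):
    """Return the k step ids with the highest lexical overlap."""
    ranked = sorted(log, key=lambda step: lexical_score(query, log[step]), reverse=True)
    return ranked[:k]

top = retrieve("open door", trajectory)            # only the door event is returned
pickup_score = lexical_score("open door", trajectory[5])  # no shared tokens with the pickup
```

In this toy setup the pickup event shares no tokens with the query, so it is never surfaced even though it is the step the answer depends on. A dense embedding retriever may behave differently, but the same gap can arise whenever the causal link is not visible in surface similarity.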
Key Novelty
AMA-Bench (Agent Memory with Any length) & AMA-Agent
  • Benchmarks agent memory using two subsets: a 'Real-world' set of expert-annotated logs from domains like SQL/Web, and a 'Synthetic' set enabling infinite scaling of context length via programmatic environments
  • Proposes AMA-Agent, which replaces similarity-based storage with a Causality Graph that preserves state transitions and uses Tool-Augmented Retrieval (keyword + graph search) to handle machine-generated symbols
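The idea of pairing keyword search with a traversal over causal links can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the class `CausalMemory`, its methods, and the explicit `depends_on` annotations are hypothetical, and how AMA-Agent actually builds its graph is not described in this summary.

```python
from collections import deque

class CausalMemory:
    """Toy memory: raw steps as nodes, edges pointing from effect to cause."""

    def __init__(self):
        self.steps = {}   # step id -> raw log text, stored uncompressed
        self.causes = {}  # step id -> set of earlier step ids it depends on

    def add_step(self, step_id, text, depends_on=()):
        self.steps[step_id] = text
        self.causes[step_id] = set(depends_on)

    def keyword_search(self, keyword):
        """Exact, case-insensitive substring match over raw step text."""
        needle = keyword.lower()
        return [s for s, text in sorted(self.steps.items()) if needle in text.lower()]

    def causal_ancestors(self, step_id):
        """Breadth-first walk over cause edges, returning all upstream steps."""
        seen, queue = set(), deque([step_id])
        while queue:
            current = queue.popleft()
            for cause in self.causes.get(current, ()):
                if cause not in seen:
                    seen.add(cause)
                    queue.append(cause)
        return sorted(seen)

    def retrieve(self, keyword):
        """Keyword hits plus every step they causally depend on."""
        hits = self.keyword_search(keyword)
        result = set(hits)
        for hit in hits:
            result.update(self.causal_ancestors(hit))
        return sorted(result)

# Hypothetical trajectory, assumed to be annotated with its causal dependency.
memory = CausalMemory()
memory.add_step(5, "take brass_key from kitchen")
memory.add_step(20, "go north to hallway")
memory.add_step(50, "open door with brass_key", depends_on=[5])
context = memory.retrieve("door")
```

Here the keyword match finds only the door event, and the graph walk adds the earlier pickup it depends on, while the unrelated movement step is left out. Exact substring matching is used because machine-generated symbols such as `brass_key` need to be matched verbatim.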
Architecture
Figure 4: The conceptual framework of the memory system in an agent loop and the capability taxonomy.
Evaluation Highlights
  • AMA-Agent achieves 57.22% average accuracy on AMA-Bench, outperforming the strongest memory system baselines by 11.16%
  • Frontier model GPT-5.2 achieves only 72.26% accuracy on the benchmark, indicating significant room for improvement even for long-context models
  • Existing memory systems (such as MemoRAG and vector RAG) significantly underperform long-context baselines on long-horizon agentic tasks because of lossy compression
Breakthrough Assessment
8/10
Identifies a critical gap (agent vs. dialogue memory) and provides both a comprehensive benchmark and a novel graph-based solution that significantly outperforms RAG baselines.