AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

📝 Paper Summary

Agentic Memory · Long-Horizon Evaluation · Memory Retrieval

AMA-Bench evaluates agent memory on machine-generated, causally dependent trajectories and shows that standard RAG fails because of lossy compression. In response, the authors propose AMA-Agent, a graph-based memory agent that preserves causality.

Core Problem

Existing memory benchmarks focus on dialogue-centric human-agent interactions, ignoring the machine-generated, symbol-heavy, and causally constrained nature of real-world autonomous agent tasks.

Why it matters:

Real-world agents operate in environments like databases and code executors where data is machine-generated (JSON, SQL) rather than natural language
Current benchmarks lack causality; in agent tasks, actions constrain future states, but dialogue benchmarks often follow unconstrained linguistic flows
Dialogue-centric benchmarks contain redundant 'chit-chat', whereas agent trajectories are dense and objective, so lossy compression techniques discard information the agent needs

Concrete Example: In a TextWorld game, an agent might pick up a key in step 5 that is needed in step 50. Standard similarity-based RAG might fail to retrieve the 'pickup' action because the query 'open door' doesn't lexically overlap with the raw log of the pickup event, or compression might summarize away the specific key type.
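The lexical-overlap failure described in this example can be illustrated with a minimal sketch. The log-line format, the `tokens` and `jaccard` helpers, and the use of Jaccard overlap as a stand-in for a similarity retriever are all assumptions made for illustration; none of them come from the paper.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Hypothetical raw log lines; the format is illustrative only.
pickup_log = "step=5 action=take(brass_key) obs=inventory+=[brass_key]"
door_log = "step=50 action=open(door) obs=locked"
query = "open door"

pickup_score = jaccard(query, pickup_log)  # no shared tokens with the query
door_score = jaccard(query, door_log)      # shares "open" and "door"
```

Under these assumptions the pickup event scores zero against the query even though it is the step that explains why the door can or cannot be opened, which is the gap a causality-aware memory is meant to close.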

Key Novelty

AMA-Bench (Agent Memory with Any length) & AMA-Agent

Benchmarks agent memory using two subsets: a 'Real-world' set of expert-annotated logs from domains like SQL/Web, and a 'Synthetic' set enabling infinite scaling of context length via programmatic environments
Proposes AMA-Agent, which replaces similarity-based storage with a Causality Graph that preserves state transitions and uses Tool-Augmented Retrieval (keyword + graph search) to handle machine-generated symbols

Architecture

[Figure: The conceptual framework of the memory system in an agent loop and the capability taxonomy.]

Evaluation Highlights

AMA-Agent achieves 57.22% average accuracy on AMA-Bench, outperforming the strongest memory system baseline (46.06%) by 11.16 percentage points
Frontier model GPT-5.2 achieves only 72.26% accuracy on the benchmark, indicating significant room for improvement even for long-context models
Existing memory systems (like MemoRAG and vector RAG) significantly underperform long-context baselines in long-horizon agentic tasks due to lossy compression

Breakthrough Assessment

8/10

Identifies a critical gap (agent vs. dialogue memory) and provides both a comprehensive benchmark and a novel graph-based solution that significantly outperforms RAG baselines.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) defined by tuple M=(S, A, O, P, r), where agents process interaction history h_t to query memory

Inputs: Interaction history h_t = (x, a_1, o_1, ..., o_t) and a query q_t

Outputs: A relevant context c_t retrieved from memory to guide the next action a_t
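The input/output contract above can be written down as a small set of types. This is a sketch under stated assumptions: the names `Step`, `History`, `Memory`, and `FullContextMemory` are hypothetical, and the trivial full-context memory is included only to show one object that satisfies the interface.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class Step:
    """One (action, observation) pair, i.e. a_i and o_i."""
    action: str
    observation: str


@dataclass
class History:
    """Interaction history h_t: the task input x plus the steps so far."""
    task: str
    steps: list[Step] = field(default_factory=list)

    def serialize(self) -> str:
        """Render the history as plain text, one step per line."""
        lines = [f"task: {self.task}"]
        for i, s in enumerate(self.steps, start=1):
            lines.append(f"[{i}] action: {s.action} | observation: {s.observation}")
        return "\n".join(lines)


class Memory(Protocol):
    """Anything that maps (h_t, q_t) to a context string c_t."""
    def retrieve(self, history: History, query: str) -> str: ...


class FullContextMemory:
    """Trivial memory: returns the whole history and ignores the query."""
    def retrieve(self, history: History, query: str) -> str:
        return history.serialize()


history = History(task="escape the house")
history.steps.append(Step("take key", "You pick up the key."))
history.steps.append(Step("go north", "You are in the hallway."))
context = FullContextMemory().retrieve(history, "Where is the key?")
```

A retrieval-based memory would implement the same `retrieve` method but return only a subset of the history, which is where lossy behavior can enter.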

Pipeline Flow

Memory Construction (Build Causality Graph)
Memory Retrieval (Tool-Augmented Search)
Response Generation

System Modules

Causality Graph Builder

Converts linear trajectory into a graph where nodes are (action, observation) pairs and edges represent temporal/causal sequence

Model or implementation: Rule-based graph construction
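A rule-based builder of this kind might look like the sketch below. The summary does not specify the paper's actual rules, so the causal rule used here (link a step to the most recent earlier step that mentions the same underscore-style identifier) is an assumption, as are the names `Node`, `CausalityGraph`, `symbols`, and `build_graph`.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    idx: int
    action: str
    observation: str


@dataclass
class CausalityGraph:
    nodes: list[Node] = field(default_factory=list)
    # source index -> list of (destination index, edge kind)
    edges: dict[int, list[tuple[int, str]]] = field(default_factory=dict)

    def add_edge(self, src: int, dst: int, kind: str) -> None:
        self.edges.setdefault(src, []).append((dst, kind))


def symbols(text: str) -> set[str]:
    """Identifier-like tokens containing an underscore, e.g. brass_key."""
    return set(re.findall(r"[A-Za-z]+_[A-Za-z0-9_]+", text))


def build_graph(trajectory: list[tuple[str, str]]) -> CausalityGraph:
    """Build temporal edges between consecutive steps and causal edges
    between steps that share a symbol (hypothetical rule)."""
    graph = CausalityGraph()
    last_seen: dict[str, int] = {}
    for i, (action, obs) in enumerate(trajectory):
        graph.nodes.append(Node(i, action, obs))
        if i > 0:
            graph.add_edge(i - 1, i, "temporal")
        step_symbols = symbols(f"{action} {obs}")
        # One causal edge per distinct earlier step, even if several symbols match.
        sources = {last_seen[s] for s in step_symbols if s in last_seen}
        for src in sorted(sources):
            graph.add_edge(src, i, "causal")
        for s in step_symbols:
            last_seen[s] = i
    return graph


trajectory = [
    ("take(brass_key)", "inventory: brass_key"),
    ("go(north)", "you are in the hallway"),
    ("unlock(oak_door, brass_key)", "oak_door is now unlocked"),
]
graph = build_graph(trajectory)
```

In this toy trajectory the unlock step is connected directly to the pickup step, so the dependency survives no matter how many unrelated steps sit between them.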

Tool-Augmented Retriever

Retrieves relevant nodes using a mix of graph traversal and keyword search

Model or implementation: Hybrid (Graph traversal + BM25/Regex tools)
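The hybrid retrieval idea can be sketched as a regex search that seeds a bounded backward traversal over graph edges. This is a minimal illustration, not the paper's implementation: the function names, the predecessor-map representation, and the `max_hops` parameter are assumptions, and only the regex tool is shown (BM25 ranking is omitted).

```python
import re
from collections import deque


def grep_nodes(texts: list[str], pattern: str) -> list[int]:
    """Indices of nodes whose text matches the regex (exact symbolic search)."""
    rx = re.compile(pattern)
    return [i for i, t in enumerate(texts) if rx.search(t)]


def expand_backward(preds: dict[int, list[int]], seeds: list[int], max_hops: int) -> set[int]:
    """Breadth-first walk over predecessor edges, up to max_hops from any seed."""
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for p in preds.get(node, []):
            if p not in seen:
                seen.add(p)
                queue.append((p, depth + 1))
    return seen


def retrieve(texts: list[str], preds: dict[int, list[int]], pattern: str, max_hops: int = 1) -> list[str]:
    """Keyword hits plus their causal predecessors, in trajectory order."""
    hits = grep_nodes(texts, pattern)
    keep = expand_backward(preds, hits, max_hops)
    return [texts[i] for i in sorted(keep)]


# Hypothetical node texts and causal predecessor map (node 2 depends on node 0).
texts = [
    "take(brass_key) -> inventory: brass_key",
    "go(north) -> hallway",
    "unlock(oak_door, brass_key) -> oak_door unlocked",
]
preds = {2: [0]}
result = retrieve(texts, preds, r"oak_door")
```

The keyword search finds the exact symbol `oak_door`, and the graph step then pulls in the earlier pickup event that a similarity ranker would have no lexical reason to return for that query.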

Agent Policy

Generates the answer or next action based on retrieved context

Model or implementation: GPT-4o (or other LLM backbone)

Novel Architectural Elements

Causality Graph storage structure specifically for agent trajectories (preserving sequential state dependencies rather than chunking text)
Hybrid retrieval combining structural graph navigation with explicit keyword search tools (addressing the 'machine-generated representation' gap)

Modeling

Base Model: GPT-4o (used as the backbone for the AMA-Agent and evaluations)

Compute: Not reported in the paper

Comparison to Prior Work

vs. GraphRAG: Uses a trajectory-based Causality Graph rather than an entity-based knowledge graph, preserving temporal state transitions
vs. MemoRAG: Avoids lossy compression/summarization of memory, which fails on dense machine-generated logs
vs. similarity-based RAG (BM25/embedding): Uses tool-augmented retrieval (exact match/symbolic search) instead of relying on similarity ranking alone, which fails on code/JSON symbols
vs. LangChain [not cited in paper]: Focuses specifically on long-horizon causal dependencies in agent logs rather than generic document chaining

Limitations

Real-world environments are treated as black boxes, limiting access to ground-truth backend states for some evaluations
Synthetic subset relies on specific domains (TextWorld, BabyAI) which may not fully capture all real-world agent complexities
Evaluation focuses on QA performance on logs rather than live execution performance improvement (though QA is a proxy)

Reproducibility

Code: https://github.com/S-Agents/AMA-Bench

Publicly available: Code, Dataset, and Leaderboard (https://github.com/S-Agents/AMA-Bench). The paper describes the synthetic environment generation parameters (BabyAI, TextWorld) in detail.

📊 Experiments & Results

Evaluation Setup

QA over long-horizon agent interaction logs

Benchmarks:

AMA-Bench (Real-world) (QA on logs from Web, Text2SQL, Software Engineering, Gaming, Embodied AI) [New]
AMA-Bench (Synthetic) (QA on logs from TextWorld and BabyAI with controllable length) [New]

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Performance on AMA-Bench showing AMA-Agent superiority over existing memory systems.
Benchmark	Metric	Baseline	This Paper	Δ (pp)
AMA-Bench (Average)	Accuracy (%)	46.06	57.22	+11.16
AMA-Bench (Average)	Accuracy (%)	45.13	57.22	+12.09
AMA-Bench (Average)	Accuracy (%)	34.50	57.22	+22.72
Comparison against long-context models, showing memory systems still lag behind full-context processing.
Benchmark	Metric	Baseline	This Paper	Δ (pp)
AMA-Bench (Average)	Accuracy (%)	69.15	57.22	-11.93
AMA-Bench (Average)	Accuracy (%)	72.26	57.22	-15.04
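The Δ column appears to be the absolute difference between the two accuracy columns, i.e. a gap in percentage points rather than a relative change. The short check below recomputes it from the rows of the table; the variable names are illustrative, and the relative-gain figure is derived here rather than reported in the summary.

```python
from decimal import Decimal

ama_agent = Decimal("57.22")

# (baseline accuracy, reported delta) copied from the table rows.
rows = [
    (Decimal("46.06"), Decimal("11.16")),
    (Decimal("45.13"), Decimal("12.09")),
    (Decimal("34.50"), Decimal("22.72")),
    (Decimal("69.15"), Decimal("-11.93")),
    (Decimal("72.26"), Decimal("-15.04")),
]

# Absolute differences in percentage points.
deltas = [ama_agent - baseline for baseline, _ in rows]
consistent = all(d == reported for d, (_, reported) in zip(deltas, rows))

# Relative gain over the strongest memory-system baseline, in percent.
strongest = Decimal("46.06")
relative_gain = (ama_agent - strongest) / strongest * 100
```

All five reported deltas match the recomputed differences. Expressed relatively, the 11.16-point gap over the strongest memory baseline corresponds to a gain of roughly 24%.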

Experiment Figures

[Figure: Radar chart comparing average accuracy of different memory systems (AMA-Agent, MemoRAG, RAG, etc.) and long-context models on AMA-Bench.]

Main Takeaways

Existing memory systems (RAG, MemoRAG, MemGPT) significantly underperform simple long-context baselines on agentic tasks, unlike in dialogue tasks where they often excel.
Lossy compression and similarity-based retrieval are primary bottlenecks because agent logs contain dense, objective machine-generated symbols where 'gist' summarization discards critical details.
AMA-Agent's Causality Graph and Tool-Augmented Retrieval narrow the gap, outperforming the strongest memory baseline by about 11 percentage points, though still trailing full-context models by 12 to 15 points.
The frontier commercial model GPT-5.2 achieves only about 72% accuracy, indicating that long-horizon agent memory remains an unsolved challenge.

📚 Prerequisite Knowledge

Prerequisites

Markov Decision Processes (MDP/POMDP)
Retrieval-Augmented Generation (RAG)
Agentic workflows (Reason and Act)

Key Terms

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents

POMDP: Partially Observable Markov Decision Process—a mathematical framework for modeling decision-making where the agent cannot directly see the full state of the world

NIAH: Needle-In-A-Haystack—an evaluation method where a specific fact (needle) is hidden in a large amount of unrelated text (haystack) to test retrieval

Causality Graph: A data structure proposed in AMA-Agent that represents interaction history as nodes (actions/observations) linked by directed edges representing temporal and causal dependencies

Tool-Augmented Retrieval: A hybrid retrieval mechanism in AMA-Agent that combines graph traversal with keyword-based search tools (like grep) to find exact matches in machine-generated text

TextWorld: A text-based game generator used for synthetic environments where agents interact via text commands

BabyAI: A grid-world environment used for testing agents on instruction-following tasks

BM25: A ranking function used by search engines to estimate the relevance of documents to a given search query
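For readers unfamiliar with BM25, the following is a compact sketch of the standard Okapi BM25 scoring function over pre-tokenized documents. The function name, the default values `k1=1.5` and `b=0.75`, and the toy corpus are choices made for illustration; the summary does not say which BM25 variant or parameters the paper's baselines use. The sketch assumes a non-empty corpus.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus of tokenized log lines (hypothetical).
docs = [
    ["open", "door"],
    ["take", "brass", "key"],
    ["door", "locked"],
]
scores = bm25_scores(["door"], docs)
```

Documents that never mention a query term receive a score of zero, which is why purely lexical ranking cannot surface a causally relevant step that shares no tokens with the query.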