
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Ze Chen, Tong Arthur Wu, Siru Ouyang, Zihan Wang, Jiaxin Pei, Julian McAuley, Yejin Choi, Alex Pentland
University of California, San Diego; University of Washington; Massachusetts Institute of Technology
arXiv (2026)
Memory Agent Benchmark Reasoning

📝 Paper Summary

Agent Memory Evaluation, Multi-Session Agent Tasks
MemoryArena is a benchmark for evaluating agent memory in multi-session tasks where success depends on retaining and reusing information from prior interactions, revealing that strong recall does not guarantee effective agentic action.
Core Problem
Existing benchmarks evaluate either static memorization (QA/recall) without action, or single-session agent actions where history is just flat context. Neither setup tests whether agents can actively use memory to guide future decisions.
Why it matters:
  • Real-world tasks span multiple sessions where early interactions introduce latent constraints (e.g., compatibility, preferences) that must be preserved for later decisions
  • Current agents achieve near-saturated performance on static memory benchmarks (like LoCoMo) but fail to translate this into effective decision-making in dynamic environments
  • Success on single-session benchmarks like SWE-Bench often relies on short-term working memory rather than persistent long-term retention
Concrete Example: In bundled web shopping, a user first buys a camera body and later wants a lens. A standard agent may treat the lens purchase as a fresh task, fail to recall which camera model was bought earlier, and end up purchasing an incompatible lens.
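The camera-and-lens scenario can be illustrated with a toy sketch. It is not the benchmark's environment or API: the `Memory` class, the catalog, the mount names, and the session functions are all hypothetical, and the "agent" is a trivial rule that picks the first lens passing whatever constraint it knows about. The only point is that the second session's request never restates the camera mount, so the constraint is available only through memory.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Hypothetical cross-session store for facts the agent decides to keep."""
    facts: dict = field(default_factory=dict)


# Hypothetical catalog: each lens fits exactly one camera mount.
LENSES = [
    {"name": "lens_a", "mount": "mount_x"},
    {"name": "lens_b", "mount": "mount_y"},
]


def session_buy_camera(memory, camera):
    """Session 1: buy a camera body; record its mount if memory is available."""
    if memory is not None:
        memory.facts["camera_mount"] = camera["mount"]


def session_buy_lens(memory):
    """Session 2: the request is only 'buy a lens'; the mount is not restated."""
    mount = memory.facts.get("camera_mount") if memory is not None else None
    for lens in LENSES:
        # With no remembered mount, the first lens in the catalog is accepted.
        if mount is None or lens["mount"] == mount:
            return lens
    return None


def is_compatible(camera, lens):
    """Environment-side check: does the purchased lens fit the camera?"""
    return lens is not None and lens["mount"] == camera["mount"]


camera = {"name": "body_1", "mount": "mount_y"}

# Agent without cross-session memory: the lens purchase looks like a new task.
session_buy_camera(None, camera)
lens_without_memory = session_buy_lens(None)

# Agent with memory: the mount recorded in session 1 constrains session 2.
memory = Memory()
session_buy_camera(memory, camera)
lens_with_memory = session_buy_lens(memory)

print(is_compatible(camera, lens_without_memory))
print(is_compatible(camera, lens_with_memory))
```

In this sketch the memoryless run buys `lens_a`, which does not fit, while the run that carried the mount forward buys `lens_b`.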
Key Novelty
Memory-Agent-Environment Loop Evaluation
  • Evaluates memory via interdependent subtasks where later actions are underspecified unless the agent correctly recalls information from prior sessions
  • Introduces four domains (shopping, travel, search, reasoning) that require distilling experience into memory to satisfy progressively accumulating constraints
  • Shifts evaluation from passive 'recall accuracy' to an active 'task completion rate' that depends on memory usage
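The difference between the two kinds of score can be shown with a small sketch. The summary does not give the benchmark's exact metric definitions, so both functions below are assumptions: `recall_accuracy` is plain exact-match over independent questions, and `task_completion_rate` counts a multi-session task as complete only if every one of its subtasks succeeded. The sample data is made up purely for illustration.

```python
def recall_accuracy(answers, gold):
    """Assumed passive metric: fraction of recall questions answered exactly."""
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)


def task_completion_rate(tasks):
    """Assumed active metric: fraction of tasks whose subtasks ALL succeeded.

    `tasks` is a list of tasks; each task is a list of booleans, one per
    subtask (session), in order.
    """
    completed = sum(all(subtasks) for subtasks in tasks)
    return completed / len(tasks)


# Made-up recall results: three of four questions answered correctly.
answers = ["a", "b", "c", "d"]
gold = ["a", "b", "c", "x"]

# Made-up multi-session results: one failed subtask fails the whole task.
tasks = [
    [True, True, True],
    [True, False, False],
    [True, True, False],
    [True, True, True],
]

print(recall_accuracy(answers, gold))   # 0.75
print(task_completion_rate(tasks))      # 0.5
```

Under the all-subtasks assumption, a single forgotten constraint in an early session is enough to fail the entire task, which is one way per-question recall can look strong while completion stays low.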
Evaluation Highlights
  • Agents with near-saturated performance on static memory benchmarks perform poorly on MemoryArena, revealing a significant capability gap
  • Tasks involve long horizons averaging 57 action steps and produce reasoning traces exceeding 40k tokens
  • Current state-of-the-art agents (including RAG and long-context models) exhibit low task completion rates due to failures in maintaining latent task states
Breakthrough Assessment
9/10
Identifies a critical blind spot in current agent evaluation: the gap between passive recall and active memory usage. The interdependent multi-session design effectively simulates realistic long-horizon deployment.