MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

📝 Paper Summary

Agent Memory Evaluation, Multi-Session Agent Tasks

MemoryArena is a benchmark for evaluating agent memory in multi-session tasks where success depends on retaining and reusing information from prior interactions, revealing that strong recall does not guarantee effective agentic action.

Core Problem

Existing benchmarks evaluate either static memorization (QA/recall) without action, or single-session agent actions where history is just flat context. Neither tests whether agents can actively use memory to guide future decisions.

Why it matters:

Real-world tasks span multiple sessions where early interactions introduce latent constraints (e.g., compatibility, preferences) that must be preserved for later decisions
Current agents achieve near-saturated performance on static memory benchmarks (like LoCoMo) but fail to translate this into effective decision-making in dynamic environments
Success on single-session benchmarks like SWE-Bench often relies on short-term working memory rather than persistent long-term retention

Concrete Example: In bundled web shopping, a user first buys a camera body. Later, they want a lens. A standard agent might treat the lens purchase as a new task, failing to recall the specific camera model bought earlier, resulting in the purchase of an incompatible lens.

Key Novelty

Memory-Agent-Environment Loop Evaluation

evaluates memory via interdependent subtasks where later actions are underspecified unless the agent correctly recalls information from prior sessions
introduces four domains (shopping, travel, search, reasoning) requiring the distillation of experience into memory to solve progressive constraints
shifts evaluation from passive 'recall accuracy' to active 'task completion rate' dependent on memory usage

Evaluation Highlights

Agents with near-saturated performance on static memory benchmarks perform poorly on MemoryArena, revealing a significant capability gap
Tasks involve long horizons averaging 57 action steps and produce reasoning traces exceeding 40k tokens
Current state-of-the-art agents (including RAG and long-context models) exhibit low task completion rates due to failures in maintaining latent task states

Breakthrough Assessment

9/10

Identifies a critical blind spot in current agent evaluation: the gap between passive recall and active memory usage. The interdependent multi-session design effectively simulates realistic long-horizon deployment.

⚙️ Technical Details

Problem Definition

Setting: Multi-session agentic tasks in which subtasks s_1, ..., s_n are executed sequentially, and each subtask s_i depends on information acquired in s_1, ..., s_{i-1}

Inputs: A sequence of instructions for subtasks, where later instructions are underspecified without context from previous sessions

Outputs: Actions (e.g., web clicks, search queries, reasoning steps) culminating in task completion
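
As a minimal sketch of this setting, the dependency structure between subtasks can be modelled with a small data structure. The names Subtask, required_facts, produced_facts, and dependencies_satisfiable are illustrative assumptions; they are not taken from the paper or the benchmark code.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    """One session's subtask (hypothetical representation)."""
    instruction: str
    # Facts from earlier sessions that this subtask needs but does not restate.
    required_facts: set = field(default_factory=set)
    # Facts that become available to later sessions once this subtask is done.
    produced_facts: set = field(default_factory=set)


def dependencies_satisfiable(subtasks):
    """Return True if every s_i needs only facts produced by s_1..s_{i-1}."""
    available = set()
    for subtask in subtasks:
        if not subtask.required_facts <= available:
            return False
        available |= subtask.produced_facts
    return True


# Example mirroring the bundled-shopping scenario: the lens instruction
# does not name the camera model, so it depends on the first session.
task = [
    Subtask("Buy a mirrorless camera body.", produced_facts={"camera_model"}),
    Subtask("Buy a lens for the camera.", required_facts={"camera_model"}),
]
print(dependencies_satisfiable(task))
```

Running the same check on the subtasks in reverse order fails, which is one way to see that the ordering of sessions carries information the agent must retain.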

Pipeline Flow

Environment Initialization (Task Instruction)
Agent Action Loop (Session 1)
Memory Update
Agent Action Loop (Session 2+)
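
The flow above can be sketched as a short loop. This is a toy illustration under strong assumptions: each session is collapsed to a single action (the benchmark's sessions involve many steps), and run_sessions, act, env_step, keep_memory, and drop_memory are hypothetical names, not the benchmark's API.

```python
def run_sessions(instructions, act, env_step, update_memory):
    """Run sessions in order, carrying memory from one session to the next."""
    memory = []
    results = []
    for instruction in instructions:
        action = act(instruction, memory)                   # agent action
        observation, success = env_step(instruction, action)  # environment feedback
        memory = update_memory(memory, observation)         # memory update
        results.append(success)
    return results


def env_step(instruction, action):
    """Toy shopping environment: returns (observation, success)."""
    if instruction == "buy camera":
        model = action.split(":", 1)[1]
        return "purchased camera " + model, True
    if instruction == "buy lens":
        # Only the lens matching the earlier camera counts as compatible.
        return "purchased " + action, action == "lens:X100"
    return "unknown instruction", False


def act(instruction, memory):
    """Toy agent: the lens instruction is underspecified without memory."""
    prefix = "purchased camera "
    if instruction == "buy camera":
        return "camera:X100"
    for note in reversed(memory):
        if note.startswith(prefix):
            return "lens:" + note[len(prefix):]
    return "lens:generic"  # no record of the camera, so the agent guesses


def keep_memory(memory, observation):
    return memory + [observation]


def drop_memory(memory, observation):
    return []  # memoryless baseline: nothing survives the session


sessions = ["buy camera", "buy lens"]
print(run_sessions(sessions, act, env_step, keep_memory))  # [True, True]
print(run_sessions(sessions, act, env_step, drop_memory))  # [True, False]
```

The only difference between the two runs is the memory update, so the failure in the second session isolates the effect of losing information from the first.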

System Modules

Agent

Selects actions based on current instruction and retrieved memory

Model or implementation: Various SOTA agents (Long-context, RAG-augmented, External Memory)

Memory System

Stores interaction history and retrieves relevant information for current subtask

Model or implementation: Varies (Long-context buffer, RAG, or Memory Agent)

Environment

Executes actions and provides feedback/observations

Model or implementation: Domain-specific simulators (Shopping, TravelPlanner, BrowseComp, Math/Physics)
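
To make the Memory System role concrete, here is a minimal sketch of a store-and-retrieve memory. It uses plain keyword overlap as a stand-in for a real retriever; KeywordMemory and its methods are assumptions made for illustration and do not correspond to any memory system evaluated in the paper.

```python
class KeywordMemory:
    """Toy memory: stores text notes and retrieves them by word overlap."""

    def __init__(self):
        self.entries = []

    def add(self, text):
        self.entries.append(text)

    def retrieve(self, query, k=2):
        """Return up to k notes sharing at least one word with the query."""
        query_words = set(query.lower().split())
        scored = []
        for index, entry in enumerate(self.entries):
            overlap = len(query_words & set(entry.lower().split()))
            if overlap > 0:
                scored.append((overlap, index, entry))
        # Highest overlap first; ties go to the more recent note.
        scored.sort(key=lambda item: (-item[0], -item[1]))
        return [entry for _, _, entry in scored[:k]]


memory = KeywordMemory()
memory.add("Session 1: purchased camera body model X100")
memory.add("Session 1: user prefers free shipping")
memory.add("Session 2: booked hotel in Lisbon")

print(memory.retrieve("find a lens for the camera", k=1))
```

A retriever this simple finds the camera note only because the query happens to share the word "camera" with it; latent constraints that are never mentioned in the current instruction would be missed, which is the kind of failure the benchmark is designed to expose.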

Novel Architectural Elements

Evaluation framework enforcing strict causal dependencies between sessions (e.g., Session 2 is unsolvable without specific facts from Session 1)
Integration of four distinct domain environments into a unified multi-session memory loop structure

Modeling

Base Model: Various. The evaluated systems are described only as 'state-of-the-art agents'; specific model names are not listed in the provided text.

Comparison to Prior Work

vs. LoCoMo/LongMemEval: MemoryArena requires active use of memory to guide actions in a dynamic loop, not just static recall
vs. SWE-Bench/WebArena: MemoryArena enforces multi-session dependencies where history exceeds context windows, requiring persistent memory strategies rather than flat context
vs. Mem2ActBench [not cited in paper]: MemoryArena focuses on interdependent subtasks where skills/knowledge must be distilled, rather than just retrieving facts from static traces

Limitations

Requires complex environment setup for all four domains
Evaluation is time-consuming due to long horizons (avg 57 steps)
Focuses on text/code/web actions, excluding multimodal physical embodiment

Reproducibility

Code: https://memoryarena.github.io/

The benchmark is released at https://memoryarena.github.io/. The data creation process is described in detail, including the use of expert annotators for math/physics and automated filtering for web tasks. Specific model weights for the evaluated agents are not detailed in the provided text.

📊 Experiments & Results

Evaluation Setup

Multi-session agentic tasks across four domains requiring persistent memory

Benchmarks:

Bundled Web Shopping (Sequential product purchasing with compatibility constraints) [New]
Preference-Constrained Group Travel Planning (Iterative itinerary planning with joining members) [New]
Progressive Information Searching (Multi-step search where queries build on prior results) [New]
Sequential Formal Reasoning (Math/Physics proofs where claims depend on prior lemmas) [New]

Metrics:

Task Completion Rate
Statistical methodology: Not explicitly reported in the paper
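
The provided text does not give an exact formula for Task Completion Rate. The sketch below assumes one plausible definition (a task counts as complete only if every one of its subtasks succeeds) and adds a per-subtask rate for comparison. Both function names and the sample outcomes are hypothetical, not results from the paper.

```python
def task_completion_rate(results):
    """Fraction of tasks whose subtasks all succeeded (assumed definition)."""
    if not results:
        return 0.0
    completed = sum(1 for task in results if task and all(task))
    return completed / len(results)


def subtask_success_rate(results):
    """Fraction of individual subtasks that succeeded, pooled over all tasks."""
    outcomes = [ok for task in results for ok in task]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Made-up outcomes: one inner list per task, one boolean per subtask.
results = [
    [True, True],
    [True, False],
    [True, True, True],
    [False, False],
]

print(task_completion_rate(results))   # 0.5
print(subtask_success_rate(results))
```

Under this definition a single failed session zeroes out the whole task, so the task-level rate can be much lower than the subtask-level rate when errors compound across sessions.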

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MemoryArena (All Domains)	Task Completion Rate	100	Low (Qualitative)	Large Gap

Main Takeaways

Agents that saturate static memory benchmarks (like LoCoMo) fail in MemoryArena, indicating a gap between 'memorization' and 'actionable memory'.
Current agents struggle to maintain and exploit latent task states across sessions.
Success in single-session benchmarks (SWE-Bench) does not predict success in multi-session interdependent tasks.
The 'Memory-Agent-Environment loop' is a more rigorous test of agent capabilities than isolated recall or single-session action.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM agents and tool use
Familiarity with Retrieval-Augmented Generation (RAG)
Basic concepts of long-context language models

Key Terms

Memory-Agent-Environment loop: A process where actions elicit feedback, feedback updates memory, and memory conditions subsequent actions across multiple sessions

Interdependent subtasks: Tasks where the correct execution of a later step requires information or constraints established in an earlier step

Latent constraints: Requirements (like product compatibility or user preferences) that are not explicitly restated in current instructions but must be recalled from history

LoCoMo: A benchmark for evaluating long-term conversational memory, primarily through question answering over long multi-session dialogues

RAG: Retrieval-Augmented Generation, in which a system retrieves relevant documents to ground model generation

Action space: The set of all possible actions an agent can take in an environment (e.g., click, type, search)

Reasoning trace: The sequence of thoughts and actions an agent generates while solving a problem