
MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, Zhenhua Dong
Renmin University of China, Huawei Noah’s Ark Lab
Annual Meeting of the Association for Computational Linguistics (2025)
Tags: Memory · Agent · Benchmark · P13N

📝 Paper Summary

Topics: Memory in LLMs · Agent evaluation
MemBench is a comprehensive benchmark for the memory of LLM-based agents: it introduces reflective memory tasks and distinct observation-versus-participation scenarios to better mirror real-world agent usage.
Core Problem
Existing memory evaluations for LLM agents are limited to factual recall in direct participation scenarios, neglecting high-level reflective reasoning (inferring preferences) and passive observation scenarios.
Why it matters:
  • Real-world agents must not only recall explicit facts but also infer implicit user preferences (reflective memory) to provide personalized assistance.
  • Agents often operate in observation modes (passively recording user streams) which differ fundamentally from participation modes (active dialogue) where the agent's own responses pollute the context.
  • Current metrics focus on effectiveness (accuracy) while ignoring efficiency (time cost) and capacity limits, which are critical for deployment.
Concrete Example: A user might say "I love spicy food" (factual). Later, they might praise specific dishes like "Sichuan hotpot" and "Mapo Tofu." Current benchmarks test if the agent remembers "Sichuan hotpot," but fail to test if the agent infers the high-level preference "User likes Sichuan cuisine" (reflective), which is crucial for future recommendations.
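To make the factual/reflective distinction concrete, here is a minimal Python sketch of what a paired test item could look like. The field names and structure are illustrative assumptions, not MemBench's actual schema:

```python
# Hypothetical paired test item contrasting factual recall with reflective
# inference. Field names are illustrative assumptions, not MemBench's schema.
test_item = {
    "history": [
        "I love spicy food.",                          # explicit fact
        "That Sichuan hotpot last week was amazing.",  # low-level behavior
        "Mapo tofu is my go-to order these days.",     # low-level behavior
    ],
    "factual_query": {
        "question": "Which hotpot dish did the user praise?",
        "answer": "Sichuan hotpot",  # recallable verbatim from the history
    },
    "reflective_query": {
        "question": "Which cuisine is the user likely to enjoy?",
        # Never stated explicitly; must be abstracted from the behaviors above.
        "answer": "Sichuan cuisine",
    },
}
```

A benchmark built around such pairs can score recall and abstraction separately on the same underlying history.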
Key Novelty
Multi-Level, Multi-Scenario Memory Evaluation Framework
  • Introduces 'Reflective Memory' as a distinct evaluation tier: testing the agent's ability to abstract high-level preferences (e.g., taste in movies) from low-level behaviors, rather than just recalling explicit facts.
  • Distinguishes between 'Participation' (interactive dialogue) and 'Observation' (passive message stream) scenarios to isolate memory capabilities from the agent's generation/reasoning modules (see the scenario sketch after this list).
  • Constructs a large-scale dataset using user relation graphs derived from real recommendation datasets (MovieLens, Food, Goodreads) to ensure realistic preference distributions.
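The two scenarios can be pictured with a short sketch. The agent interface below (respond / observe / answer) is a hypothetical stand-in for a memory-equipped agent, not MemBench's actual API:

```python
def run_participation(agent, dialogue_turns, probe_question):
    """Participation: active dialogue, so the agent's own replies are
    written into memory alongside the user's turns."""
    for user_turn in dialogue_turns:
        agent.respond(user_turn)  # generates a reply and stores both sides
    return agent.answer(probe_question)


def run_observation(agent, message_stream, probe_question):
    """Observation: the agent passively records the user's message stream;
    no generated replies enter (and dilute) the memory."""
    for message in message_stream:
        agent.observe(message)  # record only, no generation
    return agent.answer(probe_question)
```

The key design point is that observation runs contain only user content, so any memory failure there cannot be blamed on the agent's own generated turns crowding the context.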
Evaluation Highlights
  • Traditional memory mechanisms like MemoryBank and GenerativeAgent struggle significantly with reflective memory compared to factual memory.
  • Vector-retrieval-based memory shows a sharp performance decline as memory context length increases, revealing capacity limits that full-context models do not exhibit (a minimal retrieval sketch follows this list).
  • The benchmark includes over 500 user profiles and scales up to 100k tokens per test session to stress-test long-term memory limits.
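The capacity finding is easiest to see against a bare-bones vector memory. The sketch below assumes a hypothetical embed() function and a fixed top-k retrieval window; it illustrates the general retrieval pattern, not any specific baseline's implementation:

```python
import numpy as np

class VectorMemory:
    """Bare-bones vector-retrieval memory: embed every entry, recall the
    top-k most similar ones. embed() is a hypothetical placeholder."""

    def __init__(self, embed, top_k=5):
        self.embed = embed
        self.top_k = top_k
        self.entries = []   # raw text entries
        self.vectors = []   # their embeddings

    def write(self, text):
        self.entries.append(text)
        self.vectors.append(self.embed(text))

    def recall(self, query):
        q = self.embed(query)
        mat = np.asarray(self.vectors)
        sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-9)
        top = np.argsort(sims)[-self.top_k:][::-1]
        # Only top_k entries ever reach the LLM: as the history grows,
        # relevant but lower-ranked entries fall outside this fixed window,
        # one plausible source of the capacity decline MemBench observes.
        return [self.entries[i] for i in top]
```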
Breakthrough Assessment
7/10
Strong contribution in formalizing 'reflective memory' and 'observation scenarios,' addressing a clear gap in agent evaluation. While it doesn't propose a new model, the benchmark is likely to become a standard for future memory research.