University of California San Diego,
Stanford University
arXiv
(2025)
Memory, Agent, RL, RAG, Benchmark
📝 Paper Summary
Memory organization, Agentic memory management, Reinforcement Learning for Agents
Mem-α uses reinforcement learning to train LLM agents to actively manage a complex, multi-component memory system, optimizing directly for downstream question-answering accuracy rather than relying on fixed rules.
Core Problem
Current memory-augmented agents rely on fixed, pre-defined rules or prompts to update memory, but LLMs often fail to determine what to store, how to structure it, or when to update it effectively, especially in complex scenarios.
Why it matters:
Pre-defined rules are brittle and cannot adapt to diverse interaction patterns, leading to information loss or bloated memory
Even state-of-the-art models like GPT-4o struggle to spontaneously select the correct tools for complex memory updates without explicit training
Small models with weak instruction-following capabilities get overwhelmed by complex memory tool sets, making effective long-term memory inaccessible to efficient models
Concrete Example: When an agent receives a long stream of information that mixes casual conversation, storytelling, and factual documents, a rule-based system might save everything (overflowing the context) or miss subtle plot points. Mem-α learns to extract only the facts needed to answer future questions and to discard the noise.
Key Novelty
Reinforcement Learning for Active Memory Construction
Formulates memory management as a sequential decision-making problem where the agent decides how to update Core, Episodic, and Semantic memory chunks
Optimizes memory construction directly against downstream QA performance (RAG accuracy) rather than supervising the memory trace itself, allowing the agent to discover its own optimal storage strategies
Generalizes far beyond its training length: trained on sequences of up to 30k tokens, it handles sequences of more than 400k tokens
Architecture
The memory architecture and interaction flow. It displays the three memory components (Core, Semantic, Episodic) and the allowed operations for each.
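The three-part layout can be pictured with a small data structure. This is a minimal sketch, not the paper's implementation: the class name `Memory`, the integer ids, and the specific operation set (core memory only overwritten, semantic and episodic entries inserted, updated, or deleted) are assumptions made for illustration, since this summary does not list the allowed operations.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Memory:
    """Toy three-part memory; names and operations are illustrative assumptions."""
    core: str = ""                                           # persistent summary kept in context
    semantic: Dict[int, str] = field(default_factory=dict)   # discrete facts, keyed by id
    episodic: Dict[int, str] = field(default_factory=dict)   # timestamped events, keyed by id
    _next_id: int = 0                                        # shared id counter for new entries

    def _store(self, kind: str) -> Dict[int, str]:
        # Only the two list-like memories support entry-level operations here.
        if kind not in ("semantic", "episodic"):
            raise ValueError(f"unknown memory type: {kind}")
        return getattr(self, kind)

    def update_core(self, text: str) -> None:
        # Core memory is a single block of text that is overwritten in place.
        self.core = text

    def insert(self, kind: str, text: str) -> int:
        store = self._store(kind)
        entry_id = self._next_id
        self._next_id += 1
        store[entry_id] = text
        return entry_id

    def update(self, kind: str, entry_id: int, text: str) -> None:
        store = self._store(kind)
        if entry_id not in store:
            raise KeyError(entry_id)
        store[entry_id] = text

    def delete(self, kind: str, entry_id: int) -> None:
        del self._store(kind)[entry_id]
```

In this sketch a trained agent would emit tool calls that map onto `update_core`, `insert`, `update`, and `delete`; the RL objective described above decides which calls are worth making.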
Evaluation Highlights
Generalizes to sequences exceeding 400k tokens (13× the max training length of 30k) while maintaining high retrieval accuracy
Outperforms existing memory baselines (including MemGPT and Mem0) across diverse interaction patterns
Demonstrates that RL enables agents to learn fundamental memory principles (what to keep/discard) rather than just memorizing patterns
Breakthrough Assessment
8/10
A significant step toward memory agents that learn their own update policy instead of following fixed rules. The 13× length generalization from training to inference is a particularly strong result for an RL-based method.
⚙️ Technical Details
Problem Definition
Setting: Sequential decision-making for memory construction over a stream of conversation chunks
Inputs: A sequence of conversation chunks C = {c_1, ..., c_n}
Outputs: A sequence of memory write actions A = {a_1, ..., a_n} resulting in a final memory state M_n
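The setting above can be sketched as a simple loop: the policy sees the current memory and the next chunk, emits write actions, and the memory state is updated before the next chunk arrives. The flat-list memory, the `(operation, payload)` action format, and `keep_facts_policy` are hypothetical simplifications; in Mem-α the policy is the trained LLM and the memory has three components.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, str]      # (operation, payload): a hypothetical action format
MemoryState = List[str]       # memory simplified to a flat list of entries
Policy = Callable[[MemoryState, str], List[Action]]


def apply_action(memory: MemoryState, action: Action) -> MemoryState:
    """Return the memory state that results from one write action."""
    op, payload = action
    if op == "insert":
        return memory + [payload]
    if op == "delete":
        return [entry for entry in memory if entry != payload]
    return memory  # unknown operations are ignored in this sketch


def build_memory(chunks: List[str], policy: Policy) -> MemoryState:
    """Process chunks c_1..c_n in order and return the final state M_n."""
    memory: MemoryState = []                    # M_0: empty memory
    for chunk in chunks:                        # c_t
        for action in policy(memory, chunk):    # a_t depends on M_{t-1} and c_t
            memory = apply_action(memory, action)
    return memory


def keep_facts_policy(memory: MemoryState, chunk: str) -> List[Action]:
    """Stand-in for the learned policy: store only chunks tagged as facts."""
    if chunk.startswith("FACT:"):
        return [("insert", chunk[len("FACT:"):].strip())]
    return []
```

Calling `build_memory(["hi there", "FACT: Alice lives in Paris"], keep_facts_policy)` keeps only the tagged fact. The final state `M_n` is what the downstream QA step retrieves from.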
Purpose: Reward semantic validity, i.e. memory updates that an external judge model accepts as semantically valid.
Formally: r4 = (number of semantically valid updates) / (total number of updates), with validity checked by Qwen3-32B
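As a rough illustration of the r4 definition, the fraction can be computed as below. The judge is passed in as a plain function; `is_valid` is a hypothetical stand-in for a call to the external model, and returning 0.0 when there are no updates is an assumption, not something the summary specifies.

```python
from typing import Callable, List


def semantic_validity_reward(updates: List[str], is_valid: Callable[[str], bool]) -> float:
    """Fraction of memory updates that the judge accepts as semantically valid."""
    if not updates:
        return 0.0  # assumption: an episode with no updates earns no r4 reward
    valid = sum(1 for update in updates if is_valid(update))
    return valid / len(updates)


def toy_judge(update: str) -> bool:
    """Hypothetical judge: treats empty or whitespace-only updates as invalid."""
    return bool(update.strip())
```

With `toy_judge`, two non-empty updates out of four give a reward of 0.5.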
Training Data:
4,139 total instances spanning diverse patterns (Conversation, Document, Pattern, Story)
Stratified sampling used to create a balanced subset of 562 instances for RL training
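Stratified sampling of this kind can be sketched as follows: group instances by pattern, then draw from each group separately so that no single pattern dominates the subset. The per-category quota and the `(category, item)` tuple format are assumptions; the summary gives only the totals (4,139 and 562), not the per-category counts.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Instance = Tuple[str, str]  # (category, item): a hypothetical instance format


def stratified_sample(instances: Sequence[Instance], per_category: int, seed: int = 0) -> List[Instance]:
    """Draw up to `per_category` instances from each category."""
    rng = random.Random(seed)
    by_category: Dict[str, List[Instance]] = defaultdict(list)
    for category, item in instances:
        by_category[category].append((category, item))
    sample: List[Instance] = []
    for category in sorted(by_category):          # sorted for reproducibility
        group = by_category[category]
        k = min(per_category, len(group))         # small categories contribute all they have
        sample.extend(rng.sample(group, k))
    return sample
```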
Key Hyperparameters:
max_training_length: 30,000 tokens
reward_weights: Tunable parameters β and γ in reward function
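One plausible way to read the reward weights is as coefficients on auxiliary terms added to the main task reward. The sketch below assumes exactly that form; which terms β and γ actually scale is not stated in this summary, so `aux_a` and `aux_b` are placeholders.

```python
def total_reward(task_reward: float, aux_a: float, aux_b: float, beta: float, gamma: float) -> float:
    """Weighted sum of a task reward and two auxiliary terms (assumed form)."""
    # beta and gamma trade off the auxiliary signals against task accuracy.
    return task_reward + beta * aux_a + gamma * aux_b
```

Setting `beta` or `gamma` to 0 removes the corresponding term, which is how the effect of each auxiliary signal could be ablated under this assumed form.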
Compute: Not reported in the paper
Comparison to Prior Work
vs. MemGPT: Mem-α learns the update policy via RL instead of relying on prompt engineering; MemGPT is static.
vs. Memory-R1: Mem-α handles a complex 3-part memory structure (Core/Episodic/Semantic) capable of evolving knowledge, whereas Memory-R1 uses simpler text-only memory or simple lists.
vs. SELF-PARAM [not cited in paper]: SELF-PARAM internalizes memory into weights; Mem-α uses external explicit storage, which is not bounded by model capacity and can be edited directly.
vs. MIRIX: Mem-α trains the agent to use tools, whereas MIRIX expects the model to use complex tools zero-shot (which fails for smaller models).
Limitations
Conflict resolution (handling contradictory information) was excluded from evaluation due to lack of realistic benchmarks.
Reliance on a fixed retriever (BM25) during training means the memory structure is optimized specifically for lexical overlap, potentially limiting semantic retrieval capabilities.
Computational overhead of RL training on long sequences necessitated using a small subset (562 instances) of the full dataset.
Reward signal is sparse (delayed until end of sequence), which is generally hard to optimize, though intermediate format rewards help.
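The lexical-overlap limitation of a BM25 retriever can be seen in a small example. Below is a minimal pure-Python sketch of Okapi BM25 scoring with whitespace tokenization; the parameter values `k1=1.5` and `b=0.75` are common defaults, not values reported for Mem-α.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with a basic Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no exact token match, no contribution
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


memory_entries = ["the user owns a car", "the user owns an automobile"]
scores = bm25_scores("car", memory_entries)
```

Here the entry that says "automobile" scores zero for the query "car" even though it carries the same meaning, so a memory writer trained against this retriever is pushed toward wording that matches likely query terms.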
Reproducibility
The paper does not explicitly provide code. Dataset details (4,139 instances) and construction methods are described. The base model (Qwen-2.5) has open weights, but the trained weights and exact training scripts are not linked.
📊 Experiments & Results
Evaluation Setup
Memory construction followed by RAG-based Question Answering
RAG: Retrieval-Augmented Generation, in which a system answers a question by first retrieving relevant documents and then conditioning its answer on them
GRPO: Group Relative Policy Optimization, an RL algorithm that normalizes rewards within a group of sampled outputs to stabilize training without a separate value function
BM25: A ranking function used by search engines to estimate the relevance of documents to a given search query
Core Memory: A persistent text summary that stays in the agent's active context (RAM-like)
Episodic Memory: A chronologically organized collection of timestamped events (log-like)
Semantic Memory: A structured collection of discrete factual statements or knowledge (database-like)
LoCoMo: A benchmark of very long, multi-session conversations used to evaluate long-term conversational memory in LLM agents
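The group normalization in the GRPO entry above can be sketched in a few lines: each sampled output's reward is compared with the mean of its group and scaled by the group's standard deviation. Using the population standard deviation and an `eps` term for numerical safety are assumptions of this sketch; implementations differ on both.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of sampled outputs."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; eps guards against division by zero
    return [(reward - mean) / (std + eps) for reward in rewards]
```

Outputs that beat their group's average get positive advantages and the rest get negative ones, so no learned value function is needed as a baseline. If every output in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.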