
Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

(Germany) Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, Volker Tresp, Yunpu Ma
Ludwig Maximilian University of Munich, Technical University of Munich, University of Cambridge, University of Hong Kong, Technical University of Darmstadt, University of Edinburgh
arXiv, 8/2025 (2026)
Memory RL Agent RAG Reasoning

📝 Paper Summary

Memory recall Memory organization Agentic RAG pipeline
Memory-R1 trains two specialized agents via reinforcement learning to autonomously manage memory operations and filter retrieved information for long-context reasoning, rather than relying on static heuristics.
Core Problem
Existing memory-augmented LLMs rely on static heuristics or manual instructions for memory updates, failing to adaptively decide what to store, update, or delete during evolving conversations.
Why it matters:
  • Heuristic retrieval often returns too much irrelevant noise or misses crucial context, confusing the model during reasoning
  • Vanilla LLMs frequently misinterpret memory updates (e.g., treating new information as a contradiction rather than an addition), leading to data loss
  • Supervised fine-tuning is impractical for memory management because labeling every memory operation decision in long conversations is prohibitively expensive
Concrete Example: A user, Andrew, says 'I adopted a dog named Buddy' and later adds 'I adopted another dog named Scout'. A standard system treats the second statement as a contradiction and issues DELETE+ADD, which erases the first dog from memory. Memory-R1 instead issues an UPDATE that consolidates both facts: 'Andrew adopted two dogs, Buddy and Scout'.
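The difference between the two behaviours can be made concrete with a toy memory bank. This is only an illustrative sketch: the `MemoryStore` class, its integer entry ids, and the two `*_run` helpers are hypothetical names introduced here, not the paper's implementation. Only the four operation names (ADD, UPDATE, DELETE, NOOP) come from the summary.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Toy memory bank keyed by integer ids (hypothetical, for illustration)."""
    entries: dict = field(default_factory=dict)
    next_id: int = 0

    def apply(self, op, text=None, entry_id=None):
        """Apply one of the four memory operations named in the summary."""
        if op == "ADD":
            self.entries[self.next_id] = text
            self.next_id += 1
        elif op == "UPDATE":
            if entry_id not in self.entries:
                raise KeyError(f"no entry with id {entry_id}")
            self.entries[entry_id] = text
        elif op == "DELETE":
            del self.entries[entry_id]
        elif op == "NOOP":
            pass  # leave the memory bank unchanged
        else:
            raise ValueError(f"unknown operation: {op}")
        return self


def delete_add_run():
    """Second statement treated as a contradiction: Buddy is lost."""
    store = MemoryStore()
    store.apply("ADD", "Andrew adopted a dog named Buddy")
    store.apply("DELETE", entry_id=0)
    store.apply("ADD", "Andrew adopted a dog named Scout")
    return list(store.entries.values())


def update_run():
    """Second statement treated as an addition: both dogs are kept."""
    store = MemoryStore()
    store.apply("ADD", "Andrew adopted a dog named Buddy")
    store.apply("UPDATE", "Andrew adopted two dogs, Buddy and Scout", entry_id=0)
    return list(store.entries.values())
```

After `delete_add_run()` the store holds a single entry that mentions only Scout, while `update_run()` leaves a single entry that mentions both dogs. A question such as "How many dogs does Andrew have?" can only be answered correctly from the second store.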
Key Novelty
Outcome-Driven Reinforcement Learning for Memory Management
  • Treats memory operations (ADD, UPDATE, DELETE, NOOP) as RL actions optimized directly against the final answer correctness, removing the need for intermediate labels
  • Splits the responsibility into two learned agents: a Memory Manager that maintains the storage state and an Answer Agent that learns to filter/distill retrieved memories before reasoning
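A rough sketch of what "optimized directly against the final answer correctness" can look like in practice is given below. It assumes an exact-match reward and the group-relative advantage normalization commonly associated with GRPO, the algorithm named in the evaluation. The function names, the string normalization, and the choice of population standard deviation are assumptions made for illustration, and the paper's actual reward may differ.

```python
from statistics import mean, pstdev


def exact_match_reward(predicted, gold):
    """Outcome reward: 1.0 if the final answer matches the gold answer.

    Assumption: case-insensitive exact match after trimming whitespace.
    """
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its sampled group.

    Rollouts that beat the group average receive a positive advantage,
    so no per-operation label is needed, only the final outcome.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for the question "How many dogs does Andrew have?"
answers = ["two", "one", "one", "Two"]
rewards = [exact_match_reward(a, "two") for a in answers]
advantages = group_relative_advantages(rewards)
```

In this example the two rollouts whose memory operations preserved both dogs receive positive advantages and the other two receive negative ones. The training signal for the Memory Manager therefore comes entirely from whether the downstream answer was right.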
Evaluation Highlights
  • 28.5% improvement in F1 score on the LoCoMo benchmark over the MemoryOS baseline, using LLaMA-3.1-8B-Instruct trained with GRPO
  • Achieves state-of-the-art performance with only 152 training QA pairs, demonstrating extreme data efficiency
  • Zero-shot generalization to unseen benchmarks (MSC and LongMemEval) yields consistent gains across single-hop, multi-hop, and temporal reasoning tasks
Breakthrough Assessment
8/10
Significant shift from heuristic to learned memory management. The ability to achieve SOTA with extremely limited data (152 samples) and strong zero-shot transfer suggests a highly effective generalized mechanism.