ByteDance Seed,
Institute for AI Industry Research (AIR), Tsinghua University,
SIA-Lab of Tsinghua AIR and ByteDance Seed
arXiv.org
(2025)
Memory, Agent, RL, QA
📝 Paper Summary
Memory organization, Agentic AI
MemAgent enables LLMs to process effectively infinite context with linear complexity by using reinforcement learning to train a policy that iteratively compresses text chunks into a fixed-size memory.
Core Problem
Processing extremely long contexts (e.g., books, long-term agent memory) with standard Transformers incurs quadratic computational costs and performance degradation when extrapolating beyond training limits.
Why it matters:
Existing length extrapolation methods suffer from performance drops and slow processing speeds due to O(n^2) complexity on extremely long text
Sparse and linear attention mechanisms often require training from scratch or rely on rigid, human-defined patterns
Context compression approaches typically struggle with extrapolation and require external modules that disrupt standard generation processes
Concrete Example: When a standard LLM reads a 4-million-token document, the attention mechanism becomes prohibitively expensive. MemAgent instead reads the document in segments and updates a small 'memory' note after each segment, much as a human takes notes while reading.
Treats memory updates not as appending to a log, but as an 'overwrite' action where the model decides what to keep or discard from a fixed-size buffer
Uses Multi-Conversation Reinforcement Learning to train the model to retain answer-critical information purely from outcome rewards (correct final answers), without human annotations for the memory content itself
Architecture
The MemAgent workflow showing the segment-by-segment processing stream.
Evaluation Highlights
Achieves >95% accuracy on the 512K token RULER benchmark
Extrapolates from an 8K training context to 3.5M token QA tasks with <5% performance loss
Maintains strictly linear O(N) computational complexity and constant memory usage per step regardless of input length
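The linear-cost claim follows from a short calculation. The sketch below uses N and K as defined in this summary; the symbols s (chunk size), m (memory size), |q| (query length) and L (per-step context length) are introduced here for illustration and are not the paper's notation.

```latex
\begin{aligned}
K &= \left\lceil N / s \right\rceil, \qquad L = |q| + m + s \quad \text{(fixed, independent of } N\text{)} \\
\text{Cost}_{\text{MemAgent}} &= K \cdot O(L^{2}) = O\!\left(\frac{N}{s}\, L^{2}\right) = O(N) \quad \text{for fixed } L \\
\text{Cost}_{\text{full attention}} &= O(N^{2})
\end{aligned}
```

Each step still pays quadratic attention cost, but only over the bounded window L, so the total grows with the number of chunks rather than with the square of the document length.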
Breakthrough Assessment
9/10
Proposes a fundamental shift from attention-based context extension to RL-based memory compression, achieving linear scaling for infinite context without architectural changes to the base LLM.
⚙️ Technical Details
Problem Definition
Setting: Long-context Question Answering and Reasoning where input length N >> model context window C
Inputs: Long document split into K chunks (c^1, ..., c^K) and a query q
Outputs: Final answer a generated based on the final memory state m^K
Answer-Generation (Produces final result from memory)
System Modules
Context-Processing Module
Iteratively reads a text chunk and the previous memory, then generates a new updated memory
Model or implementation: Base LLM (shared weights)
Answer-Generation Module
Generates the final answer using the accumulated memory after all chunks are processed
Model or implementation: Base LLM (shared weights)
Novel Architectural Elements
Recurrent-style memory injection: The output of the previous step (memory tokens) is fed as input to the next step's context window, treating the Transformer as a recurrent network over chunks
Fixed-size memory constraints enforced during generation to ensure O(1) compute per chunk
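The two modules and the overwrite-style memory can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the prompt template, the `generate` callable, the whitespace-based `truncate`, and the `toy_generate` stand-in are all hypothetical, and a real system would count tokens with the model's tokenizer.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to generated text.
Generate = Callable[[str], str]


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def truncate(text: str, max_tokens: int) -> str:
    """Enforce the fixed memory budget (whitespace tokens stand in for real tokens)."""
    return " ".join(text.split()[:max_tokens])


def answer_with_memory(generate: Generate, query: str, tokens: List[str],
                       chunk_size: int = 5000, memory_size: int = 1024) -> str:
    """Read the document chunk by chunk, overwriting a bounded memory each step."""
    memory = ""  # initial memory is empty
    for chunk in split_into_chunks(tokens, chunk_size):
        # Context-processing step: previous memory + current chunk -> new memory.
        prompt = (f"Question: {query}\nMemory: {memory}\n"
                  f"Section: {' '.join(chunk)}\nUpdated memory:")
        memory = truncate(generate(prompt), memory_size)  # overwrite, never append
    # Answer-generation step: only the query and the final memory are visible.
    return generate(f"Question: {query}\nMemory: {memory}\nAnswer:")


def toy_generate(prompt: str) -> str:
    """Toy stand-in for the LLM: keeps old memory plus any 'KEY=' words it sees."""
    after_memory = prompt.split("Memory: ", 1)[1]
    if prompt.endswith("Answer:"):
        return after_memory.rsplit("\nAnswer:", 1)[0]
    memory, rest = after_memory.split("\nSection: ", 1)
    section = rest.rsplit("\nUpdated memory:", 1)[0]
    hits = [w for w in section.split() if w.startswith("KEY=")]
    return " ".join(filter(None, [memory] + hits))
```

Because the prompt at every step contains only the query, one chunk, and the bounded memory, the per-step context length does not depend on the document length. In the paper the same base LLM plays both roles; here `toy_generate` merely shows the data flow.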
Modeling
Base Model: LLM with an 8K context window (the summarized text does not name a specific architecture; a standard dense Transformer is implied)
Training Method: Group Relative Policy Optimization (GRPO) adapted for Multi-Conversation workflows
Objective Functions:
Purpose: Optimize the policy to generate memories that lead to correct answers.
Formally: GRPO objective (Eq 5) using importance sampling weights and KL penalty.
Purpose: Define success for QA tasks with equivalent answers.
Formally: Reward = 1 if predicted answer matches any ground truth, 0 otherwise (Eq 6).
Purpose: Define success for Multi-Value retrieval tasks.
Formally: Reward based on the intersection of predicted and ground truth sets (Eq 7).
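The rewards and the group-relative advantage described above can be sketched in a few lines. Everything here is an assumption made for illustration: the `normalize` helper, the recall-style form of the set-overlap reward, the standard-deviation normalization (the usual GRPO form; the paper's exact variant may differ), and the `broadcast_to_conversations` helper that shares one outcome-based advantage across all memory-update conversations of a trajectory.

```python
from statistics import mean, pstdev
from typing import Iterable, List


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def qa_reward(predicted: str, ground_truths: Iterable[str]) -> float:
    """Return 1.0 if the prediction matches any acceptable answer, else 0.0."""
    return 1.0 if normalize(predicted) in {normalize(g) for g in ground_truths} else 0.0


def multi_value_reward(predicted: Iterable[str], ground_truths: Iterable[str]) -> float:
    """Fraction of ground-truth values recovered (assumed form of the overlap reward)."""
    truth = {normalize(g) for g in ground_truths}
    found = {normalize(p) for p in predicted}
    return len(truth & found) / len(truth) if truth else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center and scale rewards within one group of samples for the same input."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_conversations(advantages: List[float],
                               conversations_per_sample: List[int]) -> List[List[float]]:
    """Give every conversation of a trajectory that trajectory's advantage."""
    return [[a] * n for a, n in zip(advantages, conversations_per_sample)]
```

Only the final answer is scored, so the memory-writing steps receive their learning signal indirectly: a memory that drops answer-critical information leads to a wrong answer, and every conversation in that trajectory inherits the low advantage.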
Training Data:
Trained on documents up to 32K tokens long
Evaluated on documents up to 4M tokens long
Key Hyperparameters:
memory_size: 1024 tokens
chunk_size: 5000 tokens
context_window: 8K
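A quick arithmetic check shows how these hyperparameters fit together. The sketch assumes that "8K" means 8192 tokens and that the newly written memory (up to `memory_size` tokens) must also fit in the window; both are assumptions, as are the function names.

```python
import math

MEMORY_SIZE = 1024      # tokens of memory carried between steps
CHUNK_SIZE = 5000       # tokens of document read per step
CONTEXT_WINDOW = 8192   # assumption: "8K" interpreted as 8192 tokens


def prompt_headroom(context_window: int = CONTEXT_WINDOW,
                    memory_size: int = MEMORY_SIZE,
                    chunk_size: int = CHUNK_SIZE) -> int:
    """Tokens left for the query and instructions after memory in, chunk, and memory out."""
    return context_window - memory_size - chunk_size - memory_size


def num_steps(doc_tokens: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Number of memory-update steps needed to read the whole document."""
    return math.ceil(doc_tokens / chunk_size)


print(prompt_headroom())      # 1144 tokens remain for query and prompt template
print(num_steps(32_000))      # 7 steps for a 32K-token training document
print(num_steps(3_500_000))   # 700 steps for a 3.5M-token evaluation document
```

Under these assumptions a training document takes about 7 update steps while the longest evaluation documents take hundreds, which is the extrapolation gap the learned memory policy has to bridge.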
Comparison to Prior Work
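vs. Extrapolation: MemAgent avoids performance degradation at extreme lengths because each step processes one segment plus the carried-over memory, so the model never sees an input longer than its training context window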
vs. Linear Attention: MemAgent works with standard Transformer architectures without training from scratch or custom kernels
vs. Context Compression: MemAgent uses end-to-end RL to learn *what* to compress, rather than heuristics or separate compressor modules
vs. Search-R1/Agent-R1 [not cited in paper]: MemAgent optimizes long-context memory specifically, whereas these optimize tool-use trajectories
Limitations
The paper snippet does not report performance on tasks requiring fine-grained citations of specific positions in the original text (which might be lost in memory compression)
Depends on a verifiable outcome reward, which may be difficult to define for open-ended creative writing tasks
Code link provided (https://memagent-sialab.github.io/). The paper describes the algorithm (Multi-Conv DAPO/GRPO) and the reward functions mathematically.
📊 Experiments & Results
Evaluation Setup
Long-context Question Answering and Retrieval
Benchmarks:
RULER: synthetic long-context benchmark (Needle in a Haystack, etc.)
QA Tasks: question answering on documents up to 4M tokens
Metrics:
Accuracy / Success Rate
Performance Loss (relative to short context)
Statistical methodology: Not explicitly reported in the paper
Experiment Figures
Illustration of the Multi-Conv DAPO optimization process.
Main Takeaways
The method extrapolates from training with an 8K context window on documents up to 32K tokens to test documents of 3.5M to 4M tokens, a scaling factor of more than 100x in document length that is rarely seen in standard extrapolation.
Computational cost is strictly linear O(N), solving the quadratic bottleneck of standard Transformers.
The 'overwrite' memory strategy works effectively without losing critical information, evidenced by high RULER scores (>95%).
GRPO: Group Relative Policy Optimization, an RL algorithm that normalizes advantages within a group of sampled outputs for the same input to stabilize training
DAPO: Decoupled Clip and Dynamic sAmpling Policy Optimization, a GRPO-style RL algorithm that learns from outcome rewards rather than step-by-step labels
RoPE: Rotary Positional Embeddings, a method for encoding position information in Transformers that allows for some length extrapolation
RULER: A benchmark for evaluating long-context capabilities of LLMs
O(N) complexity: Linear complexity, meaning computation time grows in direct proportion to input size rather than quadratically
Multi-Conv: Multi-Conversation, the authors' training approach in which the multiple context-independent conversations produced while processing one sample (the memory updates and the final answer) are optimized together from the same outcome reward
KV Cache: Key-Value Cache, stored intermediate states in a Transformer that allow it to avoid recomputing past tokens, usually growing with sequence length