
Scaling LLM Multi-Turn RL with End-to-End Summarization-Based Context Management

M Lu, W Sun, W Du, Z Ling, X Yao, K Liu, J Chen
ByteDance Seed, Stanford University, Carnegie Mellon University
arXiv, 10/2025
Memory RL Agent

📝 Paper Summary

Linear memory RL-based tool use
SUPO enables reinforcement learning agents to solve tasks exceeding their context windows by treating context summarization as a trainable action within the policy optimization process.
Core Problem
In long-horizon tasks requiring many tool calls, the accumulated context quickly exceeds the model's context window, causing truncated history, degraded instruction following, and excessive rollout costs.
Why it matters:
  • Existing RL pipelines cannot train agents on tasks that require more steps than fit in a single context window, creating a fundamental scalability barrier
  • Simple context truncation or heuristic summarization methods are not optimized for the specific task, often discarding critical information needed for future steps
  • Longer contexts significantly increase rollout time, which becomes a bottleneck for training efficiency
Concrete Example: In a coding task requiring iterative comparisons of student heights, a standard agent fills its context window and loses track of its progress, forgetting which array index it was processing (e.g., index 5). SUPO learns to generate a summary explicitly stating 'The next step would be... starting with the pair (5,7)', preserving the exact state needed to continue.
Key Novelty
Summarization Augmented Policy Optimization (SUPO)
  • Treats summarization not as a fixed heuristic, but as a learnable action within the Markov Decision Process (MDP)
  • Splits long rollouts into multiple shorter 'complete trajectories' separated by summarization steps, where the gradient of the whole rollout is the sum of gradients from these sub-trajectories
  • Jointly optimizes the agent's ability to solve the task (reasoning/tool use) AND its ability to write useful summaries that retain critical state information
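The rollout splitting described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `SubTrajectory`, `rollout_with_summaries`, and `surrogate_objective` are hypothetical, context length is counted in list items rather than tokens, and a single rollout-level advantage shared by every sub-trajectory is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SubTrajectory:
    """One 'complete trajectory': a starting context plus what was generated from it."""
    start_context: List[str]
    steps: List[str] = field(default_factory=list)
    ended_with_summary: bool = False


def rollout_with_summaries(
    task_prompt: str,
    act: Callable[[List[str]], str],        # hypothetical policy call: context -> next step
    summarize: Callable[[List[str]], str],  # same policy, prompted to summarize (assumption)
    is_done: Callable[[str], bool],
    max_context: int,                       # working-context limit, counted in items here
    max_summaries: int,
    max_steps: int = 100,
) -> List[SubTrajectory]:
    """Split one long rollout into sub-trajectories separated by summarization steps."""
    context = [task_prompt]
    current = SubTrajectory(start_context=list(context))
    trajectories = [current]
    summaries_used = 0
    for _ in range(max_steps):
        if len(context) >= max_context:
            if summaries_used >= max_summaries:
                break  # summary budget exhausted: stop the rollout
            # The summary is itself an action and is stored with the
            # sub-trajectory that produced it, so it can receive gradient.
            summary = summarize(context)
            current.steps.append(summary)
            current.ended_with_summary = True
            summaries_used += 1
            # Reset the working context to the task prompt plus the summary.
            context = [task_prompt, summary]
            current = SubTrajectory(start_context=list(context))
            trajectories.append(current)
            continue
        step = act(context)
        context.append(step)
        current.steps.append(step)
        if is_done(step):
            break
    return trajectories


def surrogate_objective(logprobs_per_traj: List[List[float]], advantage: float) -> float:
    """Sum of per-sub-trajectory policy-gradient surrogates.

    Assumes one scalar advantage for the whole rollout; the gradient of this
    sum is the sum of the sub-trajectory gradients.
    """
    return sum(advantage * sum(logprobs) for logprobs in logprobs_per_traj)
```

In this sketch, each summary is appended to the sub-trajectory that produced it, so the summarization tokens are trained with the same objective as the reasoning and tool-use steps.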
Evaluation Highlights
  • +14.0% success rate on BrowseComp-Plus with SUPO, compared to a GRPO baseline with standard context management
  • Achieves higher performance with a significantly shorter working context (4K vs 32K on CodeGym), indicating that the learned summaries compress context effectively
  • Demonstrates test-time scaling: models trained with a limit of 2 summaries can generalize to use up to 23 summaries at test time, improving accuracy to 60.0%
Breakthrough Assessment
8/10
Provides a principled mathematical framework (policy gradient derivation) for end-to-end learned memory management in RL, addressing the critical context bottleneck in long-horizon agents.