SUPO: Summarization Augmented Policy Optimization—the proposed algorithm that jointly trains task execution and context summarization
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing multiple outputs for the same prompt, used as the base optimization method here
MDP: Markov Decision Process—a mathematical framework for modeling decision making, here extended to include summarization steps
Policy Gradient: An RL technique that optimizes the policy parameters by following the gradient of the expected reward
working context: The immediate token sequence visible to the model at any specific step; in SUPO, this is reset after summarization
rollout: A complete sequence of interactions from the initial prompt to the final answer (or failure)
effective context length: The total amount of history the agent can draw on across multiple summarized segments (working context length × number of summaries)
overlong masking: A technique that zeroes out gradients for rollouts that fail to finish within the allowed step or summary limits, preventing the model from learning to summarize indefinitely without solving the task
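
To make the GRPO and overlong masking entries concrete, here is a minimal sketch of how the two might combine for a single prompt. It is not the paper's implementation. The function name `group_relative_advantages`, the `finished` flag, and the `eps` constant are hypothetical. Computing the group mean and standard deviation over all rollouts, including the masked ones, is also an assumption.

```python
import numpy as np

def group_relative_advantages(rewards, finished, eps=1e-6):
    """Sketch of GRPO-style advantages with overlong masking.

    rewards:  one scalar reward per rollout sampled for the same prompt.
    finished: True if the rollout ended within the step/summary limits
              (hypothetical flag, not a name from the source).
    """
    rewards = np.asarray(rewards, dtype=float)
    finished = np.asarray(finished, dtype=bool)

    # Group-relative baseline: compare each rollout against its own group.
    # Assumption: statistics are taken over the whole group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Overlong masking: a zero advantage means the rollout contributes
    # no policy-gradient signal.
    return np.where(finished, advantages, 0.0)

# Four rollouts for one prompt; the last one ran past the limits.
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0], [True, True, True, False])
```

In this example the two successful rollouts receive a positive advantage, the finished failure receives a negative one, and the overlong rollout receives exactly zero, so it is neither rewarded nor penalized.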