SFPO: Slow-Fast Policy Optimization—the proposed update rule combining a fast trajectory of inner updates, a reposition step, and a slow correction
GRPO: Group Relative Policy Optimization—a policy gradient method that normalizes rewards within a group of outputs for the same prompt, eliminating the need for a value function critic
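The group-relative normalization at the heart of GRPO can be sketched as follows; this is a minimal illustration (function name and the `eps` stabilizer are my own, not from the source):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of rollouts for the same prompt
    (GRPO-style): subtract the group mean and divide by the group std,
    so no value-function critic is needed to form advantages."""
    g = len(rewards)
    mean = sum(rewards) / g
    # Population variance over the group of G rollouts.
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

For binary rewards such as `[1.0, 0.0, 1.0, 0.0]`, correct rollouts get advantage close to +1 and incorrect ones close to −1.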
Fast Trajectory: A sequence of multiple inner gradient updates performed on the same batch of data to stabilize the search direction
Reposition: An interpolation step that pulls the parameters back towards the initial on-policy point to control the distribution mismatch caused by inner updates
Slow Correction: A final gradient step applied after repositioning to align with local curvature
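The three steps above (fast trajectory, reposition, slow correction) can be sketched as one SFPO update on a toy parameter vector. This is a schematic, not the paper's implementation: the hyperparameter names (`lr_fast`, `lr_slow`, `alpha`, `k`) and the interpolation form `theta0 + alpha * (theta_k - theta0)` are assumptions for illustration.

```python
def sfpo_step(theta, grad_fn, lr_fast=0.1, lr_slow=0.1, alpha=0.5, k=3):
    """One SFPO-style update (hypothetical hyperparameters):
    k fast inner gradient steps on the same batch, a reposition
    interpolation back toward the on-policy starting point, then
    a single slow correction step."""
    theta0 = list(theta)
    # Fast trajectory: k inner updates on the same batch.
    for _ in range(k):
        g = grad_fn(theta)
        theta = [t - lr_fast * gi for t, gi in zip(theta, g)]
    # Reposition: pull parameters back toward the initial on-policy point.
    theta = [t0 + alpha * (t - t0) for t0, t in zip(theta0, theta)]
    # Slow correction: one final gradient step from the repositioned point.
    g = grad_fn(theta)
    theta = [t - lr_slow * gi for t, gi in zip(theta, g)]
    return theta

# Toy check on the quadratic loss 0.5*||theta||^2, whose gradient is theta.
theta = sfpo_step([1.0], grad_fn=lambda th: list(th))
```

On this quadratic, the parameter moves toward zero but less far than the raw fast trajectory would carry it, reflecting how repositioning limits the departure from the on-policy point.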
Pass@1: The percentage of problems where the model's first generated answer is correct
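As a concrete illustration, Pass@1 reduces to a simple fraction; the exact-match comparison below is a simplification (real evaluation typically normalizes or verifies answers):

```python
def pass_at_1(first_answers, references):
    """Fraction of problems whose first generated answer matches the
    reference answer (exact string match as a simplification)."""
    correct = sum(a == r for a, r in zip(first_answers, references))
    return correct / len(references)
```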
Rollout: The process of generating a complete sequence (reasoning chain + answer) from the policy given a prompt
On-policy: Learning from data generated by the current version of the policy (as opposed to old or historical data)
Off-policy drift: The discrepancy between the data distribution the model is learning from and the model's current policy distribution, which occurs when reusing data for multiple updates
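One common way to quantify this mismatch is the importance ratio between the current policy and the policy that generated the data; the sketch below (function and variable names are my own) computes it from summed token log-probabilities:

```python
import math

def importance_ratio(logp_current, logp_behavior):
    """Per-sequence importance weight pi_current(y|x) / pi_behavior(y|x),
    computed from summed token log-probs. Ratios far from 1 signal
    off-policy drift after reusing a batch for several updates."""
    return math.exp(logp_current - logp_behavior)
```

Right after sampling the ratio is exactly 1 (fully on-policy); as the policy is updated on the same batch, the ratio drifts away from 1.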