GRPO: Group Relative Policy Optimization—an RL algorithm that reduces variance by comparing outputs within a group rather than using a separate value network critic.
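The group-relative idea can be sketched in a few lines: each sampled output's reward is normalized against the mean and standard deviation of its own group, so no learned critic is needed. This is an illustrative sketch, not the paper's implementation; function and variable names are my own.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sample's reward against
    the group's mean and standard deviation, replacing a value-network
    critic as the baseline. (Illustrative sketch only.)"""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    # Small epsilon guards against a zero-variance group.
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Rewards for four sampled outputs of the same prompt:
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline is the group mean, the advantages sum to zero within each group, which is the variance-reduction effect the definition refers to.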
Pass@1: A metric measuring the percentage of problems where the model's first generated solution passes all test cases.
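For context, Pass@1 is the k=1 case of the standard unbiased pass@k estimator from the Codex evaluation convention, which estimates the probability that at least one of k samples (drawn from n generations, c of them correct) passes. A minimal sketch, assuming that convention:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
    n = total samples generated, c = samples passing all tests,
    k = samples "allowed" per problem."""
    if n - c < k:
        return 1.0  # fewer failures than k: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to the fraction of correct samples:
pass_at_k(10, 3, 1)  # 0.3
```

Averaging this quantity over all problems gives the reported Pass@1 percentage.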
SFT: Supervised Fine-Tuning—training a model on labeled examples (here, synthetic critiques) before applying reinforcement learning.
PPO: Proximal Policy Optimization—a standard RL algorithm, which this paper found unstable for this task due to credit-assignment difficulties.
Regression Rate: The frequency with which initially correct solutions are modified into incorrect ones during the refinement process.
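Concretely, the regression rate can be computed from parallel pass/fail records before and after refinement. A hypothetical helper, not from the paper:

```python
def regression_rate(before, after):
    """Fraction of initially correct solutions that become incorrect
    after refinement. `before` and `after` are parallel lists of
    booleans (True = passes all tests). Hypothetical helper."""
    pairs = [(b, a) for b, a in zip(before, after) if b]
    if not pairs:
        return 0.0  # nothing was correct to begin with
    return sum(1 for _, a in pairs if not a) / len(pairs)

# Three solutions start correct; two of them regress:
regression_rate([True, True, False, True], [True, False, False, False])
```

Note the denominator counts only initially correct solutions, so the metric isolates harm done by refinement rather than overall accuracy.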
Execution Feedback: Output from a code sandbox (e.g., error messages, test outputs) used to verify solution correctness.
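A minimal sketch of producing such feedback: run the candidate program in a subprocess and return its stdout on success or its traceback on failure. A real sandbox would add isolation and resource limits; this interface is an assumption, not the paper's harness.

```python
import subprocess
import sys
import tempfile

def run_candidate(code: str, timeout: float = 5.0) -> str:
    """Execute candidate Python code and capture textual execution
    feedback (output or error message). Minimal sketch of a sandbox."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=timeout)
        # Success: return the program's output; failure: the traceback.
        return proc.stdout if proc.returncode == 0 else proc.stderr
    except subprocess.TimeoutExpired:
        return "Timeout"
```

The returned string (e.g. a `ZeroDivisionError` traceback) is exactly the kind of signal a model can condition on when revising its solution.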
Test-time scaling: Improving model performance during inference (not training) by using more computation, such as generating multiple drafts or performing iterative revisions.
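The "multiple drafts" form of test-time scaling is often implemented as best-of-n selection: sample n candidates at inference and keep the one a scorer prefers. A sketch under assumed caller-supplied `generate` and `score` callables (hypothetical interface):

```python
def best_of_n(generate, score, prompt, n=8):
    """Test-time scaling via best-of-n sampling: spend more inference
    compute by drawing n candidates and returning the highest-scoring
    one. `generate` and `score` are caller-supplied callables."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

Iterative revision is the complementary strategy: instead of n independent drafts, the same budget is spent refining one draft using feedback such as the execution feedback defined above.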