GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same input, avoiding the need for a separate value network
Dr. GRPO: GRPO Done Right—the authors' proposed unbiased variant of GRPO that removes response-length normalization and standard deviation division to recover the standard PPO objective
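As an illustration (this is a minimal sketch, not the paper's implementation), the difference between the two advantage estimators can be written as: GRPO normalizes each group reward by the group mean and standard deviation, while Dr. GRPO keeps only the mean subtraction (and, in the full loss, also drops the per-response length normalization):

```python
import statistics

def grpo_advantages(rewards):
    """GRPO: A_i = (r_i - mean) / std over a group of sampled responses."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

def dr_grpo_advantages(rewards):
    """Dr. GRPO: A_i = r_i - mean; the std division is removed
    (length normalization is likewise removed in the token-level loss)."""
    mu = statistics.mean(rewards)
    return [r - mu for r in rewards]

# Binary correctness rewards for 4 sampled responses to the same prompt
rewards = [1.0, 0.0, 1.0, 0.0]
print(grpo_advantages(rewards))     # [1.0, -1.0, 1.0, -1.0]
print(dr_grpo_advantages(rewards))  # [0.5, -0.5, 0.5, -0.5]
```

The std division inflates advantages for low-variance groups, which is one of the biases the Dr. GRPO variant removes.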
Aha moment: The phenomenon where a model self-corrects or reflects during generation (e.g., saying "Wait, let me recheck"), typically associated with advanced reasoning
SFT: Supervised Fine-Tuning—training a model on labeled examples (question-answer pairs)
PPO: Proximal Policy Optimization—a standard RL algorithm that constrains policy updates to ensure stability
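The "constrained updates" in PPO are usually realized via the clipped surrogate objective, which caps how far the new policy's probability ratio can move from the old policy. A per-sample sketch (simplified to scalars; real implementations operate on tensors):

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample:
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    `ratio` is pi_new(a|s) / pi_old(a|s); eps bounds the policy update."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

print(ppo_clipped_objective(1.5, 1.0))   # positive advantage: ratio clipped at 1.2
print(ppo_clipped_objective(0.5, -1.0))  # negative advantage: ratio clipped at 0.8
```

Taking the min makes the objective a pessimistic bound, so the optimizer gains nothing from pushing the ratio outside the clip range.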
Token efficiency: How much correct reasoning a model produces per generated token; a token-efficient model avoids unnecessarily long incorrect responses
Overthinking: A failure mode where reasoning models generate excessively long chains of thought without reaching a correct answer, often exacerbated by optimization bias