GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of samples for the same input, eliminating the need for a critic model
SFT: Supervised Fine-Tuning—training the model on labeled (prompt, answer) pairs, typically before applying RL
On-policy: RL training where the data used for updates is generated by the current version of the model, so the policy-gradient estimate remains unbiased (unlike off-policy updates computed on stale samples)
Entropy collapse: A failure mode in RL where the model's policy becomes deterministic too quickly, losing exploration capabilities
CoT: Chain-of-Thought—a prompting or generation strategy where the model produces intermediate reasoning steps before the final answer
LiveCodeBench: A benchmark for evaluating code generation models on competitive programming problems, often using fresh problems to avoid contamination
AIME: American Invitational Mathematics Examination—a challenging math competition used as a benchmark for reasoning models
REINFORCE: A fundamental policy gradient algorithm in RL that updates model weights to maximize expected reward
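The GRPO entry above can be made concrete with a small sketch of its advantage estimate: sample a group of completions for one prompt, then normalize each completion's reward by the group's mean and standard deviation. This is a minimal illustration of the normalization step only (function name and epsilon are my own), not the full GRPO objective.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: center and scale each reward by the
    group's mean and std, so no separate critic model is needed."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# e.g. binary rewards for 4 sampled completions of the same prompt
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions that beat the group average get positive advantages and are reinforced; below-average ones get negative advantages, and the advantages sum to zero within each group.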
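The REINFORCE entry can likewise be sketched on a toy problem: a 2-armed bandit where the policy is a single logit and each update moves the weights along reward-weighted score-function gradients, ∇θ log π(a) · r. This is a hypothetical minimal example (all names and constants are my own), not how the algorithm would be wired into an LLM trainer.

```python
import math
import random

def reinforce_bandit(steps=2000, lr=0.1, seed=0):
    """REINFORCE on a 2-armed bandit: arm 1 pays reward 1, arm 0 pays 0.
    Policy: pi(1) = sigmoid(theta). Update: theta += lr * r * d/dtheta log pi(a)."""
    random.seed(seed)
    theta = 0.0
    for _ in range(steps):
        p1 = 1.0 / (1.0 + math.exp(-theta))   # probability of choosing arm 1
        a = 1 if random.random() < p1 else 0  # sample an action from the policy
        r = 1.0 if a == 1 else 0.0            # reward from the environment
        grad_logp = a - p1                    # d/dtheta log pi(a) for a Bernoulli policy
        theta += lr * r * grad_logp           # ascend the expected reward
    return 1.0 / (1.0 + math.exp(-theta))     # final P(choose the rewarding arm)
```

Because only arm 1 is ever rewarded, the final probability of choosing it approaches 1; the same reward-weighted log-probability gradient is what GRPO applies with group-normalized advantages in place of raw rewards.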