GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a sample's reward to the group average of multiple samples from the same prompt, eliminating the need for a critic model
RLVR: Reinforcement Learning from Verifiable Rewards—training models using objective success signals (e.g., code compiles, math answer is correct) rather than human preference labels
Distillation: Training a smaller student model to mimic the outputs or reasoning traces of a larger, more capable teacher model
KL divergence: Kullback–Leibler divergence—an asymmetric measure of how one probability distribution differs from a reference distribution; often used as a penalty in RL to keep the updated policy close to a reference model (typically the initial model before RL)
Cold-start data: Supervised fine-tuning data used to initialize a model before RL, so that it already produces correct answers often enough for RL to have a learning signal to reinforce
Microbatches: Subsets of a batch used for gradient accumulation to handle memory constraints and variable sequence lengths
Pass@1: A metric measuring the fraction of problems the model solves with a single generated attempt; in practice often estimated by sampling several answers per problem and averaging their correctness
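The group-relative advantage at the heart of GRPO can be sketched in a few lines. This is a minimal illustration of the idea described above (reward compared to the group mean, normalized by the group standard deviation), not the full GRPO objective; the function name and the choice of four samples are illustrative.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage estimates: compare each sample's reward to
    the mean over all samples drawn from the same prompt, normalized by
    the group standard deviation -- no learned critic model required."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Four samples for one prompt, rewarded 1.0 if verifiably correct:
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# -> [1.0, -1.0, -1.0, 1.0]: correct samples are pushed up,
#    incorrect ones pushed down, relative to the group.
```

Because the baseline comes from the group itself, a prompt where every sample fails (or every sample succeeds) yields zero advantage for all samples and contributes no gradient signal.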
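For the KL-penalty entry, a small sketch of the divergence itself may help. This computes KL(p || q) for discrete distributions over a shared support; in RL fine-tuning, p and q would be the current and reference policies' token distributions (the example inputs here are arbitrary).

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for two discrete
    probability distributions given as lists over the same support.
    Zero when p == q, and grows as p drifts away from q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

kl_divergence([0.5, 0.5], [0.5, 0.5])  # identical distributions -> 0.0
kl_divergence([0.5, 0.5], [0.9, 0.1])  # positive: p has drifted from q
```

Note the asymmetry: KL(p || q) generally differs from KL(q || p), which is why the direction of the penalty matters in practice.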
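The pass@1 estimate mentioned above can be written out directly. A sketch, assuming per-problem lists of 0/1 correctness flags for independently scored samples; averaging over samples per problem, then over problems, estimates the probability that one attempt succeeds.

```python
def pass_at_1(per_problem_samples):
    """Estimate pass@1: each inner list holds 0/1 correctness flags for
    independent samples of one problem; average within each problem,
    then across problems."""
    per_problem = [sum(s) / len(s) for s in per_problem_samples]
    return sum(per_problem) / len(per_problem)

# Two problems, four samples each (1 = correct):
pass_at_1([[1, 1, 0, 1], [0, 0, 1, 0]])  # -> (0.75 + 0.25) / 2 = 0.5
```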