RLVR: Reinforcement Learning with Verifiable Rewards—an RL training stage using binary rewards based on ground-truth correctness (e.g., correct math answer) rather than a reward model
SFT: Supervised Finetuning—training the model on prompt-completion pairs to learn instruction following
DPO: Direct Preference Optimization—a method to align models to preferences without an explicit reward model loop, using pairs of preferred/rejected responses
On-policy data: Training data generated by the current version of the model being trained, as opposed to 'off-policy' data generated by other models
Decontamination: The process of removing training examples that overlap with evaluation benchmarks to ensure fair testing
PPO: Proximal Policy Optimization—an RL algorithm used here for the RLVR stage
IFEval: Instruction Following Evaluation—a benchmark testing a model's ability to follow verifiable constraints (e.g., 'no capitalization')
GSM8K: Grade School Math 8K—a benchmark of grade-school level math word problems
MMLU: Massive Multitask Language Understanding—a general knowledge benchmark covering 57 subjects
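The RLVR entry above describes a binary reward computed from ground-truth correctness rather than a learned reward model. A minimal sketch in Python, assuming a hypothetical convention where the model writes its final result after an "Answer:" marker (the extraction rule is illustrative, not the actual implementation):

```python
def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary RLVR-style reward: 1.0 if the extracted final answer
    matches the ground truth exactly, else 0.0."""
    # Hypothetical extraction: take the text after the last "Answer:" marker.
    # If the marker is absent, the whole completion is compared (and fails).
    answer = completion.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0
```

In practice the matching step is usually more robust (e.g., normalizing numbers or parsing math expressions), but the reward stays binary: correct or not.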
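The DPO entry can be made concrete with its per-pair loss: DPO pushes the policy's log-probability margin between preferred and rejected responses above the reference model's margin, with no explicit reward model. A sketch with scalar sequence log-probabilities (variable names are illustrative):

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably via log1p
    return math.log1p(math.exp(-margin))
```

When the policy matches the reference, the margin is zero and the loss is log 2; widening the policy's preference margin beyond the reference's drives the loss toward zero.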
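What makes an IFEval constraint "verifiable" is that it can be checked mechanically, with no judge model. A toy checker for the 'no capitalization' constraint mentioned above (the function name is an assumption, not IFEval's API):

```python
def check_no_capitalization(response: str) -> bool:
    """Verifiable IFEval-style constraint: the response must contain
    no uppercase letters anywhere."""
    return not any(ch.isupper() for ch in response)
```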