RLVR: Reinforcement Learning with Verifiable Rewards—training models using binary feedback (correct/incorrect) from an automatic verifier on final answers, rather than human-annotated reasoning steps or preference labels
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of outputs generated from the same prompt, removing the need for a value function
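A minimal sketch of the group-relative advantage estimate, assuming the common mean/standard-deviation normalization with a small epsilon for stability (function and variable names here are illustrative, not a specific library's API):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and std.

    `rewards` holds one scalar reward per sampled output for a
    single prompt. Exact normalization details vary by implementation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, binary verifier rewards:
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# correct answers get positive advantages, incorrect ones negative
```

Because advantages are centered within the group, they sum to (approximately) zero, which is what removes the need for a learned value function.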
learning cliff: A phenomenon where a model consistently fails a set of hard problems, leading to zero reward variance and zero gradients, effectively stopping learning on those examples
scaffolding: A pedagogical concept applied here as temporary, hierarchical support (hints) that helps the model solve problems it couldn't solve independently
advantage: A value measuring how much better taking a specific action is than the policy's average behavior in the same state
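In standard notation (Q the action-value function, V the state-value function; these symbols are conventional, not defined elsewhere in this glossary):

```latex
A(s, a) = Q(s, a) - V(s)
```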
on-policy: RL methods where the data used for updates is generated by the current policy being optimized
off-policy: RL methods that use data generated by a different policy (e.g., a teacher model or older version of the current model)
pass@1: The percentage of problems where the model generates the correct answer on its first attempt
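A toy estimator to make the metric concrete (an illustrative sketch, not an official evaluation harness; in practice pass@1 is often estimated by averaging over several samples per problem):

```python
def pass_at_1(first_attempt_correct):
    """Fraction of problems whose first sampled answer is correct.

    `first_attempt_correct` is one boolean per problem.
    """
    return sum(first_attempt_correct) / len(first_attempt_correct)

rate = pass_at_1([True, False, True, True])  # 3 of 4 -> 0.75
```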
importance sampling: A statistical technique used to estimate properties of a distribution using samples from a different distribution, often requiring correction weights
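A small worked example of the correction weights: estimating an expectation under a target distribution p using samples drawn from a different proposal q, reweighting each sample by p(x)/q(x) (distributions here are made up for illustration):

```python
import random

random.seed(0)

# Target p: uniform over {0, 1, 2, 3}; proposal q: skewed toward 0.
p = [0.25, 0.25, 0.25, 0.25]
q = [0.40, 0.30, 0.20, 0.10]

# Estimate E_p[x] from q-samples via weights p(x)/q(x).
samples = random.choices(range(4), weights=q, k=100_000)
estimate = sum(x * p[x] / q[x] for x in samples) / len(samples)
# true value: E_p[x] = 1.5
```

In off-policy RL the same idea corrects for the mismatch between the policy that generated the data and the policy being updated.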
KL divergence: A measure of how one probability distribution differs from a second, reference probability distribution
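A minimal sketch for discrete distributions, showing the two defining properties used in RL fine-tuning (zero iff the distributions match, and asymmetric in general); the function name is illustrative:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as probability lists.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0
    contribute nothing by convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

d = kl_divergence([0.5, 0.5], [0.9, 0.1])  # positive: distributions differ
```

In RLHF/RLVR pipelines, a KL penalty against a frozen reference model is the usual way to keep the fine-tuned policy from drifting too far.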