RLVR: Reinforcement Learning with Verifiable Rewards—RL where the reward is determined by a clear, objective criterion, like a correct math answer.
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing multiple outputs for the same prompt against their group average, removing the need for a critic model.
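The group-relative advantage estimate at the heart of GRPO can be sketched in a few lines. This is a minimal illustration, not the full algorithm (it omits the policy-gradient update and clipping); the normalization by the group standard deviation follows the common formulation.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each output's reward minus the group
    mean, scaled by the group standard deviation. No critic needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards an all-equal group

# Example: 4 sampled answers for one prompt; reward 1 = verified correct, 0 = wrong
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # correct answers get positive advantage, wrong ones negative
```

Note that if every sample in the group gets the same reward (pass rate 0 or 1), all advantages are zero and the prompt contributes no gradient—this is exactly why such prompts are often filtered out.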
Pass Rate: The probability p(x) that the current policy generates a correct answer for a given prompt x.
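In practice p(x) is not known exactly and is estimated by Monte Carlo: sample n completions and count how many the verifier accepts. A toy sketch, with a hypothetical stand-in policy and verifier:

```python
import random

def estimate_pass_rate(sample_answer, is_correct, n=64):
    """Monte Carlo estimate of p(x): the fraction of n sampled
    answers that the verifier marks correct."""
    return sum(is_correct(sample_answer()) for _ in range(n)) / n

# Hypothetical stand-ins: a "policy" that is right 30% of the time,
# and a verifier that just checks the flag it produced.
random.seed(0)
p_hat = estimate_pass_rate(lambda: random.random() < 0.3, lambda ok: ok)
print(p_hat)
```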
ZPD: Zone of Proximal Development—an educational theory stating that learning is most efficient on tasks that are neither too easy nor too difficult.
Reverse KL Divergence: A measure of the difference between the current policy π and the optimal policy π*, written KL(π ∥ π*); training progresses by minimizing this divergence (equivalently, maximizing its negative), pulling the policy toward the optimum.
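For intuition, here is the reverse KL computed over a small discrete distribution. The three-way answer distribution is made up for illustration; the key property is that the expectation is taken under the current policy, which makes the divergence mode-seeking.

```python
import math

def reverse_kl(pi, pi_star):
    """KL(pi || pi_star): expectation under the *current* policy pi of
    log(pi / pi_star). Mode-seeking: it heavily penalizes mass that pi
    places where pi_star is near zero."""
    return sum(q * math.log(q / p) for q, p in zip(pi, pi_star) if q > 0)

pi      = [0.7, 0.2, 0.1]  # current policy over 3 candidate answers (toy numbers)
pi_star = [0.5, 0.4, 0.1]  # hypothetical optimal policy
print(reverse_kl(pi, pi_star))  # positive; zero only when the policies match
```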
Asynchronous Sampling: A technique to generate replacement data samples in parallel while the main training loop continues, preventing bottlenecks when data is filtered out.
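A minimal sketch of the idea using a background thread and a bounded queue—the buffer, filter rule, and sample names here are all illustrative stand-ins, not any particular framework's API:

```python
import queue
import threading

def generator(out_q, stop):
    """Background worker: keeps the buffer topped up with fresh samples
    while the main training loop consumes them."""
    i = 0
    while not stop.is_set():
        try:
            out_q.put(f"sample-{i}", timeout=0.1)
            i += 1
        except queue.Full:
            pass  # buffer is full; retry until the consumer drains it

buf = queue.Queue(maxsize=8)           # bounded buffer between sampler and trainer
stop = threading.Event()
threading.Thread(target=generator, args=(buf, stop), daemon=True).start()

batch = []
while len(batch) < 4:
    s = buf.get()                      # a replacement is ready without stalling
    if s.endswith("-2"):               # stand-in for a filtered-out sample
        continue
    batch.append(s)
stop.set()
print(batch)
```

The point of the bounded queue is back-pressure: the sampler never races ahead of the trainer, and the trainer never blocks long when a sample is discarded.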
SFT: Supervised Fine-Tuning—training on labeled data before RL to provide a competent starting point.