RLVR: Reinforcement Learning with Verifiable Rewards—training LLMs using automatically checkable outcomes (like correct math answers) as reward signals
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a sample's reward to the group average, removing the need for a separate value network
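The group-relative advantage described above can be sketched in a few lines. This is an illustrative snippet, not an implementation of any particular library; the function name and the zero-std fallback are my own choices:

```python
def grpo_advantages(rewards):
    """Group-relative advantage: (reward - group mean) / group std.

    Illustrative sketch of GRPO's advantage estimate; the group of
    rewards comes from several samples for the same prompt.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:
        std = 1.0  # assumption: avoid division by zero when all rewards are equal
    return [(r - mean) / std for r in rewards]
```

For example, binary rewards `[1, 0, 1, 0]` yield advantages `[1.0, -1.0, 1.0, -1.0]`: samples above the group mean are reinforced, those below are penalized, with no value network involved.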
Pass@k: A metric measuring the probability that at least one correct solution is generated given k independent attempts
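In practice Pass@k is usually computed with the unbiased estimator of Chen et al. (2021): draw n ≥ k samples, count the c correct ones, and compute 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator: probability that at least one of k
    attempts is correct, given n samples of which c were correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k attempts
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For instance, with n=2 samples and c=1 correct, Pass@1 = 0.5.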
Capability Boundary Collapse: A phenomenon where an RL-tuned model improves on frequent queries (higher Pass@1) but, relative to the base model, loses the ability to solve diverse or hard queries (lower Pass@k)
SFT: Supervised Fine-Tuning—training a model on labeled examples
MIS: Multiple Importance Sampling—a technique to estimate properties of a target distribution using samples from multiple proposal distributions to reduce variance
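The MIS idea can be sketched with the standard balance heuristic: each sample is weighted by the target density divided by the average of all proposal densities. A toy example, assuming equal sample counts per proposal and illustrative Gaussian densities (all names here are hypothetical):

```python
import random
from math import exp, pi, sqrt

def gauss_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian; a toy stand-in for real policies."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def mis_estimate(f, proposals, n_per, target_pdf, rng):
    """Estimate E_target[f] from samples drawn out of several proposals.

    Balance heuristic: weight each sample x by
    target(x) / ((1/m) * sum_j q_j(x)), with m proposals and n_per
    samples from each.
    """
    m = len(proposals)
    total = 0.0
    for sample_fn, _pdf in proposals:
        for _ in range(n_per):
            x = sample_fn(rng)
            mixture = sum(pdf(x) for _, pdf in proposals) / m
            total += f(x) * target_pdf(x) / mixture
    return total / (m * n_per)
```

When every proposal already matches the target, the weights are exactly 1 and the estimator reduces to a plain Monte Carlo average; the variance reduction comes from the mixture density in the denominator damping samples that any single proposal would over- or under-weight.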
OOD: Out-of-Distribution—tasks or data that differ significantly from the training data
On-policy: RL methods that learn only from data generated by the current policy
Off-policy: RL methods that learn from data generated by other policies (e.g., historical data or external demonstrations)