RLVR: Reinforcement Learning with Verifiable Rewards—using binary success/failure feedback (like passing unit tests) to train models
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages from a group of responses without a value critic
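The group-relative advantage at the heart of GRPO can be sketched as follows: each response's reward is normalized by its group's mean and standard deviation, with no learned value critic. This is a minimal illustration (function name is ours; the full algorithm also uses a PPO-style clipped ratio and a KL penalty, omitted here):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each reward by the group's
    mean and standard deviation instead of using a value critic."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

For example, binary rewards `[1, 0, 1, 0]` yield advantages `[1.0, -1.0, 1.0, -1.0]`: correct responses are pushed up, incorrect ones down, relative to the group.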
Gradient Gap: A theoretical quantity measuring the difference in expected score functions between high-reward and low-reward response regions
Length Normalization: Scaling the gradient update by 1/length to prevent long responses from causing unstable, high-variance updates
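A minimal sketch of length normalization in a policy-gradient loss, assuming per-token log-probabilities are already available (`token_logprobs` and `advantage` are illustrative placeholders, not a specific library's API):

```python
def length_normalized_loss(token_logprobs, advantage):
    """Policy-gradient loss term scaled by 1/length, so a long response
    contributes the same total gradient magnitude as a short one."""
    return -advantage * sum(token_logprobs) / len(token_logprobs)
```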
KL divergence: A measure of how much the trained policy deviates from the reference policy; often used as a penalty in RLHF
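For discrete distributions over a shared support, the divergence can be computed directly (an illustrative helper; in RLHF practice it is typically estimated per-token from policy and reference log-probabilities):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support.
    Zero when p == q, and always non-negative."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```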
Pass@1: The probability that a single generated response is correct
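Pass@1 is the k=1 case of the widely used unbiased pass@k estimator: draw n samples, count c correct, and compute the probability that at least one of k drawn samples is correct. A sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct.
    For k=1 this reduces to c / n, the fraction of correct responses."""
    if n - c < k:
        return 1.0  # too few incorrect samples for an all-incorrect draw
    return 1.0 - comb(n - c, k) / comb(n, k)
```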
PPO: Proximal Policy Optimization—a standard RL algorithm that clips updates to ensure stability
REINFORCE: A fundamental policy gradient algorithm that updates probabilities proportional to rewards
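For a softmax policy over logits, the score function has the closed form ∇_θᵢ log π(a) = 1[i = a] − πᵢ, so a single REINFORCE step can be sketched in a few lines (a toy categorical policy, not a production implementation):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(logits, action, reward, lr=0.1):
    """One REINFORCE update: theta_i += lr * reward * (1[i == action] - pi_i),
    i.e. step size times reward times the score function."""
    pi = softmax(logits)
    return [t + lr * reward * ((1.0 if i == action else 0.0) - p)
            for i, (t, p) in enumerate(zip(logits, pi))]
```

With a positive reward, the chosen action's logit rises and the others fall, which is exactly "updating probabilities proportional to rewards."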
Lipschitz continuous: A smoothness condition ensuring functions don't change too rapidly, crucial for convergence proofs
Score function: The gradient of the log-probability of a response under the policy, ∇_θ log π_θ(y|x)
Policy Gradient: A class of algorithms that optimize a policy by differentiating the expected reward objective
Step size: The learning rate or magnitude of the parameter update in an optimization step