RLVR: Reinforcement Learning with Verifiable Rewards—fine-tuning LLMs using binary feedback (correct/incorrect) from verifiable answers (e.g., math problems)
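A minimal sketch of a verifiable binary reward, assuming exact string match as the verifier (real verifiers typically normalize formatting or compare values numerically; the function name is illustrative):

```python
def rlvr_reward(model_answer: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 if the model's final answer
    matches the ground truth, else 0.0. Exact match after stripping
    whitespace is a simplification; production verifiers often parse
    and compare answers numerically or symbolically."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0
```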
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of outputs generated from the same prompt, removing the need for a separate value function
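A sketch of the group-relative advantage computation, assuming standard mean/std normalization within one prompt's group (function and variable names are illustrative):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each output's reward by the
    mean and standard deviation of its group (all outputs sampled from
    the same prompt), replacing a learned value-function baseline."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)  # population std over the group
    if sigma == 0:
        # All rewards equal (all correct or all wrong): no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Binary RLVR rewards for 4 sampled outputs of one prompt: 3 correct, 1 wrong.
advantages = grpo_advantages([1, 1, 0, 1])
```

Correct outputs get a positive advantage and the incorrect one a negative advantage, both scaled by how surprising they are within the group.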
Empirical Pass Rate: The proportion of correct answers generated for a specific prompt within a sampled group (denoted as µ), used as a proxy for sample difficulty
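For binary rewards, µ is simply the mean reward of the group (function name is illustrative):

```python
def empirical_pass_rate(group_rewards):
    """mu: fraction of correct (reward-1) outputs among those sampled
    for one prompt; a proxy for that prompt's difficulty (mu near 0 =
    hard, mu near 1 = easy)."""
    return sum(group_rewards) / len(group_rewards)
```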
Loss Scale Issue: The phenomenon where, under static weighting, training loss (and hence gradient contribution) clusters disproportionately at certain difficulty levels, so valid learning signals from samples at other difficulty levels are effectively drowned out
Clip-higher: A technique that makes the clipping range in the PPO-style objective asymmetric, raising the upper clipping bound above the lower one so that low-probability tokens have more room to increase in probability, which helps prevent entropy collapse
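A sketch of the asymmetric clipped objective for a single token, assuming decoupled lower/upper epsilons (the specific epsilon values below are illustrative, not taken from any particular paper's configuration):

```python
def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate (to be maximized) with a "clip-higher"
    asymmetric range: eps_high > eps_low leaves extra headroom for
    increasing the probability of currently low-likelihood tokens,
    countering entropy collapse. ratio = pi_new(token) / pi_old(token)."""
    clipped = min(max(ratio, 1 - eps_low), 1 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```

With symmetric clipping (eps_high = eps_low = 0.2) a ratio of 1.5 on a positive-advantage token would be capped at 1.2; raising eps_high to 0.28 lets it reach 1.28.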
Token-mean loss aggregation: A method of calculating loss by averaging per-token losses over all tokens in a batch rather than summing them per sequence, often used to stabilize training for variable-length reasoning chains
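A sketch of token-mean aggregation over variable-length sequences, assuming per-token losses are already computed (the function name is illustrative):

```python
def token_mean_loss(per_token_losses):
    """Aggregate by averaging over ALL tokens in the batch rather than
    summing per sequence: summing lets long reasoning chains dominate
    the batch loss, while a flat token mean weights every token equally
    regardless of which sequence it belongs to.
    per_token_losses: list of lists, one inner list per sequence."""
    flat = [loss for seq in per_token_losses for loss in seq]
    return sum(flat) / len(flat)

# A short and a long sequence contribute per token, not per sequence.
batch_loss = token_mean_loss([[1.0, 1.0], [2.0, 2.0, 2.0, 2.0]])
```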