GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of responses to the same prompt against their group average, removing the need for a critic model
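The group-relative advantage at the core of GRPO can be sketched as follows. This is a minimal illustration, not the full algorithm (it omits the policy-gradient update and clipping); the function name and the zero-std fallback are my own.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: each response's reward is compared
    against the mean of its group and normalized by the group's
    standard deviation, so no learned critic is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # fallback when all rewards are equal
    return [(r - mean) / std for r in rewards]

# Four sampled responses to one prompt, scored by a verifiable reward:
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

Correct responses get positive advantage and incorrect ones negative, purely from within-group comparison.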
RLVR: Reinforcement Learning from Verifiable Rewards—RL setting where the reward is based on a verifiable outcome (e.g., correct math answer) rather than a human preference model
SFT: Supervised Fine-Tuning—training on ground-truth reasoning traces before applying RL
Excess Length Reduction: A metric quantifying how much a method reduces the RL-induced increase in response length, measured relative to the original SFT model baseline
Token Efficiency: A metric defined as reward divided by response length, prioritizing high-reward answers that use fewer tokens
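As a quick illustration of the ratio (function name assumed for illustration):

```python
def token_efficiency(reward, length):
    """Token efficiency = reward / response length (in tokens):
    at equal reward, the shorter response scores higher."""
    return reward / length

# A 200-token correct answer beats an 800-token correct answer:
print(token_efficiency(1.0, 200) > token_efficiency(1.0, 800))  # → True
```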
t-digest: A probabilistic data structure for estimating quantiles (e.g., median, percentiles) from streaming data with low memory footprint
KL penalty: Kullback-Leibler divergence penalty—a regularization term ensuring the RL policy does not drift too far from the reference model
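A per-token KL penalty is often approximated from the log-probabilities the two models assign to the sampled tokens. A minimal sketch, assuming the simple log-ratio estimate KL ≈ log π(a) − log π_ref(a); the coefficient `beta` and function name are illustrative:

```python
def kl_penalty(policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token KL penalty, subtracted from the reward during RL.
    Uses the log-ratio estimate of KL divergence per sampled token;
    beta scales how strongly drift from the reference is punished."""
    return [beta * (lp - rp) for lp, rp in zip(policy_logprobs, ref_logprobs)]

# Where the policy assigns higher probability than the reference,
# the penalty is positive; where they agree, it is zero:
print(kl_penalty([-0.5, -1.0], [-1.5, -1.0]))  # → [0.1, 0.0]
```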
Pareto-optimal: A state where no metric (e.g., accuracy) can be improved without degrading another (e.g., length)