GDRO: Group Distributionally Robust Optimization—an optimization framework that minimizes the worst-case loss across defined groups rather than the average loss
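The contrast with averaging can be made concrete. A minimal sketch (illustrative only, not the paper's exact objective): GDRO minimizes the maximum per-group loss, whereas standard training minimizes the mean.

```python
def gdro_loss(group_losses):
    """Worst-case objective: the largest loss over the defined groups."""
    return max(group_losses)

def average_loss(group_losses):
    """Standard average-loss objective, for contrast."""
    return sum(group_losses) / len(group_losses)
```

With group losses `[0.2, 0.9, 0.4]`, the average objective reports 0.5 while the GDRO objective focuses entirely on the worst group at 0.9.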
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing multiple outputs for the same prompt, eliminating the need for a separate value critic
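The group-relative advantage can be sketched as standardizing each rollout's reward against the other rollouts for the same prompt (a simplified sketch; the epsilon and the use of population std are assumptions, not details from this text):

```python
import statistics

def grpo_advantages(rewards):
    """Advantages for one prompt's group of rollouts: each reward is
    centered and scaled by the group's mean and standard deviation,
    so no learned value critic is needed."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]
```

Rollouts scoring above the group mean get positive advantages and are reinforced; those below get negative advantages.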
Pass@k: A metric measuring the probability that at least one of k generated solutions is correct
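A common unbiased estimator for this metric (the combinatorial form popularized for code evaluation; its use here is an assumption) computes, from n total samples with c correct, the probability that a random subset of k contains at least one correct solution:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), i.e. one minus the
    probability that all k drawn samples are incorrect."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k: a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```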
EMA: Exponential Moving Average—a statistical calculation giving more weight to recent data points
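A one-line sketch of the update (the smoothing factor 0.9 is illustrative, not a value from this text):

```python
def ema_update(prev, new, beta=0.9):
    """One EMA step: keep beta of the running estimate and blend in
    (1 - beta) of the new observation, so recent data dominates over time."""
    return beta * prev + (1.0 - beta) * new
```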
KL divergence: A measure of how one probability distribution differs from a second, reference probability distribution; used here to prevent the model from drifting too far from the base model
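For discrete distributions the definition is a short sum; a minimal sketch:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.
    Zero when p == q, positive otherwise; grows as p drifts from q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In the setting described here, p would be the fine-tuned policy's token distribution and q the base model's, with the divergence added as a penalty.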
Shadow price: In optimization, the marginal improvement in the objective from relaxing a constraint by one unit; here, it represents the value of assigning additional rollouts to a specific group
Zero-sum game: A situation where one agent's gain is exactly the other's loss; here, the adversary tries to maximize loss (find hard data) while the learner tries to minimize it
PPO: Proximal Policy Optimization—a standard RL algorithm that constrains policy updates to ensure stability
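The constraint on updates is implemented via PPO's clipped surrogate objective; a per-sample sketch with the common default eps=0.2:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample: the minimum of the unclipped
    and clipped ratio-weighted advantage, which caps how much a single
    update can move the policy."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```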
SFT: Supervised Fine-Tuning—training on labeled examples; here, the method is applied in a zero-SFT setting (direct RL on base model)
Hysteresis: A system property where changes lag behind input; used here to prevent prompts from rapidly oscillating between difficulty bins due to noise
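A two-threshold bin update illustrates the idea (the bin names and thresholds here are illustrative assumptions, not values from this text): a prompt only changes bins when its score crosses the far threshold for its current bin, so small fluctuations around a single cutoff cannot cause oscillation.

```python
def update_bin(current_bin, score, up=0.7, down=0.3):
    """Hysteresis bin update: move 'easy' -> 'hard' only when the score
    falls below `down`, and 'hard' -> 'easy' only when it rises above `up`.
    Scores in between leave the bin unchanged."""
    if current_bin == "easy" and score < down:
        return "hard"
    if current_bin == "hard" and score > up:
        return "easy"
    return current_bin
```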