RLVR: Reinforcement Learning with Verifiable Rewards—RL where rewards come from deterministic checks (e.g., code passing execution tests or an exact-match math answer) rather than from a learned reward model
PPO: Proximal Policy Optimization—an RL algorithm that constrains policy updates to prevent destructive large steps
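A minimal sketch of PPO's clipped surrogate objective (function name and signature are illustrative, not from any specific library): the probability ratio between the new and old policy is clipped so that a single update cannot move the policy too far.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped loss.

    ratio:     pi_new(a|s) / pi_old(a|s), elementwise
    advantage: advantage estimate for each sample
    eps:       clip range; updates are bounded to [1-eps, 1+eps]
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Pessimistic (elementwise minimum) objective; negated to form a loss.
    return -np.minimum(unclipped, clipped)
```

With a positive advantage, a ratio of 1.5 is clipped to 1.2 (for eps=0.2), so the gradient stops encouraging the policy to move further in that direction.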
GRPO: Group Relative Policy Optimization—a critic-free RL algorithm that normalizes rewards within a group of outputs for the same prompt to estimate advantages
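The group-relative advantage estimate can be sketched as follows (a simplified illustration, not a full GRPO implementation): rewards for a group of responses to the same prompt are normalized by the group's mean and standard deviation, replacing the learned critic.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Critic-free advantage estimates for one prompt's group of responses.

    rewards: verifiable rewards for G sampled responses to the same prompt
    """
    r = np.asarray(rewards, dtype=float)
    # Normalize within the group: responses better than the group average
    # get positive advantages, worse ones get negative advantages.
    return (r - r.mean()) / (r.std() + eps)
```

For a group with rewards [1, 0, 1, 0] (e.g., two correct and two incorrect answers), the advantages are approximately [1, -1, 1, -1].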
off-policy drift: The discrepancy between the policy used to generate data (behavior policy) and the policy currently being trained (target policy), which can destabilize training
importance sampling: A statistical technique used to estimate properties of a target distribution using samples from a different distribution (used here to correct for off-policy drift)
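A minimal illustration of importance sampling (function name is hypothetical): samples drawn from the behavior policy are reweighted by the ratio of target to behavior probabilities, so the weighted average estimates an expectation under the target policy.

```python
import numpy as np

def importance_weighted_mean(f_values, logp_target, logp_behavior):
    """Estimate E_target[f] from samples drawn under the behavior policy.

    f_values:      f evaluated at each sample
    logp_target:   log-prob of each sample under the target policy
    logp_behavior: log-prob of each sample under the behavior policy
    """
    # Importance weight: pi_target(x) / pi_behavior(x), computed in log space
    # for numerical stability.
    weights = np.exp(np.asarray(logp_target) - np.asarray(logp_behavior))
    return np.mean(weights * np.asarray(f_values))
```

For example, with a uniform behavior policy over {0, 1} and a target policy assigning probabilities 0.25 and 0.75, one sample of each outcome yields weights [0.5, 1.5], and the weighted mean of f(x) = x recovers the target expectation 0.75.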
Async Ratio: A parameter defining the maximum allowable version gap between the training model and the rollout model
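The staleness check implied by this parameter might look like the following sketch (the function and its arguments are hypothetical, not a specific system's API): a rollout is accepted for training only if the policy version that generated it is not too far behind the current training version.

```python
def rollout_is_fresh(train_version, rollout_version, async_ratio):
    """Accept a rollout only if its generating policy's version lags the
    current training policy's version by at most async_ratio."""
    return train_version - rollout_version <= async_ratio
```

With async_ratio = 1, data generated one policy version ago is still usable, but data two versions old is discarded (or held back), bounding the off-policy drift the trainer must correct for.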
rollout: The phase in RL where the model interacts with the environment or generates text to create training data
long-tail distribution: A scenario where a small number of samples (responses) are significantly longer than average, causing disproportionate delays