RLVR: Reinforcement Learning with Verifiable Rewards—training models using outcomes (like passing unit tests) as ground-truth rewards rather than a learned reward model
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards across a group of sampled outputs for the same prompt to reduce variance, removing the need for a separate critic (value) model
SFT: Supervised Fine-Tuning—training a model on high-quality input-output pairs before RL
Entropy Expansion: A training phase designed to increase the randomness and diversity of the model's outputs to prevent it from getting stuck in repetitive failure patterns
Rollout: A single attempt by the model to generate a solution during RL training; '8 rollouts' means the model generates 8 different solutions for one problem to estimate gradients
Pass@k: A metric measuring the probability that at least one of k sampled solutions to a problem is correct
Curriculum Learning: Training on easier tasks first or organizing training data by difficulty to help the model learn progressively
MoE: Mixture-of-Experts—a model architecture where only a subset of sub-models (experts) is activated for each input, allowing large total parameter capacity with lower inference cost
Arena Learning: An iterative data selection method where a model is trained on subsets of data to identify and retain 'hard' samples that it consistently gets wrong
OJ: Online Judge—a system that automatically tests submitted code against hidden test cases (e.g., LeetCode, Codeforces)
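The group-relative normalization described under GRPO can be sketched in a few lines. The 8-rollout group and binary pass/fail rewards below are illustrative assumptions, not a fixed part of the algorithm:

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each rollout's reward against its own group.

    GRPO replaces a learned critic with a group baseline: all rollouts
    for one prompt are scored, and each advantage is the reward minus
    the group mean, divided by the group standard deviation.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:  # every rollout scored identically: no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# 8 rollouts for one prompt, verifiable binary reward (tests pass or not)
print(group_relative_advantages([1, 0, 0, 1, 0, 0, 0, 0]))
# passing rollouts get positive advantage, failing ones negative
```

Note that advantages within a group always sum to zero, which is why a prompt where all rollouts pass (or all fail) contributes no gradient.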
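A concrete way to see what entropy expansion targets is the Shannon entropy of the model's next-token distribution. The sketch below uses a plain softmax with a temperature knob as a simple stand-in for more exploratory sampling (an illustrative assumption, not the training-time mechanism): a flatter distribution carries higher entropy, i.e. more output diversity.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature flattens the
    distribution over next tokens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in nats; 0 for a deterministic distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [4.0, 2.0, 1.0, 0.5]
low = entropy(softmax(logits, temperature=1.0))
high = entropy(softmax(logits, temperature=2.0))
print(low < high)  # True: flatter distribution, higher entropy
```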
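Pass@k is usually estimated by drawing n samples per problem (n ≥ k) and applying the standard unbiased combinatorial estimator; a minimal sketch, with the 200-sample pool chosen purely for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples, c of them correct.

    P(at least one of k draws without replacement is correct)
        = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:  # fewer than k incorrect samples: success guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples per problem, 50 of which pass the hidden tests
print(pass_at_k(200, 50, 1))   # 0.25: equals the raw pass rate at k=1
print(pass_at_k(200, 50, 10))  # much higher with 10 attempts
```

Computing it this way, rather than literally picking k of the n samples, avoids the high variance of a single k-sized draw.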