RLVR: Reinforcement Learning with Verifiable Rewards—fine-tuning LLMs using binary rewards based on whether the final answer matches the ground truth
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same input to reduce variance, often used without a separate critic model
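The group-relative normalization in GRPO can be sketched in a few lines. This is a minimal illustrative sketch (not the full algorithm, which also involves a clipped policy-gradient objective and a KL term): given the binary rewards for a group of G sampled outputs of the same prompt, the per-sample advantage is the reward minus the group mean, divided by the group standard deviation; `group_relative_advantages` and `eps` are names chosen here for illustration.

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Compute group-normalized advantages for one prompt's sample group.

    rewards: list of scalar (e.g., binary 0/1) rewards for G outputs
    sampled from the same prompt. Returns (r - mean) / (std + eps)
    for each output -- no separate critic model is needed.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a group with rewards [1, 0, 0, 1] yields advantages of roughly [+1, -1, -1, +1]: correct outputs are pushed up, incorrect ones down, and the advantages sum to zero within the group, which is what reduces gradient variance.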
Accuracy: The probability of generating a correct answer in a single attempt (pass@1)
Capability: The probability that a correct answer exists in the model's output distribution (approximated by pass@k with large k, e.g., k=256)
Pass@k: A metric measuring the probability that at least one correct answer is generated out of k independent samples
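Pass@k is usually estimated without bias from n ≥ k samples using the combinatorial estimator of Chen et al. (the Codex paper): if c of n samples are correct, pass@k = 1 - C(n-c, k) / C(n, k). A small sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator.

    n: total samples drawn per problem
    c: number of correct samples among them
    k: budget of attempts being evaluated
    Returns the probability that at least one of k samples is correct.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to c / n (accuracy as defined above); with large k (e.g., k = 256) it approximates the capability definition above.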
Distillation: Supervised fine-tuning of a student model on outputs generated by a teacher model (or itself)
In-distribution: Questions where the model has a non-negligible probability (e.g., > 1%) of generating a correct answer
Self-Distillation: Fine-tuning a model on its own correct responses to valid problems
Rejection Sampling: A method of filtering generated data to keep only the samples that meet a given criterion (e.g., a correct final answer) for training
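As a sketch of how rejection sampling builds a fine-tuning set (the function and callback names `rejection_sample`, `generate`, and `is_correct` are illustrative, not from the source): sample several responses per prompt and keep only the verified ones.

```python
def rejection_sample(prompts, generate, is_correct, n_samples=8):
    """Collect (prompt, response) pairs that pass a verifier.

    generate(prompt) -> response samples one model output;
    is_correct(prompt, response) -> bool checks the final answer.
    Only verified pairs are kept for supervised fine-tuning.
    """
    kept = []
    for prompt in prompts:
        for _ in range(n_samples):
            response = generate(prompt)
            if is_correct(prompt, response):
                kept.append((prompt, response))
    return kept
```

Self-distillation, as defined above, is the special case where `generate` is the model being trained and the kept pairs are fed back into its own supervised fine-tuning.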