RLVR: Reinforcement Learning with Verifiable Rewards—training LLMs using outcomes that can be automatically checked (e.g., code compilation, correct math answer).
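The "automatically checked" part can be as simple as comparing a model's final answer against a known gold answer. A minimal sketch of such a verifier, assuming a toy last-number extraction check (the function name and regex are illustrative, not from any specific RLVR implementation):

```python
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the last number in the completion
    matches the gold answer, else 0.0."""
    # Extract every number-like token; the final one is taken as the answer.
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not matches:
        return 0.0
    return 1.0 if matches[-1] == gold_answer else 0.0

print(verifiable_reward("So the answer is 42.", "42"))  # 1.0
print(verifiable_reward("I believe it's 7.", "42"))     # 0.0
```

Real verifiers (unit tests for code, symbolic equivalence for math) are more robust, but the key property is the same: the reward is computed by a program, not a learned model.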
Fill-in-the-Middle (FIM): A training objective where the model predicts a missing span of text given the surrounding prefix and suffix.
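Concretely, a FIM example is often built by rearranging the document so the middle span comes last, letting a standard left-to-right model learn to infill. A sketch of the common prefix-suffix-middle layout (the sentinel strings here are placeholders; real tokenizers define their own special tokens):

```python
def make_fim_example(prefix: str, middle: str, suffix: str) -> str:
    """Rearrange (prefix, middle, suffix) into a single training string:
    the model sees prefix and suffix first, then predicts the middle."""
    # <PRE>/<SUF>/<MID> are illustrative sentinels, not a specific vocabulary.
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"

example = make_fim_example(
    prefix="def add(a, b):\n",
    middle="    return a + b\n",
    suffix="\nprint(add(1, 2))",
)
print(example)
```

At inference time the model is prompted with everything up to `<MID>` and generates the missing span.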
Distractors: Incorrect options in a multiple-choice question designed to be plausible enough to confuse a model that isn't reasoning correctly.
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing multiple outputs for the same prompt, used here as the training recipe.
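The "group relative" part means each completion's advantage is its reward measured against the other completions sampled for the same prompt, with no learned value function. A minimal sketch of that advantage estimate (mean/std normalization as in the GRPO formulation; the function name is illustrative):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each completion = (reward - group mean) / group std,
    computed over a group of completions for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All rewards equal: no signal to distinguish completions.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four sampled completions for one prompt, two correct and two wrong:
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

These per-completion advantages then weight a clipped policy-gradient update, much as in PPO but without a critic network.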
FineWeb: A large-scale, high-quality dataset of web text used for LLM pre-training.
ProRL: Prolonged Reinforcement Learning—a specific RL training recipe (and resulting model series) centered on extended training with verifiable rewards.
AoPS: Art of Problem Solving—a forum and curriculum for high-difficulty mathematics, often used as a source for reasoning data.