RLVR: Reinforcement Learning with Verifiable Rewards—training models using outcomes that can be automatically checked (e.g., math answers) rather than human preference labels.
Decontamination: The process of removing training examples that are identical or semantically similar to test set questions, so that benchmark scores reflect generalization rather than memorized answers.
Pass@1: A metric measuring the percentage of problems where the model's first generated answer is correct.
SFT: Supervised Fine-Tuning—training a model on a dataset of input-output pairs to teach it a specific behavior or format.
R1 Solutions: Reasoning paths generated by the DeepSeek-R1 model, used here as high-quality synthetic training data.
GPQA-Diamond: A difficult multiple-choice benchmark for graduate-level science and reasoning, used to test generalization beyond pure math.
RL: Reinforcement Learning—a training method where an agent learns to make decisions by receiving rewards or penalties for its actions.
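
As an illustration of the Pass@1 definition above, the metric can be sketched in a few lines. This is a minimal sketch, not the evaluation code behind any reported number: the function name `pass_at_1` and the exact-string comparison are assumptions made for illustration, and real math verifiers usually normalize answers (for example, treating equivalent fractions as equal) before comparing.

```python
from typing import Sequence


def pass_at_1(first_answers: Sequence[str], references: Sequence[str]) -> float:
    """Return the percentage of problems whose first generated answer is correct.

    Assumption: an answer counts as correct when it equals the reference
    after stripping surrounding whitespace. This is a stand-in for a real verifier.
    """
    if len(first_answers) != len(references):
        raise ValueError("need exactly one first answer per problem")
    if not references:
        raise ValueError("no problems to score")

    # Count problems where the first answer matches the reference answer.
    correct = sum(a.strip() == r.strip() for a, r in zip(first_answers, references))
    return 100.0 * correct / len(references)


# Hypothetical data: three problems, two of them answered correctly first time.
score = pass_at_1(["4", "7", "10"], ["4", "8", "10"])
print(f"Pass@1: {score:.1f}%")
```

Because the check is automatic, the same comparison could also serve as a verifiable reward signal of the kind RLVR relies on, although that reuse is only a suggestion here.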