SFT: Supervised Fine-Tuning—training a pre-trained model on labeled examples (prompts and responses) to follow instructions
General SFT: The first training stage in this paper, focusing on diverse topics (coding, general knowledge) to build a reasoning foundation
Math SFT: The second training stage, focusing exclusively on mathematical problem-solving data
rm@8: A metric evaluating the performance of 'Best-of-N' sampling; specifically, generating 8 candidate solutions and selecting the highest-scoring one according to a reward model
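A minimal sketch of the rm@8 selection step. The sampler and reward model below are toy placeholders (a real system would call an LLM and a trained outcome reward model); only the select-the-top-scored-candidate logic is the point.

```python
import random

# Hypothetical stand-in for an LLM sampler: deterministic per seed.
def sample_solution(prompt: str, seed: int) -> str:
    rng = random.Random(seed)
    return f"solution-{rng.randint(0, 999)} to {prompt!r}"

# Toy scorer standing in for a learned reward model's scalar output.
def reward_score(prompt: str, solution: str) -> float:
    return (sum(ord(c) for c in solution) % 100) / 100.0

def best_of_n(prompt: str, n: int = 8) -> str:
    """rm@n: draw n candidates, keep the one the reward model ranks highest."""
    candidates = [sample_solution(prompt, seed=i) for i in range(n)]
    return max(candidates, key=lambda s: reward_score(prompt, s))
```

Accuracy is then measured on the selected solution only, so rm@8 upper-bounds what reranking with this reward model can recover from 8 samples.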
pass@1: The accuracy of the model when it generates a single response to a problem
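For contrast with rm@8, pass@1 is plain single-attempt accuracy; a one-function sketch (function name is my own):

```python
def pass_at_1(responses, answers):
    """pass@1: fraction of problems answered correctly with one attempt each."""
    return sum(r == a for r, a in zip(responses, answers)) / len(answers)
```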
In-breadth evolution: A synthetic data generation technique that creates new prompts by varying the topic or setting of a seed prompt while maintaining similar difficulty
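A toy illustration of the in-breadth idea. Real pipelines prompt an LLM to rewrite the seed; the setting list, template, and function name here are made-up placeholders showing only the "same structure, new topic" transformation:

```python
import random

# Hypothetical settings to swap into a seed prompt.
SETTINGS = ["a bakery", "a train station", "a soccer match"]

def evolve_in_breadth(seed_prompt: str, old_setting: str, rng: random.Random) -> str:
    """Produce a new prompt with the same structure and difficulty but a different setting."""
    new_setting = rng.choice([s for s in SETTINGS if s != old_setting])
    return seed_prompt.replace(old_setting, new_setting)

rng = random.Random(0)
seed = "At a bakery, 3 customers each buy 4 rolls. How many rolls are sold?"
print(evolve_in_breadth(seed, "a bakery", rng))
```

The arithmetic (3 × 4) is untouched, so the evolved prompt tests the same skill at the same difficulty in a new context.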
Data decontamination: Removing training samples that overlap significantly with the test set to prevent the model from memorizing answers
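One common decontamination heuristic is word n-gram overlap against the test set; a self-contained sketch under that assumption (the paper's exact overlap criterion may differ):

```python
def ngrams(text: str, n: int) -> set:
    """All word n-grams of a lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_samples, test_samples, n=8):
    """Drop any training sample sharing at least one n-gram with the test set."""
    test_grams = set()
    for t in test_samples:
        test_grams |= ngrams(t, n)
    return [s for s in train_samples if not (ngrams(s, n) & test_grams)]
```

Larger n (e.g. 8 or more words) catches near-verbatim copies while sparing samples that merely discuss similar topics.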
Outcome reward model: A model trained to predict whether a final answer is correct, often used to rerank generated solutions
Greedy decoding: A generation strategy where the model always picks the most likely next token, resulting in a deterministic output
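The argmax loop behind greedy decoding, with a hand-built lookup table standing in for a model's forward pass (vocabulary and logits are invented for illustration):

```python
# Toy next-token "model": maps a context string to logits over a tiny vocab.
VOCAB = ["the", "cat", "sat", "<eos>"]

def next_token_logits(context: str) -> list:
    table = {
        "": [0.1, 2.0, 0.3, 0.0],
        "cat": [0.0, 0.1, 3.0, 0.2],
        "cat sat": [0.0, 0.0, 0.1, 4.0],
    }
    return table.get(context, [0.0, 0.0, 0.0, 5.0])

def greedy_decode(max_steps: int = 10) -> list:
    """Always take the argmax token; no sampling, so the output is deterministic."""
    tokens = []
    while len(tokens) < max_steps:
        logits = next_token_logits(" ".join(tokens))
        best = max(range(len(VOCAB)), key=lambda i: logits[i])  # argmax
        if VOCAB[best] == "<eos>":
            break
        tokens.append(VOCAB[best])
    return tokens

print(greedy_decode())  # → ['cat', 'sat'], identical on every run
```

Because there is no randomness, pass@1 under greedy decoding is reproducible, which is why papers often report it alongside sampled metrics like rm@8.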