CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer
TIR: Tool-Integrated Reasoning—interleaving natural language reasoning with executable code (e.g., Python) to perform precise calculations
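The interleaving in TIR can be sketched as follows. This is a minimal illustration, not any specific system's implementation: a toy reasoning trace contains one executable code step, and a helper (`run_code`, a name chosen here for illustration) executes it and feeds the printed result back into the narrative.

```python
import contextlib
import io


def run_code(code: str) -> str:
    """Execute a Python snippet and capture whatever it prints."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()


# A toy trace: natural-language reasoning with one executable code step.
reasoning_step = "Compute 17 * 23 exactly rather than estimating:"
code_step = "print(17 * 23)"
observation = run_code(code_step)  # the tool's output, fed back to the model
final_answer = f"{reasoning_step} the tool returns {observation}."
```

The key point is that arithmetic is delegated to the interpreter, so the final answer inherits the tool's precision rather than the model's.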
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same input, removing the need for a separate value function network
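The group-relative advantage at the heart of GRPO can be shown in a few lines. This is a sketch of the normalization step only (the reward values are made up): for one input, a group of outputs is scored, and each reward is standardized against the group's mean and standard deviation instead of a learned value baseline.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled outputs for the same prompt: two correct (1.0), two not (0.0).
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct outputs receive positive advantages and incorrect ones negative, and the advantages sum to roughly zero within each group.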
RM: Reward Model—a model trained to predict the correctness or quality of a generated response, used to guide RL and sampling
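One common way an RM guides sampling is best-of-N reranking. The sketch below assumes a placeholder `score` function standing in for a trained reward model (here it just rewards responses containing a boxed answer, a purely hypothetical proxy):

```python
def score(response: str) -> float:
    """Hypothetical stand-in for a learned reward model."""
    return 1.0 if "\\boxed{" in response else 0.0


def best_of_n(responses):
    """Return the candidate the reward model scores highest."""
    return max(responses, key=score)


candidates = ["The answer is 42.", "So the answer is \\boxed{42}."]
best = best_of_n(candidates)
```

In practice `score` would be a neural network trained on preference or correctness labels; the selection logic stays the same.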
SFT: Supervised Fine-Tuning—training the model on high-quality input-output pairs
Rejection Sampling: A method to generate training data by sampling many outputs from a model and keeping only those that are verified as correct
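A minimal sketch of the rejection-sampling loop, with model generation stubbed out: `sample_solutions` is a hypothetical placeholder returning canned strings where a real pipeline would sample from the model, and the verifier is a simple final-answer match against a gold answer.

```python
def sample_solutions(problem):
    """Hypothetical stub: real pipelines sample these from the model."""
    return ["... answer: 6", "... answer: 7", "... answer: 6"]


def extract_answer(solution: str) -> str:
    """Pull the final answer out of a solution string."""
    return solution.rsplit("answer:", 1)[-1].strip()


def rejection_sample(problem, gold_answer):
    """Keep only sampled solutions whose final answer is verified correct."""
    return [
        s for s in sample_solutions(problem)
        if extract_answer(s) == gold_answer
    ]


kept = rejection_sample("2 + 4 = ?", "6")  # retains the two correct samples
```

The retained solutions then become SFT-style training pairs, which is exactly the data RFT iterates on.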
RFT: Rejection Fine-Tuning—iterative fine-tuning on data generated via rejection sampling from the model itself
MuggleMath: A data-augmentation method that evolves and synthesizes new math problems (and solutions) from seed questions to expand training data
FastText: A library for efficient text classification and representation learning
MinHash: A technique for quickly estimating how similar two sets are, used for deduplicating data
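The MinHash estimate can be illustrated in a few lines. This is a simplified sketch, not a production deduplication setup: it uses Python's built-in `hash` salted with a per-permutation seed rather than a proper hash family, and the fraction of matching per-seed minima approximates the Jaccard similarity of two token sets.

```python
def minhash_signature(items, num_perm=64):
    """One minimum per salted hash function approximates one permutation."""
    return [min(hash((seed, x)) for x in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching minima estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = set("the cat sat on the mat".split())
b = set("the cat sat on a mat".split())
sig_a, sig_b = minhash_signature(a), minhash_signature(b)
sim = estimated_jaccard(sig_a, sig_b)  # close to the true Jaccard of a and b
```

For deduplication, two documents whose estimated similarity exceeds a threshold are treated as near-duplicates; comparing fixed-length signatures is far cheaper than comparing the full token sets.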