SFT: Supervised Fine-Tuning—training a pre-trained model on labeled (question, reasoning path, answer) examples
ICL: In-Context Learning—prompting the model with a few examples at inference time without updating weights
RFT: Rejection Sampling Fine-Tuning—generating multiple solutions with the model, filtering for correct answers, and fine-tuning on these correct reasoning paths
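The RFT data-collection loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_solutions` is a hypothetical stub standing in for sampling k reasoning paths from the SFT model, and the answer format is assumed.

```python
import re

def sample_solutions(question, k=4):
    # Hypothetical stub: real code would sample k paths from the
    # SFT model with temperature > 0. Returns fixed candidates here.
    return [
        "3 + 4 = 7. The answer is 7",
        "3 * 4 = 12. The answer is 12",   # wrong answer, rejected
        "4 + 3 = 7. The answer is 7",
        "3 + 4 = 7. The answer is 7",     # exact duplicate, deduplicated
    ]

def extract_answer(solution):
    # Assumes solutions end with "The answer is <number>".
    m = re.search(r"The answer is (-?\d+)", solution)
    return m.group(1) if m else None

def collect_rft_data(question, gold_answer, k=4):
    """Keep sampled solutions whose final answer matches the gold
    answer, dropping duplicate reasoning paths."""
    kept, seen = [], set()
    for sol in sample_solutions(question, k):
        if extract_answer(sol) == gold_answer and sol not in seen:
            seen.add(sol)
            kept.append((question, sol))
    return kept

data = collect_rft_data("What is 3 + 4?", "7")
```

The retained (question, solution) pairs form the augmented fine-tuning set; the real filter would also normalize answers before comparison.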
GSM8K: Grade School Math 8K—a benchmark dataset of high-quality grade school math word problems
Distinct reasoning paths: Reasoning paths that use a unique sequence of equations to reach the solution, used to measure diversity
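One way to operationalize "same sequence of equations" is to extract the equations in order and compare them as a signature. The regex below is an illustrative assumption about the solution format, not the paper's exact extraction rule.

```python
import re

def equation_signature(solution):
    # Capture "a <op> b = c" style equations in order of appearance;
    # two solutions with equal signatures count as the same path.
    return tuple(re.findall(r"\d+\s*[-+*/]\s*\d+\s*=\s*\d+", solution))

a = "First, 2 + 3 = 5. Then 5 * 2 = 10. The answer is 10"
b = "We compute 2 + 3 = 5, and 5 * 2 = 10, so the answer is 10"
c = "Note 2 * 5 = 10 directly. The answer is 10"

assert equation_signature(a) == equation_signature(b)  # same path, different wording
assert equation_signature(a) != equation_signature(c)  # distinct path
```

Counting unique signatures over a pool of sampled solutions then gives the number of distinct reasoning paths.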
maj1@1: Accuracy of the top-1 greedy decoded answer
maj1@100: Accuracy using majority voting over 100 sampled reasoning paths
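The maj1@k metric can be sketched in a few lines: sample k final answers per question and predict the most frequent one (the sampled answers below are made up for illustration).

```python
from collections import Counter

def majority_vote(answers):
    # Return the most frequent sampled final answer.
    return Counter(answers).most_common(1)[0][0]

# k = 5 sampled final answers for one question (illustrative values).
sampled = ["7", "7", "12", "7", "9"]
prediction = majority_vote(sampled)
```

maj1@1 is the special case k = 1 with greedy decoding, so no vote is needed.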
DeepSpeed ZeRO3: A memory optimization technique for training large models by partitioning optimizer states, gradients, and parameters across GPUs
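In practice ZeRO stage 3 is enabled through the DeepSpeed JSON config; a minimal illustrative fragment (field values are example assumptions, not the paper's settings):

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": true },
  "train_micro_batch_size_per_gpu": 4
}
```

With `"stage": 3`, optimizer states, gradients, and model parameters are all partitioned across the data-parallel GPUs rather than replicated on each.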