SFT: Supervised Fine-Tuning—training a pre-trained model on labeled (question, reasoning path, answer) examples
ICL: In-Context Learning—prompting the model with a few examples at inference time without updating weights
RFT: Rejection Sampling Fine-Tuning—generating multiple solutions with the model, filtering for correct answers, and fine-tuning on these correct reasoning paths
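The RFT data-collection loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_solutions` is a hypothetical stub standing in for sampling k reasoning paths from the SFT model, and the answer format is assumed.

```python
import re

def sample_solutions(question, k=4):
    # Hypothetical stub: real code would sample k paths from the
    # SFT model with temperature > 0. Returns fixed candidates here.
    return [
        "3 + 4 = 7. The answer is 7",
        "3 * 4 = 12. The answer is 12",   # wrong answer, rejected
        "4 + 3 = 7. The answer is 7",
        "3 + 4 = 7. The answer is 7",     # exact duplicate, deduplicated
    ]

def extract_answer(solution):
    # Assumes solutions end with "The answer is <number>".
    m = re.search(r"The answer is (-?\d+)", solution)
    return m.group(1) if m else None

def collect_rft_data(question, gold_answer, k=4):
    """Keep sampled solutions whose final answer matches the gold
    answer, dropping duplicate reasoning paths."""
    kept, seen = [], set()
    for sol in sample_solutions(question, k):
        if extract_answer(sol) == gold_answer and sol not in seen:
            seen.add(sol)
            kept.append((question, sol))
    return kept

data = collect_rft_data("What is 3 + 4?", "7")
```

The retained (question, solution) pairs form the augmented fine-tuning set; the real filter would also normalize answers before comparison.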
GSM8K: Grade School Math 8K—a benchmark dataset of high-quality grade school math word problems
Distinct reasoning paths: Reasoning paths that use a unique sequence of equations to reach the solution, used to measure diversity
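One way to operationalize "same sequence of equations" is to extract the equations in order and compare them as a signature. The regex below is an illustrative assumption about the solution format, not the paper's exact extraction rule.

```python
import re

def equation_signature(solution):
    # Capture "a <op> b = c" style equations in order of appearance;
    # two solutions with equal signatures count as the same path.
    return tuple(re.findall(r"\d+\s*[-+*/]\s*\d+\s*=\s*\d+", solution))

a = "First, 2 + 3 = 5. Then 5 * 2 = 10. The answer is 10"
b = "We compute 2 + 3 = 5, and 5 * 2 = 10, so the answer is 10"
c = "Note 2 * 5 = 10 directly. The answer is 10"

assert equation_signature(a) == equation_signature(b)  # same path, different wording
assert equation_signature(a) != equation_signature(c)  # distinct path
```

Counting unique signatures over a pool of sampled solutions then gives the number of distinct reasoning paths.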
maj1@1: Accuracy of the top-1 greedy decoded answer
maj1@100: Accuracy using majority voting over 100 sampled reasoning paths
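The maj1@k metric can be sketched in a few lines: sample k final answers per question and predict the most frequent one (the sampled answers below are made up for illustration).

```python
from collections import Counter

def majority_vote(answers):
    # Return the most frequent sampled final answer.
    return Counter(answers).most_common(1)[0][0]

# k = 5 sampled final answers for one question (illustrative values).
sampled = ["7", "7", "12", "7", "9"]
prediction = majority_vote(sampled)
```

maj1@1 is the special case k = 1 with greedy decoding, so no vote is needed.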
DeepSpeed ZeRO3: A memory optimization technique for training large models by partitioning optimizer states, gradients, and parameters across GPUs
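In practice ZeRO stage 3 is enabled through the DeepSpeed JSON config; a minimal illustrative fragment (field values are example assumptions, not the paper's settings):

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": true },
  "train_micro_batch_size_per_gpu": 4
}
```

With `"stage": 3`, optimizer states, gradients, and model parameters are all partitioned across the data-parallel GPUs rather than replicated on each.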