Common 7B Language Models Already Possess Strong Math Capabilities

📝 Paper Summary

Mathematical Reasoning | Data Synthesis for LLMs | Supervised Fine-Tuning (SFT)

Standard small language models already possess latent math abilities that are masked by generation instability, which can be unlocked by scaling supervised fine-tuning with large amounts of synthetic data.

Core Problem

Small language models (e.g., LLaMA-2 7B) are widely believed to lack strong math capabilities without extensive pre-training, often showing low accuracy on benchmarks like MATH.

Why it matters:

Current beliefs suggest only massive models (>50B parameters) or math-specific pre-training can achieve high performance, limiting accessibility and efficiency.
The 'instability issue' means a model can often produce the correct answer somewhere among many sampled generations (high Pass@N) but fails to output it consistently on a single attempt (low Pass@1), leaving that capability unused.
Scaling supervised fine-tuning is typically limited by the scarcity of high-quality, publicly available math datasets.

Concrete Example: On the MATH benchmark, a LLaMA-2 7B model achieves only 7.9% accuracy with a single generation (Pass@1) but 72.0% when any of 256 sampled generations may be correct (Pass@256), which suggests it can solve many of these problems but cannot do so reliably.

Key Novelty

Xwin-Math (Scaling Synthetic SFT)

Demonstrates that base LLaMA-2 models have high 'Pass@256' accuracy, indicating latent capability, but suffer from instability (low Pass@1).
Uses GPT-4 Turbo to generate synthetic math questions at large scale (up to 960K), derived from existing datasets such as GSM8K and MATH.
Applies a 'verify-then-generate' pipeline where synthetic questions are validated by the generator itself before being used for large-scale SFT.

Architecture

Scaling curves of accuracy on GSM8K and MATH benchmarks as a function of SFT data size, comparing real vs. synthetic data.

Evaluation Highlights

LLaMA-2 7B achieves 82.6% on GSM8K using 960K synthetic samples, outperforming the previous best 7B baseline (68.4%) by 14.2 percentage points.
LLaMA-2 7B reaches 40.6% on the difficult MATH benchmark, surpassing the previous state-of-the-art 7B model (19.8%) by 20.8 percentage points.
LLaMA-2 70B achieves 90.6% on GSM8K and 52.8% on MATH, outperforming GPT-4-0314 on the MATH benchmark.

Breakthrough Assessment

8/10

Significantly shifts the perspective on small model capabilities, showing that data scaling in SFT—not just pre-training—can unlock strong reasoning. Achieves SOTA for open-source models at the time.

⚙️ Technical Details

Problem Definition

Setting: Mathematical reasoning tasks where a model generates a step-by-step Chain-of-Thought (CoT) solution followed by a final answer.

Inputs: Math problem text q

Outputs: Reasoning path (CoT) and final answer a
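
To make the input/output contract concrete, here is a minimal grading sketch. It assumes, purely for illustration, that each generated solution ends with the phrase "The answer is <value>"; that marker and the helper names `extract_final_answer` and `is_correct` are hypothetical and are not taken from the paper.

```python
from typing import Optional

ANSWER_MARKER = "The answer is"  # assumed output convention, not from the paper


def extract_final_answer(solution: str) -> Optional[str]:
    """Return the text after the last answer marker, or None if it is missing."""
    idx = solution.rfind(ANSWER_MARKER)
    if idx == -1:
        return None
    tail = solution[idx + len(ANSWER_MARKER):].strip()
    if not tail:
        return None
    first_line = tail.splitlines()[0]
    # Drop a trailing period and thousands separators before comparing.
    cleaned = first_line.rstrip(".").strip().replace(",", "")
    return cleaned or None


def is_correct(solution: str, reference: str) -> bool:
    """Compare the extracted answer a with a reference answer."""
    pred = extract_final_answer(solution)
    if pred is None:
        return False
    ref = reference.strip().replace(",", "")
    try:
        # Numeric comparison when both sides parse as numbers.
        return abs(float(pred) - float(ref)) < 1e-6
    except ValueError:
        # Otherwise fall back to exact string match.
        return pred == ref
```

A real evaluation harness for MATH would need more careful normalization (fractions, LaTeX expressions, units); this sketch only covers plain numeric or string answers.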

Pipeline Flow

Data Synthesis Phase: Reference Question → GPT-4 Generation → Verification → CoT Generation
Training Phase: Base Model + Synthetic Data → SFT → Xwin-Math Model
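
The data synthesis phase above can be sketched as a three-step loop. This is an illustrative outline only: `llm` stands in for any text-completion callable (the paper uses GPT-4 Turbo), and the prompt wording and the VALID/INVALID reply protocol are assumptions made for this sketch, not the prompts from the paper's appendix.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


@dataclass
class SyntheticExample:
    question: str
    solution: str


def synthesize(reference_question: str, llm: LLM) -> Optional[SyntheticExample]:
    # Step 1: propose a new question based on a reference question.
    question = llm(
        f"Write a new math question inspired by:\n{reference_question}"
    ).strip()

    # Step 2: verification by attempting the question (assumed reply protocol).
    verdict = llm(
        "Try to solve this question. Reply VALID if it is solvable, "
        f"else INVALID.\n{question}"
    )
    if not verdict.strip().upper().startswith("VALID"):
        return None  # discard questions that fail verification

    # Step 3: generate a step-by-step (CoT) solution for the verified question.
    solution = llm(f"Solve step by step:\n{question}").strip()
    return SyntheticExample(question=question, solution=solution)


def build_dataset(references: List[str], llm: LLM) -> List[SyntheticExample]:
    """Run the synthesis loop over all reference questions, keeping valid ones."""
    examples = []
    for ref in references:
        example = synthesize(ref, llm)
        if example is not None:
            examples.append(example)
    return examples
```

Passing the model as a plain callable keeps the sketch independent of any particular API client, so it can be exercised with a stub during testing.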

System Modules

Question Generator (Data Synthesis)

Create new math questions based on a reference question

Model or implementation: GPT-4 Turbo

Verifier (Data Synthesis)

Filter low-quality generated questions by attempting to solve them

Model or implementation: GPT-4 Turbo

CoT Generator (Data Synthesis)

Produce step-by-step solutions for verified questions

Model or implementation: GPT-4 Turbo

Math Reasoner

Solve math problems using learned reasoning patterns

Model or implementation: LLaMA-2 (7B/13B/70B), Mistral-7B, or Llemma-7B

Modeling

Base Model: LLaMA-2 (7B, 13B, 70B), Mistral-7B, Llemma-7B

Training Method: Supervised Fine-Tuning (SFT)

Trainable Parameters: Full model parameters

Training Data:

Synthetic GSM8K data: scaled up to 960K samples
Synthetic MATH data: scaled up to 480K samples
Source: Generated via GPT-4 Turbo based on training sets of original benchmarks
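
As a rough illustration of how a synthetic question/solution pair might be turned into an SFT training example, the sketch below builds token ids and labels with the prompt positions masked out. The prompt template, the prompt masking, and the `tokenize` callable are all assumptions for illustration; the summary does not state how the authors format or mask their data. The 2048 limit matches the max_token_length listed under Key Hyperparameters.

```python
from typing import Callable, Dict, List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_sft_example(
    question: str,
    solution: str,
    tokenize: Callable[[str], List[int]],  # hypothetical tokenizer interface
    max_len: int = 2048,
) -> Dict[str, List[int]]:
    # Assumed prompt template; the real template is not given in the summary.
    prompt_ids = tokenize(f"Question: {question}\nAnswer: ")
    response_ids = tokenize(solution)

    # Concatenate prompt and response, then truncate to the length limit.
    input_ids = (prompt_ids + response_ids)[:max_len]

    # Only response tokens contribute to the loss (a common SFT choice,
    # assumed here rather than confirmed by the source).
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}
```

In practice the `tokenize` argument would be replaced by the base model's own tokenizer; a character-level stand-in is enough to check the masking and truncation logic.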

Key Hyperparameters:

optimizer: Adam
learning_rate: 2e-5 (7B/13B/70B), 2e-6 (Mistral)
lr_schedule: Cosine with 4% linear warm-up
epochs: 3
batch_size: Not explicitly reported in the paper
max_token_length: 2048
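
For reference, a cosine schedule with 4% linear warm-up can be written as a small function of the training step. This is a sketch under two assumptions not stated in the summary: the warm-up starts from zero, and the cosine phase decays to zero at the final step. The function name `lr_at_step` is illustrative.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    peak_lr: float = 2e-5,
    warmup_frac: float = 0.04,
) -> float:
    """Learning rate at a given optimizer step."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear warm-up from 0 to peak_lr (assumed starting point).
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 (assumed floor).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

With 1,000 total steps, the first 40 steps ramp up linearly, the peak is reached at step 40, and the rate then follows a half-cosine down toward zero.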

Compute: 8x Nvidia H100 GPUs. Max resource usage: 1900 H100 GPU hours for 70B model on 960K data.
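
To put the GPU-hour figure in perspective, the short calculation below converts it to wall-clock time. It assumes the full 1900 GPU hours were spent on a single 8-GPU node with all GPUs busy throughout, which the summary does not state explicitly.

```python
def wall_clock_hours(gpu_hours: float, num_gpus: int) -> float:
    """Wall-clock hours if all GPUs run in parallel for the whole job."""
    return gpu_hours / num_gpus


hours = wall_clock_hours(1900, 8)  # 237.5 hours
days = hours / 24                  # just under 10 days
```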

Comparison to Prior Work

vs. MuggleMath: Scales data much further (960K vs. typically ~30K-100K) using simpler generation prompting.
vs. MetaMath: Uses GPT-4 Turbo for higher-quality synthesis at a larger scale, achieving significantly higher accuracy (+20.8 percentage points on MATH at the 7B scale).
vs. WizardMath: Achieves comparable or better results purely via SFT data scaling without complex Reinforcement Learning pipelines.

Limitations

Relies on closed-source GPT-4 Turbo for data synthesis, which incurs API costs.
Accuracy on extremely complex problems still trails behind state-of-the-art closed models (GPT-4) for smaller model sizes.
Performance gains plateau on GSM8K Pass@256, suggesting an upper bound on the capability of base models.
Requires large-scale computational resources for fine-tuning on nearly 1 million samples.

Reproducibility

Code: https://github.com/Xwin-Math

Synthetic data generation prompts are provided in Appendix A. Exact batch sizes are not specified. Data generation depends on the closed-source GPT-4 Turbo API.

📊 Experiments & Results

Evaluation Setup

Greedy decoding for main results; Temperature 0.7 for Pass@N analysis.
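
The difference between the two decoding modes can be illustrated on a single next-token distribution. The sketch below is a toy example over a list of logits, not the evaluation code used in the paper; the function names are illustrative.

```python
import math
import random
from typing import List, Optional


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_token(logits: List[float]) -> int:
    """Greedy decoding: always pick the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_token(
    logits: List[float],
    temperature: float = 0.7,
    rng: Optional[random.Random] = None,
) -> int:
    """Temperature sampling: draw a token from the scaled distribution."""
    rng = rng or random.Random()
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

Greedy decoding yields one deterministic answer per question, which suits a Pass@1 accuracy number, while sampling at temperature 0.7 produces the varied generations that Pass@N analysis needs.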

Benchmarks:

GSM8K (Grade school math word problems)
MATH (Competition-level math problems)
SVAMP (Elementary math problems; OOD test)
ASDiv (Diverse math problems; OOD test)
Hungarian National High School Exam (Challenging exam problems; OOD test)

Metrics:

Accuracy (Pass@1)
Pass@256
PassRatio@256
Statistical methodology: Not explicitly reported in the paper
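
The two sampling-based metrics can be computed from a table of per-sample correctness flags. The sketch below assumes `results[i][j]` is True when the j-th sampled generation for question i is correct; the function names are illustrative.

```python
from typing import Sequence


def pass_at_n(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions with at least one correct sample among N."""
    return sum(any(samples) for samples in results) / len(results)


def pass_ratio_at_n(results: Sequence[Sequence[bool]]) -> float:
    """Average fraction of correct samples per question (stability)."""
    per_question = [sum(samples) / len(samples) for samples in results]
    return sum(per_question) / len(results)
```

A model with high Pass@N but low PassRatio@N solves many questions occasionally yet rarely does so consistently, which is the instability pattern the paper describes.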

Key Results

Main results showing SFT data scaling leads to state-of-the-art performance on LLaMA-2 7B.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy	68.4	82.6	+14.2
MATH	Accuracy	19.8	40.6	+20.8
GSM8K	Accuracy	81.6	82.6	+1.0
GSM8K	Accuracy	83.5	90.6	+7.1
MATH	Accuracy	42.5	52.8	+10.3

Pass@N analysis reveals the instability issue in base models.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Pass@256	48.2	97.7	+49.5

Experiment Figures

Comparison of Pass@256 (potential) and PassRatio@256 (stability) as training data size increases.

Breakdown of error types (Calculation vs. Reasoning) on GSM8K as data scale increases.

Main Takeaways

Scaling synthetic SFT data (up to ~1M samples) yields linear or super-linear improvements in accuracy, with no clear saturation at that scale.
The primary bottleneck for small models is 'instability' (low PassRatio) rather than capability (Pass@N); SFT scaling fixes this stability.
Synthetic data generated by GPT-4 Turbo is nearly as effective as real data for training math reasoners.
Calculation errors are mitigated faster than reasoning errors as SFT data scale increases.

📚 Prerequisite Knowledge

Prerequisites

Supervised Fine-Tuning (SFT) for LLMs
Chain-of-Thought (CoT) prompting
Pass@N vs Pass@1 evaluation metrics

Key Terms

SFT: Supervised Fine-Tuning—training a pre-trained base model on labeled examples to follow instructions or learn specific formats

CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer

Pass@N: A metric measuring the probability that at least one correct answer is found within N generated samples for a given question

PassRatio@N: The percentage of correct answers within N generated samples, used here to measure the stability of the model's generation

Instability Issue: The phenomenon where a model frequently generates incorrect answers despite having the capability to generate the correct one (high potential, low reliability)

GSM8K: A benchmark dataset of high-quality grade school math word problems

MATH: A benchmark dataset of challenging competition-level mathematics problems