CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer
SFT: Supervised Fine-Tuning—training a pre-trained model on a curated dataset of instruction–response pairs
Teacher-Student Distillation: Using a larger, stronger model (teacher) to generate training data for a smaller model (student)
OpenMath CoT: A concise reasoning format proposed in this paper that removes excessive verbiage found in standard Llama CoT traces
Nucleus Sampling: A decoding strategy that samples from the smallest set of top tokens whose cumulative probability exceeds a threshold p
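A minimal sketch of nucleus (top-p) sampling over a logit vector, using NumPy; the function name and signature are illustrative, not from the paper:

```python
import numpy as np

def nucleus_sample(logits, p=0.9, rng=None):
    """Sample a token index from the smallest set of top tokens whose
    cumulative probability exceeds p (nucleus / top-p sampling)."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())   # softmax, numerically stable
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]         # token indices, most probable first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1    # smallest k whose cumulative prob reaches p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize within nucleus
    return int(rng.choice(nucleus, p=nucleus_probs))
```

Lower p concentrates sampling on the most probable tokens; p = 1.0 recovers full sampling.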
Fair Downsampling: A sampling method ensuring all unique questions are represented as equally as possible when reducing dataset size
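One way to realize fair downsampling is a round-robin pass over questions, taking one solution per unique question per round until the budget is spent; this is a sketch of the idea, not necessarily the paper's exact procedure:

```python
import random
from collections import defaultdict

def fair_downsample(examples, target_size, seed=0):
    """Downsample (question, solution) pairs so every unique question keeps
    as equal a share of the budget as possible."""
    by_question = defaultdict(list)
    for question, solution in examples:
        by_question[question].append(solution)
    rng = random.Random(seed)
    for solutions in by_question.values():
        rng.shuffle(solutions)          # pick which solutions survive at random
    kept, round_idx = [], 0
    while len(kept) < target_size:      # one solution per question per round
        took_any = False
        for question, solutions in by_question.items():
            if round_idx < len(solutions):
                kept.append((question, solutions[round_idx]))
                took_any = True
                if len(kept) == target_size:
                    return kept
        if not took_any:                # every question exhausted
            break
        round_idx += 1
    return kept
```

Questions with many solutions lose more of them, while rarely-solved questions keep everything, so all unique questions stay represented.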
Decontamination: The process of removing training examples that are too similar to test set benchmarks to prevent unfair evaluation
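A simple decontamination check can flag training questions that share any word n-gram with a test question; real pipelines often use fuzzier similarity measures, so treat this as an illustrative baseline:

```python
def ngrams(text, n=8):
    """Set of word n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_questions, test_questions, n=8):
    """Drop any training question sharing an n-gram with the test set."""
    test_grams = set()
    for question in test_questions:
        test_grams |= ngrams(question, n)
    return [q for q in train_questions if not (ngrams(q, n) & test_grams)]
```

Smaller n makes the filter stricter (more near-duplicates removed) at the cost of more false positives.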
Rejection Sampling: Generating multiple solutions and keeping only those that reach the correct final answer