MathScale: Scaling Instruction Tuning for Mathematical Reasoning

📝 Paper Summary

Synthetic Data Generation · Mathematical Reasoning

MathScale improves the mathematical reasoning of LLMs by generating a large synthetic dataset (2 million QA pairs) from a concept graph extracted from seed questions, rather than by simple augmentation. It also introduces MwpBench, a benchmark that consolidates 10 math datasets.

Core Problem

Current mathematical instruction tuning relies on small datasets (like GSM8K/MATH) or limited augmentation methods (like rephrasing) that produce examples too similar to the original training set, restricting scalability.

Why it matters:

LLMs struggle with multi-step complex reasoning required for math, lagging behind their general problem-solving skills
Existing high-quality math datasets are tiny (e.g., GSM8K has only ~7.5K examples), limiting the effectiveness of fine-tuning
Augmentation methods like WizardMath or MetaMath are bounded by the small set of augmentation operations they apply to existing questions, so they generate repetitive data and fail to cover new concept combinations

Concrete Example: Augmentation methods might just change numbers or rephrase 'John has 5 apples' to 'John possesses 5 apples'. MathScale instead extracts 'Arithmetic' and 'Subtraction', then recombines them with 'Money' from a different problem to generate a fundamentally new question about buying items.

Key Novelty

MathScale (Concept-Graph-Based Generation)

Extracts high-level 'topics' and 'knowledge points' from seed questions to build a concept graph, abstracting away the specific question text
Simulates human 'connection forging' by performing random walks on this graph to create novel combinations of mathematical concepts
Prompts GPT-3.5 to generate new questions based on these novel concept combinations, resulting in 2 million diverse QA pairs significantly different from the seed data

Architecture

The MathScale data generation pipeline: from seed questions to concept extraction, graph construction, random walk sampling, and final question generation.

Evaluation Highlights

MathScale-7B achieves 35.0% micro-average accuracy on MwpBench, a 42.9% relative improvement (+10.5 points) over the best equivalent-sized open-source peer (MetaMath-7B, 24.5%)
MathScale-Mistral-7B reaches performance parity with GPT-3.5-Turbo on MwpBench micro and macro averages
Achieves state-of-the-art performance across all 10 datasets in MwpBench compared to open-source models of equivalent size

Breakthrough Assessment

8/10

Strong methodological contribution in synthetic data generation (concept graph vs. simple rephrasing), leading to large gains over strong baselines (roughly 43% relative improvement in MwpBench micro-average accuracy at the 7B scale). Also contributes a significant consolidated benchmark.

⚙️ Technical Details

Problem Definition

Setting: Mathematical Instruction Tuning and Evaluation

Inputs: Natural language math question q

Outputs: Step-by-step reasoning chain and final answer a

Pipeline Flow

Concept Extraction (GPT-3.5 extracts topics/KPs from seed data)
Concept Graph Construction (Build weighted graph from co-occurrences)
Concept Composition (Random walk sampling to get new topic/KP sets)
Question Generation (GPT-3.5 generates QA pairs from sampled concepts)
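The two middle steps (graph construction and concept composition) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `build_concept_graph` and `random_walk`, the use of raw co-occurrence counts as edge weights, the toy seed extractions, and the choice to drop repeated nodes from the walk are all assumptions made for the example. In the actual pipeline the extracted concepts come from GPT-3.5, and the sampled concept set is passed to a question-generation prompt.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_concept_graph(extractions):
    """Undirected co-occurrence graph.

    graph[u][v] counts how many seed questions mention both u and v.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for concepts in extractions:
        for u, v in combinations(sorted(set(concepts)), 2):
            graph[u][v] += 1
            graph[v][u] += 1
    return graph


def random_walk(graph, start, steps, rng):
    """Weighted random walk; returns the distinct concepts visited, in order."""
    path = [start]
    current = start
    for _ in range(steps):
        neighbors = graph.get(current)
        if not neighbors:
            break  # isolated node: nothing to walk to
        nodes = list(neighbors)
        weights = [neighbors[n] for n in nodes]
        current = rng.choices(nodes, weights=weights, k=1)[0]
        if current not in path:
            path.append(current)
    return path


# Hypothetical concept extractions, one list per seed question.
seed_extractions = [
    ["Arithmetic", "Subtraction"],
    ["Money", "Unit price", "Subtraction"],
    ["Arithmetic", "Money"],
]

graph = build_concept_graph(seed_extractions)
sampled = random_walk(graph, "Arithmetic", steps=3, rng=random.Random(0))
print(sampled)  # a concept set that may not co-occur in any single seed question
```

Because the walk can cross from one seed question's concepts into another's, the sampled set can combine concepts that never appeared together in the seed data, which is the property the method relies on.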

System Modules

Concept Extractor (Data Generation)

Extract meta-information (topics and knowledge points) from seed questions

Model or implementation: GPT-3.5-Turbo-0613

Graph Constructor (Data Generation)

Build a graph to model relationships between mathematical concepts

Model or implementation: Deterministic algorithm

Concept Sampler (Data Generation)

Generate novel combinations of concepts via random walks

Model or implementation: Graph Random Walk Algorithm

Question Generator (Data Generation)

Synthesize new math problems based on sampled concepts

Model or implementation: GPT-3.5-Turbo-0613

Novel Architectural Elements

Concept-Graph-Driven Generation: Using a graph of extracted concepts to guide the generation of synthetic data, ensuring concept diversity rather than just textual diversity

Modeling

Base Model: LLaMA-2-7B, LLaMA-2-13B, Mistral-7B

Training Method: Supervised Fine-Tuning (Instruction Tuning)

Objective Functions:

Purpose: Minimize the negative log-likelihood of the target tokens.

Formally: Standard Cross-Entropy Loss
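Written out with the notation from the problem definition (question q, target response a), a standard form of this objective is shown below. The parameter symbol θ, the token index t, and the choice to sum only over response tokens are assumptions; the source states only that a standard cross-entropy loss is used.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{|a|} \log p_\theta\left(a_t \mid q,\, a_{<t}\right)
```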

Adaptation: Full fine-tuning

Trainable Parameters: All parameters

Training Data:

MathScaleQA (2M synthetic pairs)
MwpBench training set (20K pairs)

Key Hyperparameters:

batch_size: 128
epochs: 3
learning_rate: 2e-5

Compute: Not reported in the paper

Comparison to Prior Work

vs. WizardMath: MathScale generates new questions from concept combinations rather than modifying existing question text, allowing for greater scalability.
vs. MetaMath: MathScale is less dependent on the original training examples' phrasing, whereas MetaMath variations are semantically similar to seeds.
vs. MAmmoTH: MathScale relies on a massive synthesized dataset (2M) rather than just aggregating existing datasets.

Limitations

A GPT-4-based validation step was found ineffective and removed, so some synthetic solutions may be incorrect.
Relies on proprietary GPT-3.5 for data generation, which may change over time.
Evaluation is limited to zero-shot accuracy; few-shot performance not extensively analyzed.

Reproducibility

The authors plan to open-source the evaluation framework. The prompt templates for concept extraction and question generation are provided in the paper. The exact seed questions are from the training set of MwpBench (public datasets). Code URL is not provided in the paper text.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation using greedy decoding.

Benchmarks:

MwpBench (Mathematical Reasoning (Word Problems)) [New]
GSM8K (Grade School Math)
MATH (Competition Math)
GaokaoBench-Math (Chinese College Entrance Exam Math)
CollegeMath (College-level Math Textbooks) [New]

Metrics:

Accuracy (Micro Average)
Accuracy (Macro Average)
Statistical methodology: Not explicitly reported in the paper
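The difference between the two accuracy metrics is easiest to see on a small example. The sketch below uses the usual definitions (micro average pools all test examples; macro average is the unweighted mean of per-dataset accuracies). The function name and the toy counts are hypothetical and are not taken from the paper.

```python
def micro_macro_accuracy(per_dataset):
    """per_dataset maps dataset name -> (num_correct, num_examples)."""
    total_correct = sum(c for c, _ in per_dataset.values())
    total_examples = sum(n for _, n in per_dataset.values())
    micro = total_correct / total_examples  # every example counts equally
    macro = sum(c / n for c, n in per_dataset.values()) / len(per_dataset)  # every dataset counts equally
    return micro, macro


# Hypothetical counts: one large, hard dataset and one small, easy dataset.
toy_counts = {"large_set": (50, 100), "small_set": (9, 10)}
micro, macro = micro_macro_accuracy(toy_counts)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

With these toy counts the micro average is pulled toward the large dataset, while the macro average gives the small dataset the same weight, which is why a benchmark with datasets of very different sizes tends to report both.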

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Comparison of 7B models on MwpBench (Micro Avg and Macro Avg): MathScale-7B vs. the best open-source 7B baseline.
MwpBench (Micro Avg)	Accuracy	24.5	35.0	+10.5
MwpBench (Macro Avg)	Accuracy	26.1	37.5	+11.4
Comparison of MathScale-Mistral-7B against GPT-3.5-Turbo on MwpBench.
MwpBench (Micro Avg)	Accuracy	41.8	42.0	+0.2
MwpBench (Macro Avg)	Accuracy	44.6	45.0	+0.4
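The Δ column above is an absolute difference in accuracy points, whereas the 42.9% figure quoted in the evaluation highlights is a relative gain over the baseline. The short sketch below recomputes both from the table values; the row labels and the `gains` helper are illustrative names, not from the paper.

```python
rows = [
    ("7B models, micro avg", 24.5, 35.0),
    ("7B models, macro avg", 26.1, 37.5),
    ("Mistral vs GPT-3.5, micro avg", 41.8, 42.0),
    ("Mistral vs GPT-3.5, macro avg", 44.6, 45.0),
]


def gains(baseline, ours):
    """Return (absolute gain in points, relative gain in percent), rounded to 1 decimal."""
    absolute = ours - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


for label, baseline, ours in rows:
    absolute, relative = gains(baseline, ours)
    print(f"{label}: +{absolute} points, +{relative}% relative")
```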

Experiment Figures

Schematic of the Concept Graph structure showing relationships between Topics and Knowledge Points.

Main Takeaways

MathScale effectively scales up math reasoning data, allowing 7B models to significantly outperform other open-source models of the same size.
The concept-graph approach generates diverse data that generalizes well, even to out-of-domain test sets like GaokaoBench-Math where no training data was seen.
The proposed MwpBench reveals that models optimized for GSM8K/MATH (like MetaMath) may generalize less well to broader math tasks than MathScale.

📚 Prerequisite Knowledge

Prerequisites

Instruction Tuning (Fine-tuning LLMs on instruction-following data)
Chain-of-Thought (CoT) prompting
Graph theory basics (nodes, edges, random walks)

Key Terms

MwpBench: A new benchmark proposed in this paper comprising 10 distinct math datasets (K-12 to college level) with a unified evaluation protocol

MathScaleQA: The synthetic dataset of 2 million math question-answer pairs generated by the MathScale pipeline

Concept Graph: A graph where nodes are math topics/knowledge points and edges represent co-occurrence in seed questions, used to sample new concept combinations

GSM8K: A popular dataset of grade school math word problems

MATH: A dataset of challenging competition-level mathematics problems

Knowledge Points (KPs): Fine-grained math concepts (e.g., 'Pythagorean theorem') extracted from questions

Topics: High-level mathematical subjects (e.g., 'Geometry', 'Algebra') extracted from questions

Fuzzy Match: An answer verification method that matches predicted answers to ground truth even if formatted slightly differently (e.g., allowing for minor text variations)
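To make the idea of fuzzy matching concrete, here is a minimal sketch of one way such a check could work. The paper's exact matching rules are not given in this summary, so everything here is an assumption: the `normalize` and `fuzzy_match` names, the specific normalizations (case, whitespace, dollar signs, thousands separators, trailing period), and the numeric tolerance.

```python
import re


def normalize(answer):
    """Strip formatting that does not change the answer's meaning (assumed rules)."""
    s = answer.strip().lower()
    s = s.replace("$", "").replace(",", "")
    s = re.sub(r"\s+", " ", s).rstrip(".")
    return s


def fuzzy_match(predicted, gold, tol=1e-6):
    """True if the answers agree as normalized text or as numbers within tol."""
    p, g = normalize(predicted), normalize(gold)
    if p == g:
        return True
    try:
        return abs(float(p) - float(g)) <= tol
    except ValueError:
        return False  # at least one side is not a plain number


print(fuzzy_match("$1,200.00", "1200"))
print(fuzzy_match("7", "8"))
```

A real evaluation harness would likely also need to handle fractions, units, and LaTeX-formatted answers, which this sketch deliberately leaves out.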

Greedy Decoding: A decoding strategy that always selects the highest probability token at each step, eliminating randomness
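Greedy decoding can be illustrated with a toy next-token scorer. The sketch below is a simplified stand-in for a real LLM: the `greedy_decode` function, the `toy_table` of scores, and the `<eos>` token name are hypothetical, and a real model would return scores over its full vocabulary conditioned on the whole prefix rather than only the last token.

```python
def greedy_decode(next_token_scores, prompt, eos, max_new_tokens=16):
    """Repeatedly append the single highest-scoring next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)  # dict: candidate token -> score
        best = max(scores, key=scores.get)  # argmax, no sampling
        tokens.append(best)
        if best == eos:
            break
    return tokens[len(prompt):]


# Hypothetical scorer that only looks at the last token.
toy_table = {
    "Q": {"4": 0.7, "5": 0.3},
    "4": {"<eos>": 0.9, "4": 0.1},
}


def toy_scores(tokens):
    return toy_table[tokens[-1]]


print(greedy_decode(toy_scores, ["Q"], eos="<eos>"))
```

Because no sampling is involved, repeated runs with the same scorer and prompt produce the same output, which makes zero-shot accuracy numbers easier to reproduce.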