MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

📝 Paper Summary

Mathematical Reasoning, Instruction Tuning, Tool Use

MAmmoTH enhances mathematical reasoning in open-source LLMs by training on a curated hybrid dataset (MathInstruct) that combines natural language reasoning (CoT) with executable code generation (PoT).

Core Problem

Open-source LLMs lag significantly behind closed-source models such as GPT-4 in mathematical reasoning. Existing fine-tuning methods such as WizardMath improve accuracy on the datasets they target but hurt generalization to out-of-domain tasks.

Why it matters:

Current dataset-specific fine-tuning creates 'specialist' models that fail on broader math tasks (e.g., improving GSM8K but degrading AQuA accuracy)
Chain-of-Thought (CoT) struggles with precise computation and complex algorithms, while Program-of-Thought (PoT) fails on abstract reasoning problems that no library API supports
Bridging the gap between open-source and closed-source models requires a lightweight approach that doesn't rely on expensive continued pre-training on 100B+ tokens

Concrete Example: CoT prompts struggle to solve quadratic equations precisely, often making arithmetic errors. Pure PoT prompts fail on abstract algebra where no standard Python library exists. MAmmoTH switches between these strategies to solve both types effectively.
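To make the contrast concrete, here is a minimal sketch of the kind of program a PoT rationale might contain for a quadratic equation. The equation and the helper name `solve_quadratic` are illustrative assumptions, not taken from the paper.

```python
import cmath
from typing import Tuple

def solve_quadratic(a: float, b: float, c: float) -> Tuple[complex, complex]:
    """Return both roots of a*x^2 + b*x + c = 0 (a must be nonzero)."""
    if a == 0:
        raise ValueError("not a quadratic equation: a must be nonzero")
    # cmath.sqrt accepts negative discriminants, so complex roots work too.
    root_disc = cmath.sqrt(b * b - 4 * a * c)
    return ((-b + root_disc) / (2 * a), (-b - root_disc) / (2 * a))

# Hypothetical problem: solve 2x^2 - 7x + 3 = 0.
roots = solve_quadratic(2, -7, 3)
print(sorted(r.real for r in roots))  # [0.5, 3.0]
```

The interpreter carries out the arithmetic, so the model only has to set up the equation correctly. A problem with no such computational formulation is where a text rationale remains the better fit.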

Key Novelty

Hybrid Instruction Tuning with MathInstruct

Curates 'MathInstruct', a dataset mixing 13 math datasets with both Chain-of-Thought (CoT) and Program-of-Thought (PoT) rationales to cover diverse math fields
Trains models to be flexible: they can reason via text (CoT) for abstract concepts or write Python programs (PoT) for precise calculation
Implements a hybrid decoding strategy during inference: first attempt to write and execute a program (PoT); if that fails, fall back to text-based reasoning (CoT)

Architecture

Overview of the MAmmoTH pipeline: constructing the MathInstruct dataset, fine-tuning base models, and the hybrid evaluation strategy.

Evaluation Highlights

MAmmoTH-7B achieves 35.2% accuracy on the competition-level MATH dataset, surpassing WizardMath-7B (10.7%) by over 3x
MAmmoTH-Coder-34B achieves 44% accuracy on MATH, outperforming GPT-4's CoT result
On out-of-domain (OOD) datasets, MAmmoTH models show average accuracy gains of 16% to 32% over existing open-source baselines, indicating better generalization

Breakthrough Assessment

9/10

MAmmoTH-Coder-34B beating GPT-4 (CoT) on MATH is a significant milestone for open-source models. The hybrid CoT/PoT approach effectively addresses the precision vs. reasoning trade-off.

⚙️ Technical Details

Problem Definition

Setting: General-purpose mathematical reasoning across diverse difficulty levels (elementary to college) and formats (open-ended and multiple-choice)

Inputs: Natural language math problem q

Outputs: Predicted answer a, derived via either step-by-step text reasoning or executable code

Pipeline Flow

Input Processing (Prompt formatting)
Hybrid Prediction (PoT attempt -> Fallback to CoT)
Output Execution/Parsing (Run code or parse text)

System Modules

Input Processor

Formats the user question with a trigger phrase if PoT is desired ('Let's write a program...') or standard formatting for CoT

Model or implementation: MAmmoTH / MAmmoTH-Coder
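The input step can be sketched as a small formatting function. This is only an illustration: the function name `format_prompt` is hypothetical, any surrounding instruction template is omitted, and the trigger phrase is kept in the truncated form quoted above because its full wording is not given here.

```python
# Trigger phrase exactly as quoted in this summary (truncated there);
# substitute the exact wording used by the released models.
POT_TRIGGER = "Let's write a program..."

def format_prompt(question: str, use_pot: bool, trigger: str = POT_TRIGGER) -> str:
    """Append the PoT trigger when a program is wanted; otherwise return the question as-is."""
    question = question.strip()
    return f"{question} {trigger}" if use_pot else question

print(format_prompt("What is 15% of 240?", use_pot=True))
print(format_prompt("What is 15% of 240?", use_pot=False))
```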

Reasoning Engine

Generates the solution rationale. In Hybrid mode, first attempts to generate Python code. If execution fails, generates text reasoning.

Model or implementation: MAmmoTH / MAmmoTH-Coder (Llama-2 or Code Llama base)

Executor / Parser

Executes the Python code (if PoT) to get the answer, or extracts the final answer from text (if CoT)

Model or implementation: Python Interpreter
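A minimal sketch of the two output paths follows. Both helpers are assumptions made for illustration: `run_program` executes code in-process with no sandbox or timeout, and `extract_answer` uses a simple "last number in the text" heuristic, which is not necessarily how the authors parse answers.

```python
import contextlib
import io
import re
from typing import Optional

def run_program(code: str) -> Optional[str]:
    """Execute generated Python and return what it printed, or None on failure."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            # No sandboxing or timeout here: illustration only.
            exec(code, {"__name__": "__pot__"})
    except Exception:
        return None
    output = buffer.getvalue().strip()
    return output or None  # a program that prints nothing counts as a failure

def extract_answer(text: str) -> Optional[str]:
    """Heuristic: treat the last number in a CoT rationale as the answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

print(run_program("print(3 * 12)"))   # 36
print(run_program("print(1 / 0)"))    # None
print(extract_answer("0.15 * 240 = 36. The answer is 36"))  # 36
```

In practice, generated code would be run in an isolated process with a time limit, since model output is untrusted.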

Novel Architectural Elements

Hybrid decoding mechanism: An inference-time heuristic that prioritizes program generation (PoT) for precision but falls back to text generation (CoT) for robustness when code fails
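The control flow of this heuristic can be sketched as follows. The function `hybrid_decode` and its callable parameters are hypothetical stand-ins for the model and the interpreter; the sketch shows only the PoT-first, CoT-fallback logic described above.

```python
from typing import Callable, Optional, Tuple

def hybrid_decode(
    question: str,
    generate_pot: Callable[[str], str],
    generate_cot: Callable[[str], str],
    execute: Callable[[str], Optional[str]],
) -> Tuple[str, str]:
    """Try PoT first; fall back to CoT when the program yields no answer.

    Returns (mode, result), where mode is "pot" or "cot".
    """
    program = generate_pot(question)
    answer = execute(program)  # expected to return None when execution fails
    if answer is not None:
        return "pot", answer
    # Second generation pass, taken only when the program could not be run.
    return "cot", generate_cot(question)

# Stubs in place of a real model and interpreter (illustration only).
ok = hybrid_decode("q", lambda q: "print(36)", lambda q: "so the answer is 36", lambda p: "36")
fallback = hybrid_decode("q", lambda q: "prnt(36", lambda q: "so the answer is 36", lambda p: None)
print(ok)        # ('pot', '36')
print(fallback)  # ('cot', 'so the answer is 36')
```

Passing the generators and executor as arguments keeps the fallback logic independent of any particular model API.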

Modeling

Base Model: Llama-2 (7B, 13B, 70B) and Code Llama (7B, 13B, 34B)

Training Method: Full fine-tuning on MathInstruct dataset

Objective Functions:

Purpose: Standard causal language modeling loss.

Formally: Autoregressive cross-entropy loss.
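Written out under the usual convention for instruction tuning, for a question q and a target rationale y = (y_1, ..., y_T), the loss takes the form below. The notation is an assumption, since no formula is given here, and whether prompt tokens also contribute to the loss is not stated.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid q,\, y_{<t}\right)
```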

Training Data:

MathInstruct: 260K total pairs
Source 1: 7 existing datasets (GSM8K, MATH, AQuA, etc.)
Source 2: 6 newly curated datasets with CoT/PoT rationales synthesized by GPT-4
Rationales: Mix of CoT (text) and PoT (code)

Key Hyperparameters:

learning_rate: 2e-5 (7B/13B), 1e-5 (34B/70B)
batch_size: 128
epochs: 3
scheduler: cosine with 3% warm-up
max_sequence_length: 2048 tokens
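As a rough illustration of what these settings imply, the sketch below computes a cosine schedule with 3% warm-up. The linear shape of the warm-up, the decay to zero, and the helper name `lr_at` are assumptions; the step count is only approximate because 260K is a rounded dataset size.

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 2e-5, warmup_ratio: float = 0.03) -> float:
    """Learning rate at an optimizer step: linear warm-up, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Approximate step count: 260K examples, batch size 128, 3 epochs.
steps_per_epoch = math.ceil(260_000 / 128)
total_steps = 3 * steps_per_epoch
print(total_steps, int(total_steps * 0.03))  # 6096 182
```

Under these assumptions, the peak rate of 2e-5 (the 7B/13B setting) is reached after roughly 180 steps and then decays over the remaining three epochs.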

Compute: DeepSpeed with ZeRO stage 3 used for the 34B and 70B models

Comparison to Prior Work

vs. WizardMath: MAmmoTH uses a broader mix of datasets (13 vs 2) and hybrid rationales (CoT + PoT), leading to better OOD generalization
vs. Platypus: MAmmoTH focuses specifically on math with PoT integration, achieving higher scores on hard math benchmarks like MATH
vs. Galactica/Minerva: MAmmoTH uses lightweight instruction tuning rather than expensive continued pre-training (the paper discusses these models but does not use them as direct baselines)

Limitations

PoT approach struggles with abstract reasoning (logic, abstract algebra) where no APIs exist
Hybrid decoding adds inference latency due to potential two-pass generation (PoT then CoT)
Dependency on GPT-4 for synthesizing training rationales (distillation)

Reproducibility

Code: https://tiger-ai-lab.github.io/MAmmoTH/

Publicly available: MAmmoTH models (HuggingFace), MathInstruct dataset, and code. Missing: exact compute hours and hardware resources.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation for MAmmoTH; Few-shot (8-shot or 5-shot) for baselines where applicable.

Benchmarks:

GSM8K: grade school math (in-domain)
MATH: competition-level math (in-domain)
AQuA-RAT: algebra word problems (in-domain)
NumGLUE: numerical reasoning (in-domain)
MMLU-Math: math subset of Massive Multitask Language Understanding (out-of-domain)
SAT-Math: Scholastic Assessment Test math (out-of-domain)
SimulEq: simultaneous equations (out-of-domain)

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

MAmmoTH models significantly outperform the leading open-source math model, WizardMath, on the challenging MATH dataset across different scales.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
MATH	Accuracy	10.7	35.2	+24.5
MATH	Accuracy	14.0	36.5	+22.5
MATH	Accuracy	42.5	44.6	+2.1

Ablation studies confirm that combining CoT and PoT data yields better overall performance than either alone.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Average (9 datasets)	Accuracy	32.6	47.9	+15.3
Average (9 datasets)	Accuracy	41.8	47.9	+6.1

Experiment Figures

Comparison of different training data mixtures (CoT only, PoT only, Hybrid) on model performance across In-Domain and Out-of-Domain datasets.

Main Takeaways

Diverse data sources are critical: Training on single datasets (like GSM8K) produces specialists that fail to generalize; MathInstruct's 13-dataset mix creates robust generalists.
Code Llama is a superior base: MAmmoTH-Coder (based on Code Llama) consistently outperforms standard Llama-2 based models, even on non-coding math tasks, suggesting code training enhances reasoning.
Hybrid Decoding is effective: The strategy of attempting PoT first and falling back to CoT combines the precision of code with the flexibility of language.

📚 Prerequisite Knowledge

Prerequisites

Instruction Tuning (Fine-tuning LLMs on instruction-response pairs)
Chain-of-Thought (CoT) Prompting
Program-of-Thought (PoT) / Program-Aided Language Models (PAL)

Key Terms

CoT: Chain-of-Thought, a prompting method where the model generates intermediate natural language reasoning steps before the final answer

PoT: Program-of-Thought, a prompting method where the model generates executable code (e.g., Python) to solve the problem, offloading computation to an interpreter

MathInstruct: The authors' curated dataset of 260K instruction-response pairs, combining 13 source datasets with hybrid CoT and PoT rationales

Self-Instruct: A method for generating synthetic training data by prompting a strong model (like GPT-4) to create new examples based on seed instructions

Hybrid Decoding: An inference strategy where the model first tries to solve a problem via PoT; if the code is not executable, it falls back to CoT