Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

📝 Paper Summary

Mathematical Reasoning, LLM Post-training, Reinforcement Learning from Human/AI Feedback

Qwen2.5-Math achieves state-of-the-art mathematical reasoning by integrating self-improvement cycles across pre-training, post-training, and inference, leveraging a specialized reward model to guide data synthesis and reinforcement learning.

Core Problem

General-purpose language models often struggle with complex mathematical reasoning and precise calculation because they lack sufficient specialized pre-training data and rigorous verification mechanisms.

Why it matters:

Mathematical reasoning is a key indicator of AGI capabilities but remains a stumbling block for many models
Standard CoT prompting often fails on algorithmic tasks (e.g., finding roots of equations) where precise calculation is needed
Current open-source models lag behind closed-source frontiers (like GPT-4o) in specialized math benchmarks

Concrete Example: When asked to solve a complex root-finding problem, a standard CoT model might hallucinate arithmetic steps. In contrast, Qwen2.5-Math uses Tool-Integrated Reasoning to generate Python code that computes the roots precisely, guided by a reward model that verifies the final answer.
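
As a rough illustration of the kind of program a TIR response might emit for a root-finding question, the sketch below locates a root numerically with plain bisection. The cubic, the bracket [2, 3], and the helper name bisect_root are assumptions chosen for this example; they are not taken from the paper.

```python
def bisect_root(f, lo, hi, tol=1e-12, max_iter=200):
    """Find a root of f in [lo, hi] by bisection; f(lo) and f(hi) must differ in sign."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0:
        return lo
    if f_hi == 0:
        return hi
    if f_lo * f_hi > 0:
        raise ValueError("f(lo) and f(hi) must have opposite signs")
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        # Stop when we hit the root exactly or the bracket is small enough.
        if f_mid == 0 or (hi - lo) / 2 < tol:
            return mid
        if f_lo * f_mid < 0:
            hi = mid              # root lies in the left half
        else:
            lo, f_lo = mid, f_mid  # root lies in the right half
    return (lo + hi) / 2


def cubic(x):
    # Illustrative equation: x^3 - 2x - 5 = 0
    return x**3 - 2 * x - 5


root = bisect_root(cubic, 2.0, 3.0)
print(f"root ~ {root:.6f}, residual = {cubic(root):.2e}")
```

The point of delegating this step to code is that the arithmetic is computed rather than generated token by token, so the residual can be checked directly.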

Key Novelty

Full-Pipeline Self-Improvement for Math

Uses the previous model iteration (Qwen2-Math) to synthesize massive-scale pre-training data and supervision signals for the next iteration (Qwen2.5-Math)
Integrates a math-specific Reward Model (RM) not just for ranking, but to drive rejection sampling for SFT data creation and to guide Group Relative Policy Optimization (GRPO) in reinforcement learning
Combines Chain-of-Thought (natural language) and Tool-Integrated Reasoning (code execution) in a unified training recipe

Architecture

The iterative self-improvement pipeline for developing Qwen2.5-Math from Qwen2-Math.

Evaluation Highlights

Qwen2.5-Math-72B-Instruct achieves 83.6 (CoT) and 85.3 (TIR) on the MATH benchmark, outperforming GPT-4o and Gemini Math-Specialized 1.5 Pro
Qwen2.5-Math-1.5B-Instruct scores ~80 on MATH with Python Interpreter (TIR), surpassing most 70B+ open-source models
Qwen2.5-Math-72B-Instruct solves almost all problems in the AMC 2023 dataset with RM assistance

Breakthrough Assessment

9/10

Sets a new state-of-the-art for open-source math models, beating leading closed-source models on key benchmarks. The efficacy of the self-improvement loop and the performance of the 1.5B model are particularly notable.

⚙️ Technical Details

Problem Definition

Setting: Mathematical problem solving using natural language (CoT) and code execution (TIR)

Inputs: Math query q (English or Chinese)

Outputs: Reasoning path (text or code) and final answer a

Pipeline Flow

Input Query
Inference Mode Selection (CoT or TIR)
Reasoning Generation (with optional RM guidance)
Output Execution (for TIR)
Final Answer Extraction
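
The flow above can be sketched as a small driver. This is a toy under stated assumptions, not the authors' implementation: generate is a hypothetical stand-in for the model call, the <code> and <output> markers and the boxed-answer convention are illustrative choices, and the code is executed with no sandboxing.

```python
import contextlib
import io
import re


def run_python(code):
    """Execute a code string and return what it printed (no sandboxing: illustration only)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()


def extract_boxed(text):
    """Return the content of the last boxed answer in text, or None (no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None


def solve(query, generate, mode="cot"):
    """Run one query through a toy CoT or TIR flow and return the extracted answer."""
    response = generate(query)
    if mode == "tir":
        match = re.search(r"<code>(.*?)</code>", response, flags=re.DOTALL)
        if match:
            # Execute the generated code and hand its output back to the model.
            output = run_python(match.group(1))
            response = generate(f"{query}\n{response}\n<output>{output}</output>")
    return extract_boxed(response)


def toy_generate(prompt):
    """Hypothetical model stub: emits code first, then a boxed answer once it sees output."""
    if "<output>" in prompt:
        result = prompt.split("<output>")[1].split("</output>")[0]
        return f"The sum is \\boxed{{{result}}}."
    return "Let me compute this.\n<code>print(sum(range(1, 101)))</code>"


answer = solve("What is 1 + 2 + ... + 100?", toy_generate, mode="tir")
print(answer)
```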

System Modules

Base Model

Generate reasoning steps and code blocks

Model or implementation: Qwen2.5-Math-Instruct (1.5B, 7B, or 72B)

Python Interpreter

Execute code blocks generated by the model during TIR mode

Model or implementation: Standard Python Environment

Reward Model (Optional)

Score candidate responses during best-of-N sampling inference

Model or implementation: Qwen2.5-Math-RM

Novel Architectural Elements

Integration of a dedicated dense Reward Model into the inference loop for Best-of-N sampling
Dual-mode training pipeline supporting both pure CoT and Tool-Integrated Reasoning (Python) within the same model weights
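
Best-of-N selection with a reward model reduces to sampling several candidates and keeping the highest-scoring one. The sketch below assumes two hypothetical callables, sample(query) for the policy and score(query, response) for the reward model; the toy scorer peeks at the equation purely to keep the example deterministic, which a real reward model would not do.

```python
import itertools


def best_of_n(query, sample, score, n=8):
    """Sample n candidate responses and return the one the scorer rates highest."""
    candidates = [sample(query) for _ in range(n)]
    return max(candidates, key=lambda response: score(query, response))


# Toy stand-ins (hypothetical): a fixed pool of responses and a hand-written scorer.
_pool = itertools.cycle(["x = 3", "x = 4", "x = 5"])


def toy_sample(query):
    return next(_pool)


def toy_score(query, response):
    # Higher score the closer the candidate is to satisfying 2x + 1 = 9.
    x = int(response.split("=")[1])
    return -abs(2 * x + 1 - 9)


best = best_of_n("Solve 2x + 1 = 9.", toy_sample, toy_score, n=3)
print(best)
```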

Modeling

Base Model: Qwen2.5-1.5B/7B/72B (initialized from these general base models)

Training Method: Group Relative Policy Optimization (GRPO) following Supervised Fine-Tuning (SFT)

Objective Functions:

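Purpose: Train Reward Model to rank responses.

Formally: Listwise ranking loss over pairs of correct/incorrect responses.

Purpose: Optimize policy to maximize reward without drifting too far from the reference policy.

Formally: GRPO objective that maximizes the group-normalized advantage A_i = (r_i - mean(r)) / std(r), where r is the set of rewards for the G responses sampled for the same query, with a separate penalty term beta * KL(pi_theta || pi_ref) subtracted from the objective.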
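
The group-relative advantage can be computed in a few lines. This is a minimal sketch: the use of the population standard deviation and the small eps guard against zero variance are assumptions of this example, and the clipped policy-ratio and KL terms of the full objective are omitted.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one query's sampled responses to zero mean, unit scale."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses to the same query: two judged correct (1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Because the baseline is the group mean, no separate value network is needed; when every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.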

Training Data:

Pre-training: Qwen Math Corpus v2 (>1T tokens)
SFT CoT: 2.5M samples (2M English, 500K Chinese) via iterative rejection sampling
SFT TIR: 395K samples (annotated + synthetic) via online Rejection Fine-Tuning
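
Rejection sampling for SFT data can be sketched as: sample several reasoning paths, keep those whose final answer matches the reference, and retain the top few by reward-model score. All callables and the n and top_k defaults below are hypothetical placeholders for illustration, not values from the paper.

```python
import itertools


def rejection_sample(query, reference_answer, sample, extract_answer, score, n=16, top_k=2):
    """Keep the top_k highest-scoring sampled responses whose final answer is correct."""
    candidates = [sample(query) for _ in range(n)]
    correct = [c for c in candidates if extract_answer(c) == reference_answer]
    correct = list(dict.fromkeys(correct))  # drop exact duplicates, keep first-seen order
    correct.sort(key=lambda c: score(query, c), reverse=True)
    return correct[:top_k]


# Toy stand-ins (hypothetical) so the sketch runs deterministically.
_responses = itertools.cycle([
    "2x = 8, so x = 4. Answer: 4",
    "2x = 10, so x = 5. Answer: 5",
    "x = 4. Answer: 4",
])


def toy_sample(query):
    return next(_responses)


def toy_extract(response):
    return response.rsplit("Answer: ", 1)[-1]


def toy_score(query, response):
    # Pretend reward model: favors the more detailed reasoning path.
    return len(response)


kept = rejection_sample("Solve 2x + 1 = 9.", "4", toy_sample, toy_extract, toy_score, n=6, top_k=1)
print(kept)
```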

Key Hyperparameters:

learning_rate: 1e-5 (7B), 5e-6 (72B) for RL; 2e-5 (1.5B/7B), 5e-6 (72B) for SFT
batch_size: 512 (RL global batch)
kl_coefficient: 1e-3
group_size_G: 32 (samples per query in RL)
context_length: 4096 tokens
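
For readers who want these values in one place, the RL settings listed above could be collected in a small config object. The class and field names are hypothetical, and treating the 4096-token context length as an RL setting is an assumption, since the list does not say which stage it applies to. No RL learning rate is listed for the 1.5B model, so none is included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RLConfig:
    """RL hyperparameters as listed in the summary (names are illustrative)."""
    learning_rate: float
    global_batch_size: int = 512
    kl_coefficient: float = 1e-3
    group_size: int = 32          # samples per query
    context_length: int = 4096    # tokens; stage assignment is an assumption


# Only the sizes whose RL learning rate is listed above.
RL_LEARNING_RATES = {"7B": 1e-5, "72B": 5e-6}


def rl_config_for(model_size):
    """Build the config for a model size; raises KeyError if no rate is listed."""
    return RLConfig(learning_rate=RL_LEARNING_RATES[model_size])


print(rl_config_for("7B"))
```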

Compute: Not reported in the paper

Comparison to Prior Work

vs. Qwen2-Math: Adds Chinese support, TIR mode, stronger base model (Qwen2.5), larger pre-training corpus (1T vs 700B)
vs. GPT-4o: Specialized self-improvement pipeline allows smaller models (72B) to outperform GPT-4o on math benchmarks
vs. DeepSeek-Math: Qwen2.5-Math explicitly integrates TIR (code execution) into the primary training pipeline alongside CoT, rather than just as a prompting strategy [not cited in paper]

Limitations

Heavy reliance on synthetic data which may contain hallucinations or biases not fully filtered
Reinforcement learning requires verifiable ground truth (final answers), limiting applicability to open-ended math questions without clear answers
The security risks of executing model-generated code in TIR mode are not discussed

Reproducibility

Code: https://github.com/QwenLM/Qwen2-Math

Base, Instruct, and Reward models are publicly available on Hugging Face. Evaluation scripts are on GitHub. TIR demo available via Qwen-Agent. Training code and exact hardware details are not provided.

📊 Experiments & Results

Evaluation Setup

Zero-shot or few-shot evaluation on standardized math benchmarks using both CoT and TIR modes.

Benchmarks:

GSM8K (Grade school math)
MATH (Competition-level math)
GaoKao (Chinese college entrance exam math)
AMC23 (American Math Competition 2023)
AIME24 (American Invitational Mathematics Examination 2024)

Metrics:

Accuracy (Score)
Statistical methodology: Not explicitly reported in the paper
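
Accuracy here is the percentage of problems whose extracted final answer matches the reference. The sketch below uses a simple normalized string comparison, which is an assumption of this example; it does not perform mathematical equivalence checking (for instance, it would not treat "0.5" and "1/2" as equal).

```python
def normalize(answer):
    """Strip whitespace and a trailing period so trivially different strings compare equal."""
    return answer.strip().rstrip(".").replace(" ", "")


def accuracy(predictions, references):
    """Return the percentage of predictions that match their reference answer.

    A prediction of None (no answer could be extracted) counts as incorrect.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty benchmark")
    correct = sum(
        p is not None and normalize(p) == normalize(r)
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


score = accuracy(["4", " 5050.", None, "x=2"], ["4", "5050", "7", "x = 3"])
print(score)
```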

Key Results

Benchmark	Metric	Baseline	This Paper	Δ

Base model comparisons: improvements of the Qwen2.5-Math base model over Qwen2-Math and the general Qwen2 models.
MATH	Accuracy	61.5	66.8	+5.3
GSM8K	Accuracy	89.5	91.6	+2.1

Instruct model comparisons: results against proprietary models such as GPT-4o. In the second row, the baseline is the model's own CoT score (83.6) and the result is its TIR score (85.3).
MATH	Accuracy	79.2	83.6	+4.4
MATH	Accuracy	83.6	85.3	+1.7

Small model highlight: parameter efficiency of the 1.5B Instruct model in TIR mode.
MATH	Accuracy	Not reported in the paper	~80	Not reported in the paper

Main Takeaways

Self-improvement works: Iterative data synthesis and reward model training significantly boost performance over the previous generation.
TIR beats CoT: Tool-Integrated Reasoning consistently scores higher than pure Chain-of-Thought, especially on computationally heavy tasks.
Parameter efficiency: The 7B Instruct model rivals the previous generation's 72B model, and the 1.5B model, when equipped with tools, scores ~80 on MATH, surpassing most 70B+ open-source models.
Bilingual strength: Significant gains in Chinese math problems alongside English improvements.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Chain-of-Thought (CoT) prompting
Rejection Sampling
Proximal Policy Optimization (PPO) variants

Key Terms

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer

TIR: Tool-Integrated Reasoning—interleaving natural language reasoning with executable code (e.g., Python) to perform precise calculations

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same input, removing the need for a separate value function network

RM: Reward Model—a model trained to predict the correctness or quality of a generated response, used to guide RL and sampling

SFT: Supervised Fine-Tuning—training the model on high-quality input-output pairs

Rejection Sampling: A method to generate training data by sampling many outputs from a model and keeping only those that are verified as correct

RFT: Rejection Fine-Tuning—iterative fine-tuning on data generated via rejection sampling from the model itself

MuggleMath: A specific method/framework for evolving and synthesizing math problems

FastText: A library for efficient text classification and representation learning

MinHash: A technique for quickly estimating how similar two sets are, used for deduplicating data
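
To make the MinHash idea concrete, the sketch below estimates the Jaccard similarity of two texts from short signatures. The word-level 3-shingles, the 64 hash functions, and the use of salted SHA-1 are choices made for this example only; the summary does not describe the deduplication settings actually used.

```python
import hashlib


def shingles(text, k=3):
    """Split text into a set of overlapping k-word shingles."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, record the minimum hash value over the set."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "find all real roots of the polynomial x cubed minus two x minus five"
doc_b = "find all real roots of the polynomial x cubed minus two x minus seven"
doc_c = "a train leaves the station at noon travelling east at sixty miles per hour"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print("near-duplicate:", estimate_jaccard(sig_a, sig_b))
print("unrelated:", estimate_jaccard(sig_a, sig_c))
```

Signatures have a fixed length regardless of document size, which is what makes pairwise comparison cheap enough for corpus-scale deduplication.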