GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that eliminates the critic model by sampling a group of outputs for the same question and using their average reward as the baseline
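A minimal sketch of the group-relative baseline: each sampled output's reward is centered on the group mean and scaled by the group standard deviation to give a per-output advantage. The function name and the choice of population (rather than sample) standard deviation are illustrative assumptions, not the paper's exact implementation.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: baseline each reward against the
    mean of its group and scale by the group's std (population std
    here, as an illustrative choice; guard against zero std)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid divide-by-zero
    return [(r - mean) / std for r in rewards]

# Four outputs sampled for the same question, scored 0/1 for correctness:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # → [1.0, -1.0, -1.0, 1.0]
```

Because the baseline is just the group mean, no separate value network (critic) has to be trained or stored.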
PPO: Proximal Policy Optimization—a standard RL algorithm that uses a value function (critic) to stabilize policy updates
DeepSeekMath Corpus: A 120B token dataset of mathematical web pages mined from Common Crawl using an iterative fastText classifier
RFT: Rejection Sampling Fine-Tuning—sampling multiple outputs per problem from a model, then fine-tuning it on only the outputs verified to be correct
fastText: A library for efficient text classification and representation learning, used here to filter web pages
chain-of-thought: Prompting technique where the model generates intermediate reasoning steps before the final answer
program-of-thought: Prompting technique where the model generates executable code to solve the problem
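To make the contrast with chain-of-thought concrete, here is what a hypothetical program-of-thought completion might look like: instead of prose reasoning steps, the model emits executable code whose result is the answer. The problem and code are illustrative, not taken from the paper.

```python
# Hypothetical program-of-thought completion for the question
# "What is the sum of the first 100 positive integers?"
def solve():
    # The model offloads the arithmetic to the interpreter
    # rather than computing it in natural-language steps.
    return sum(range(1, 101))

print(solve())  # → 5050
```

Running the generated program with an interpreter avoids the arithmetic slips that chain-of-thought text is prone to.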
DPO: Direct Preference Optimization—optimizing a policy directly on preference pairs, without fitting an explicit reward model
Minerva: A large closed-source PaLM-based model fine-tuned on mathematical content
OpenWebMath: An open-source dataset of mathematical web pages, used as a seed for DeepSeekMath Corpus
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from another, used as a penalty in RL to prevent model drift
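For discrete distributions the divergence is KL(P‖Q) = Σ P(x)·log(P(x)/Q(x)): it is zero when the two distributions match and grows as they drift apart, which is what makes it usable as a drift penalty. A minimal sketch over probability vectors (the function name is an assumption for illustration):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)).
    Nonnegative; zero iff P and Q are identical.
    Terms with P(x) = 0 contribute nothing by convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions incur no penalty; drift incurs a positive one.
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))       # → 0.0
print(kl_divergence([0.9, 0.1], [0.5, 0.5]) > 0)   # → True
```

In RL fine-tuning, P would be the current policy's token distribution and Q the reference model's, so the penalty discourages the policy from drifting far from the reference.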
critic model: In RL, a model that estimates the value (expected future reward) of a state; GRPO removes this component