GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that estimates baselines from the average reward of a group of sampled outputs rather than using a separate critic model
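The group-baseline idea can be sketched in a few lines: rewards for a group of outputs sampled from the same prompt are normalized against the group's own mean and standard deviation, so no learned value function is needed. This is an illustrative sketch of the advantage computation only, not the full algorithm.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each reward against the group mean and standard deviation.

    The group average stands in for a critic's value estimate; outputs
    better than their siblings get positive advantages. (Sketch only.)
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]

# Rewards for four sampled outputs to the same prompt (1 = correct, 0 = wrong):
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct samples end up with advantage +1 and incorrect ones with −1 here; the policy gradient then pushes probability toward the above-average outputs.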
PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm that constrains policy updates to prevent instability
Chain-of-Thought: A prompting strategy where the model generates intermediate reasoning steps before the final answer
Program-of-Thought: A reasoning method where the model generates executable code (e.g., Python) to solve the problem
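A Program-of-Thought response looks like ordinary code: rather than carrying out the arithmetic in prose, the model emits a program whose output is the answer. The toy problem below is invented for illustration.

```python
# A Program-of-Thought answer to: "A shop sells pens at $5 each. Buying 4
# pens gets a 10% discount on the total. What do 4 pens cost?"
# The model writes code like this, and the executed result is the answer.
price_per_pen = 5
quantity = 4
subtotal = price_per_pen * quantity
total = subtotal * (1 - 0.10)  # apply the 10% discount
print(total)  # 18.0
```

Offloading the calculation to an interpreter avoids the arithmetic slips that plague purely textual chain-of-thought reasoning.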
Rejection Sampling Fine-Tuning (RFT): A method where the model generates multiple samples, correct ones are kept, and the model is fine-tuned on these correct samples
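The RFT data-collection loop can be sketched as follows. The `sample` and `is_correct` callables are hypothetical stand-ins for the model's sampler and an answer checker, not a real API; the toy demo cycles through canned answers so the filtering step is visible.

```python
import itertools

def rejection_sampling_dataset(problems, sample, is_correct, k=4):
    """For each problem, draw k samples and keep only the correct ones.

    The surviving (problem, answer) pairs form the fine-tuning set.
    """
    kept = []
    for problem in problems:
        for _ in range(k):
            answer = sample(problem)
            if is_correct(problem, answer):
                kept.append((problem, answer))
    return kept

# Toy demo: "sampling" cycles through candidate answers; the checker keeps
# only those matching the true sum.
candidates = itertools.cycle(["3", "4", "5"])
dataset = rejection_sampling_dataset(
    problems=["1+2", "2+2"],
    sample=lambda p: next(candidates),
    is_correct=lambda p, a: str(eval(p)) == a,
    k=3,
)
# The model would then be fine-tuned on `dataset`.
```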
DPO: Direct Preference Optimization—an alignment method optimizing policy based on preference pairs without explicit reward modeling
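The standard DPO objective for a single preference pair can be written directly: it is the negative log-sigmoid of the scaled margin between the policy's and the reference model's log-probability gaps on the chosen versus rejected responses. The scalar log-probabilities and the β value below are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    margin = beta * [(log pi(y_w) - log pi_ref(y_w))
                     - (log pi(y_l) - log pi_ref(y_l))]
    loss   = -log sigmoid(margin), computed stably via log1p.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))

# With no preference margin the loss is log 2; widening the margin in
# favor of the chosen response drives the loss down.
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

No reward model appears anywhere: the implicit reward is the log-probability ratio against the reference policy.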
OpenWebMath: A publicly available dataset of high-quality mathematical web text
fastText: A library for efficient text classification and representation learning, used here to filter web pages
KL divergence: Kullback–Leibler divergence—an asymmetric measure of how far one probability distribution is from another, used as a penalty in RL to keep the trained model from deviating too far from the reference model
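For discrete distributions the divergence is a single sum, D_KL(P‖Q) = Σᵢ pᵢ log(pᵢ/qᵢ); a minimal sketch:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability lists.

    Terms with p_i = 0 contribute nothing, so they are skipped.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions have zero divergence; it grows as Q drifts from P.
same = kl_divergence([0.5, 0.5], [0.5, 0.5])
drift = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

In the RL setting, P is the policy being trained and Q the frozen reference model, and the per-token divergence is subtracted from (or added as a penalty to) the reward so the policy cannot drift arbitrarily far while chasing reward.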