DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

📝 Paper Summary

Mathematical Reasoning Datasets
Reinforcement Learning with Verifiable Rewards (RLVR)

DeepMath-103K is a high-difficulty mathematical dataset constructed from informal forums, rigorously decontaminated, and formatted with verifiable answers to enable effective reinforcement learning for advanced reasoning.

Core Problem

Existing mathematical datasets for RL often lack sufficient difficulty, suffer from high contamination with evaluation benchmarks (up to 90%), or lack verifiable answers needed for reliable reward signals.

Why it matters:

Advanced models (like DeepSeek-R1) need increasingly harder problems to improve reasoning, but current open datasets are saturated with easy problems.
High contamination rates in training data (e.g., AIME problems) make evaluation scores unreliable and inflate perceived progress.
RL with Verifiable Rewards (RLVR) requires unambiguous final answers to prevent reward hacking, which many open-ended datasets lack.

Concrete Example: A raw problem from Math StackExchange might be a conversational forum post without a clear question or answer format. DeepMath-103K transforms this into a structured query with a verifiable symbolic answer, whereas standard scraping might leave it unusable for rule-based RL.

Key Novelty

DeepMath-103K Dataset Construction Pipeline

Sources data primarily from informal math forums (Math StackExchange) rather than recycling common competition sets like AIME, ensuring high novelty.
Implements a rigorous semantic decontamination process to remove problems similar to 14 major benchmarks, ensuring evaluation integrity.
Enforces 'verifiability' by filtering for problems where rule-based extractors recover consistent answers across multiple DeepSeek-R1 solution paths.

Architecture

The data curation pipeline for creating DeepMath-103K.

Evaluation Highlights

DeepMath-Omn-1.5B achieves 64.0% pass@1 on AIME24, surpassing o1-mini (63.6%) and o3-mini-low (60.0%).
DeepMath-Zero-7B improves pass@1 on AIME24 by 12.7 percentage points (from 42.9% to 55.6%) over the base Qwen-2.5-Math-7B model.
DeepMath-1.5B (initialized from R1-Distill) improves pass@1 on AIME25 by 6.0 percentage points (from 43.1% to 49.1%) after training.

Breakthrough Assessment

9/10

Provides a critical, high-quality resource (challenging, decontaminated, verifiable data) that directly addresses the bottleneck in open-source reasoning model development. It also reports significant performance gains over strong baselines.

⚙️ Technical Details

Problem Definition

Setting: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for mathematical reasoning

Inputs: Complex mathematical problem statement q

Outputs: Step-by-step reasoning chain followed by a verifiable final answer a

Pipeline Flow

Source Collection (Math StackExchange, etc.)
Decontamination (Remove benchmark overlaps)
Difficulty Filtering (Retain Level 5+)
Answer Verification (Ensure consistency across R1 paths)
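
The four stages above can be read as a chain of filters applied in order. The following is a minimal sketch, assuming each problem record already carries the outputs of the upstream models. The `Problem` schema, its field names, and the `min_difficulty` default are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Problem:
    # Hypothetical schema; the paper does not specify one.
    text: str
    contaminated: bool   # flagged as similar to a benchmark item
    difficulty: float    # rating assigned by the difficulty rater
    r1_answers: list = field(default_factory=list)  # final answers from R1 paths


def run_pipeline(problems, min_difficulty=5.0):
    """Apply the curation stages in order and count survivors after each one."""
    stages = [
        ("decontamination", lambda p: not p.contaminated),
        ("difficulty", lambda p: p.difficulty >= min_difficulty),
        ("verification", lambda p: len(p.r1_answers) > 0
                                   and len(set(p.r1_answers)) == 1),
    ]
    counts = {"collected": len(problems)}
    kept = list(problems)
    for name, keep in stages:
        kept = [p for p in kept if keep(p)]
        counts[name] = len(kept)
    return kept, counts


pool = [
    Problem("A", False, 6.0, ["2", "2", "2"]),  # passes every stage
    Problem("B", True, 7.0, ["1", "1", "1"]),   # removed: contaminated
    Problem("C", False, 3.0, ["5", "5", "5"]),  # removed: too easy
    Problem("D", False, 8.0, ["0", "1", "0"]),  # removed: inconsistent answers
]
kept, counts = run_pipeline(pool)
print(counts)
# {'collected': 4, 'decontamination': 3, 'difficulty': 2, 'verification': 1}
```

Tracking the survivor count per stage makes it easy to see which filter removes the most data.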

System Modules

Source Collector (Data Curation)

Gather raw problems from web sources with high difficulty potential

Model or implementation: N/A (Data scraping/selection)

Decontaminator (Data Curation)

Remove problems semantically similar to test sets

Model or implementation: Llama-3.3-70B-Instruct (as Judge)
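
A rough sketch of how such a decontamination step could be organized: retrieve the nearest benchmark items for each candidate, then ask a judge whether any pair is the same problem. Everything here is an illustrative assumption. Word-level Jaccard overlap stands in for semantic retrieval, and `overlap_judge` is a hypothetical placeholder where a call to the LLM judge would go; the paper's actual prompts and thresholds are not reproduced.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-set overlap between two problem statements, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def overlap_judge(candidate, bench_item, threshold=0.6):
    # Hypothetical placeholder for the LLM judge: flags high lexical overlap.
    return jaccard(candidate, bench_item) >= threshold


def decontaminate(candidates, benchmark, judge=overlap_judge, top_k=2):
    """Keep candidates that the judge does not match to any of their top_k neighbors."""
    clean = []
    for cand in candidates:
        nearest = sorted(benchmark, key=lambda b: jaccard(cand, b), reverse=True)[:top_k]
        if not any(judge(cand, b) for b in nearest):
            clean.append(cand)
    return clean


benchmark = [
    "Find the number of positive integers n less than 100 such that n is divisible by 7.",
    "Compute the integral of x squared from 0 to 1.",
]
candidates = [
    "Find the number of positive integers n less than 100 such that n is divisible by 7",
    "Show that every bounded monotone sequence of real numbers converges.",
]
clean = decontaminate(candidates, benchmark)
print(clean)  # only the second candidate survives
```

Passing the judge as a parameter keeps the retrieval logic unchanged if the lexical placeholder is swapped for a model-based judge.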

Difficulty Filter (Data Curation)

Assign difficulty ratings and filter out easy problems

Model or implementation: GPT-4o

Verifier (Data Curation)

Ensure answers are robustly extractable and consistent

Model or implementation: DeepSeek-R1

Novel Architectural Elements

Consistency-based verification pipeline: enforcing that a problem is only included if DeepSeek-R1 yields the exact same final answer across multiple generation paths, ensuring suitability for rule-based RL rewards.
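
The consistency check can be illustrated with a short sketch. It assumes that each solution marks its final answer with `\boxed{...}` and that answers are compared as whitespace-normalized strings. Both are simplifying assumptions made here; a production verifier would more likely test symbolic equivalence. The function names are hypothetical.

```python
def extract_boxed(solution):
    """Return the content of the last \\boxed{...} in a solution, or None."""
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1
    out = []
    while i < len(solution):
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
        i += 1
    return None  # unbalanced braces: treat as not extractable


def normalize(answer):
    """Remove all whitespace so that formatting differences do not matter."""
    return "".join(answer.split())


def is_verifiable(solutions):
    """True only if every path yields an extractable answer and all answers agree."""
    answers = [extract_boxed(s) for s in solutions]
    if not answers or any(a is None for a in answers):
        return False
    return len({normalize(a) for a in answers}) == 1


paths = [
    r"... so the limit is \boxed{\frac{1}{2}}.",
    r"Therefore \boxed{ \frac{1}{2} }",
    r"Hence the answer is \boxed{\frac{1}{2}}",
]
print(is_verifiable(paths))  # True
```

A problem whose paths disagree, or where any path lacks an extractable answer, is rejected.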

Modeling

Base Models: Qwen-2.5-Math-7B, R1-Distill-Qwen-1.5B, OpenMath-Nemotron-1.5B

Training Method: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL)

Adaptation: Full fine-tuning

Training Data:

DeepMath-103K (103,000 problems)
95K challenging problems (levels 5-9)
8K auxiliary problems (levels 3-5)

Compute: Data curation involved 127,000 H20 GPU hours and $138,000 in GPT-4o API fees.

Comparison to Prior Work

vs. Open-R1: DeepMath uses diverse forum data rather than recycling competition sets, resulting in 82.81K unique problems not found in other datasets.
vs. NuminaMath: DeepMath enforces a strict difficulty filter (Level 5+) and rigorous decontamination against 14 benchmarks.
vs. Orca-Math [not cited in paper]: DeepMath focuses on verifiable calculation/symbolic answers for RLVR, whereas Orca-Math focuses on diverse word problems.

Limitations

High cost of construction ($138k API fees + 127k GPU hours) makes replication expensive.
Reliance on large external models for curation and verification (the proprietary GPT-4o and the open-weight DeepSeek-R1).
Focus on math limits direct applicability to non-STEM reasoning tasks without further adaptation.

Reproducibility

Code: https://github.com/zwhe/DeepMath

Code and model weights are publicly released. The dataset is hosted at https://hf.co/datasets/zwhe/DeepMath-K and includes 3 R1 solutions per problem.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on mathematical and general science benchmarks.

Benchmarks:

AIME 2024 (Challenging Math Competition)
AIME 2025 (Challenging Math Competition)
MATH-500 (Standard Math Benchmark, subset)
GPQA-Diamond (Graduate-Level Science QA)

Metrics:

Pass@1 Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

DeepMath-Omn-1.5B (based on Nemotron) reports state-of-the-art results among small models, surpassing proprietary reasoning models such as o1-mini on AIME 2024.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
AIME 2024	Pass@1	63.6	64.0	+0.4
AIME 2025	Pass@1	50.0	57.3	+7.3

DeepMath-Zero-7B (SFT only) shows large gains over the Qwen base model, supporting the quality of the dataset.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
AIME 2024	Pass@1	42.9	55.6	+12.7
AIME 2025	Pass@1	29.9	42.0	+12.1

Generalization to science domains: DeepMath models improve on biology, physics, and chemistry without domain-specific training.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
GPQA-Diamond (Biology)	Pass@1	39.5	42.3	+2.8
GPQA-Diamond (Physics)	Pass@1	32.6	34.8	+2.2

Experiment Figures

Venn diagram-style analysis of unique problems.

Main Takeaways

Training on DeepMath-103K consistently improves performance on hard benchmarks (AIME) across different model architectures (Qwen, Nemotron).
The dataset enables small models (1.5B parameters) to rival or exceed the performance of much larger or proprietary models (o1-mini) on math tasks.
Because the training data was decontaminated against the evaluation benchmarks, these gains are more plausibly genuine reasoning improvements than memorization.
Reasoning capabilities transfer to non-math STEM domains (biology, physics) despite the dataset being purely mathematical.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning with Verifiable Rewards (RLVR)
Large Language Model (LLM) fine-tuning
Mathematical reasoning benchmarks (AIME, MATH)

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—training models using outcomes that can be automatically checked (e.g., math answers) rather than human preference labels.
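
As an illustration of what "automatically checked" can mean, here is a minimal sketch of a binary rule-based reward. It assumes the final answer appears in a `\boxed{...}` without nested braces and that numeric answers are compared as exact rationals. These are simplifications chosen for brevity, and the function names are hypothetical rather than taken from the paper's code.

```python
import re
from fractions import Fraction


def final_answer(response):
    """Last \\boxed{...} without nested braces, or None if there is none."""
    found = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return found[-1].strip() if found else None


def same_value(a, b):
    """Compare as exact rationals when both parse, otherwise as stripped strings."""
    try:
        return Fraction(a) == Fraction(b)
    except (ValueError, ZeroDivisionError):
        return a.strip() == b.strip()


def reward(response, reference):
    """Binary outcome reward: 1.0 for a matching final answer, else 0.0."""
    predicted = final_answer(response)
    if predicted is None:
        return 0.0
    return 1.0 if same_value(predicted, reference) else 0.0


print(reward(r"Halving gives \boxed{0.5}", "1/2"))  # 1.0
print(reward("I think it is one half.", "1/2"))     # 0.0
```

The reward depends only on the extracted final answer, which is why problems with ambiguous or unextractable answers are unsuitable for this kind of training.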

Decontamination: The process of removing training examples that are identical or semantically similar to test set questions, so that benchmark scores are not inflated by memorization.

Pass@1: A metric measuring the percentage of problems where the model's first generated answer is correct.
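
A small sketch of how pass@1 can be computed. When several responses are sampled per problem, a common practice is to average per-sample correctness within each problem and then across problems. The summary does not state how many samples the paper uses, so the sample count and the input layout below are assumptions.

```python
def pass_at_1(results):
    """Pass@1 in percent.

    `results` maps a problem id to a list of booleans, one per sampled
    response (True means the final answer was judged correct).
    """
    per_problem = [sum(flags) / len(flags) for flags in results.values() if flags]
    if not per_problem:
        raise ValueError("no graded responses")
    return 100.0 * sum(per_problem) / len(per_problem)


graded = {
    "p1": [True, True, False, False],    # 50% of samples correct
    "p2": [False, False, False, False],  # never correct
    "p3": [True, True, True, True],      # always correct
}
print(pass_at_1(graded))  # 50.0
```

With a single sample per problem this reduces to the plain fraction of problems solved on the first attempt.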

SFT: Supervised Fine-Tuning—training a model on a dataset of input-output pairs to teach it a specific behavior or format.

R1 Solutions: Reasoning paths generated by the DeepSeek-R1 model, used here as high-quality synthetic training data.

GPQA-Diamond: A difficult multiple-choice benchmark for graduate-level science and reasoning, used to test generalization beyond pure math.

RL: Reinforcement Learning—a training method where an agent learns to make decisions by receiving rewards or penalties for its actions.