1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training

📝 Paper Summary

Reasoning datasets · Knowledge distillation · Chain-of-thought (CoT) reasoning

The authors introduce a 1.4-million-entry reasoning dataset built by distilling DeepSeek-R1 responses and verifying them with reference-answer checks, sandboxed code execution, and a reward model. Models fine-tuned on it outperform comparably sized DeepSeek-R1-Distill baselines on the reported benchmarks.

Core Problem

Open-source reasoning datasets are significantly smaller (<800k samples) than those used by proprietary models, limiting the community's ability to train effective reasoning models via Supervised Fine-Tuning (SFT).

Why it matters:

DeepSeek-R1's success demonstrated that high-quality, large-scale SFT data (800k+ samples) is critical for inference-time scaling
Existing open-source distillation efforts lack the scale and rigorous verification pipelines (using code execution and math checkers) needed to produce high-purity reasoning traces
Without large-scale, verified reasoning data, open-source models lag behind distilled counterparts in complex math and coding tasks

Concrete Example: A standard distilled model might hallucinate a reasoning step in a math problem without detection; this dataset uses 'math-verify' and reference answers to filter such errors, ensuring the model learns correct logic.

Key Novelty

Large-Scale Verified Distillation Pipeline (AM-DeepSeek-R1-Distilled)

Combines 500k curated open-source samples with 900k new samples distilled from DeepSeek-R1, scaled to 1.4 million total entries
Implements a multi-stage verification system: math problems are checked against reference answers, code is executed in sandboxes, and general reasoning is scored by a reward model
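
The domain-specific routing described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `Sample`, `answers_match`, `verify_sample`, and `REWARD_THRESHOLD` are hypothetical names, the math check is a simple numeric/exact-match stand-in rather than the 'math-verify' tool, and the code runner and reward scorer are passed in as plain callables.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable, Optional

# Hypothetical cutoff; the summary does not state the threshold that was used.
REWARD_THRESHOLD = 0.5


@dataclass
class Sample:
    domain: str                      # "math", "code", or "general"
    question: str
    response: str                    # full distilled reasoning trace
    final_answer: str                # answer (or code) extracted from the trace
    reference: Optional[str] = None  # reference answer, if one exists


def answers_match(candidate: str, reference: str) -> bool:
    """Stand-in for a math checker: numeric comparison, else exact text."""
    try:
        return Fraction(candidate.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return candidate.strip().lower() == reference.strip().lower()


def verify_sample(
    sample: Sample,
    run_tests: Callable[[str], bool],
    reward: Callable[[str, str], float],
) -> bool:
    """Return True if the sample passes the check for its domain."""
    if sample.domain == "math":
        return sample.reference is not None and answers_match(
            sample.final_answer, sample.reference
        )
    if sample.domain == "code":
        return run_tests(sample.final_answer)
    # Everything else is scored by the (injected) reward function.
    return reward(sample.question, sample.response) >= REWARD_THRESHOLD


# Usage: a math sample whose answer equals the reference numerically.
math_ok = verify_sample(
    Sample("math", "What is 1/2 as a decimal?", "...", "0.5", "1/2"),
    run_tests=lambda code: False,
    reward=lambda question, response: 0.0,
)
```

Injecting the test runner and reward scorer keeps the routing logic independent of any particular sandbox or reward model.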

Architecture

The Data Construction Pipeline showing the three main stages: Raw Data Collection, Distilling, and Rejection Sampling.

Evaluation Highlights

+1.9 percentage points on MATH-500 for AM-Distill-Qwen-32B compared to DeepSeek-R1-Distill-Qwen-32B (96.2% vs 94.3%)
+6.5 percentage points on AIME 2024 for AM-Distill-Qwen-72B compared to DeepSeek-R1-Distill-Llama-70B (76.5% vs 70.0%)
Gains on GPQA-Diamond (+2.2 points at 32B) and LiveCodeBench (+1.9 points at 32B, +2.2 points at 72B) over the DeepSeek-R1-Distill baselines

Breakthrough Assessment

8/10

Provides a large, verified dataset that lets open-source models fine-tuned on it outperform the DeepSeek-R1-Distill series, which was distilled from the same teacher (DeepSeek-R1). A high-impact resource for the open-source community.

⚙️ Technical Details

Problem Definition

Setting: Supervised Fine-Tuning (SFT) of Large Language Models for general reasoning tasks

Inputs: Reasoning problems (Math, Code, Science, General)

Outputs: Chain-of-thought reasoning traces followed by the final answer

Modeling

Base Model: Qwen2.5-32B and Qwen2.5-72B

Training Method: Supervised Fine-Tuning (SFT)

Trainable Parameters: Not stated explicitly; full-parameter fine-tuning is implied (standard SFT)

Training Data:

Total size: 1.4 million entries
Sources: 0.5M from open-source (NuminaMath, MetaMathQA, OpenCoder, etc.), 0.9M distilled from DeepSeek-R1
Verification (Math): 'math-verify' for format/result check + Qwen2.5-7B-Instruct for consistency check
Verification (Code): Execution in sandbox environment with test cases
Verification (General): Scored by Decision-Tree-Reward-Llama-3.1-8B; low scores filtered
Deduplication: Semantic deduplication using embedding similarity
Difficulty filtering: Downsampling of easy/medium tasks to prioritize challenging reasoning
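
As a rough illustration of the deduplication step, the sketch below keeps an item only if its embedding is not too close to any item already kept. It is an assumption-laden toy: the function names, the greedy strategy, and the 0.9 cosine threshold are hypothetical, and the summary does not say which embedding model or threshold the authors used.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dedup_indices(
    embeddings: List[Sequence[float]], threshold: float = 0.9  # hypothetical cutoff
) -> List[int]:
    """Greedy pass: keep an item only if it is below the threshold to every kept item."""
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings: the second vector is a near-duplicate of the first.
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
unique = dedup_indices(toy)
```

This pairwise loop is quadratic in the number of items, so at the scale of a million-entry dataset an approximate nearest-neighbor index would likely be needed instead.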

Key Hyperparameters:

max_generation_length: 32,768 tokens (for inference/evaluation)
temperature: 0.6 (evaluation)
top_p: 0.95 (evaluation)
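
To make the two sampling settings concrete, here is a small pure-Python sketch of temperature scaling followed by nucleus (top-p) truncation. The function names are hypothetical and this is not the evaluation code from the paper; inference libraries implement the same idea on tensors.

```python
import math
import random
from typing import List, Optional


def top_p_distribution(
    logits: List[float], temperature: float = 0.6, top_p: float = 0.95
) -> List[float]:
    """Renormalized token distribution after temperature scaling and top-p truncation."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


def sample_token(
    logits: List[float],
    temperature: float = 0.6,
    top_p: float = 0.95,
    rng: Optional[random.Random] = None,
) -> int:
    """Draw one token index from the truncated distribution."""
    rng = rng or random.Random()
    dist = top_p_distribution(logits, temperature, top_p)
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]
```

A temperature below 1 sharpens the distribution, and top-p then discards the low-probability tail before sampling.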

Compute: Not reported in the paper

Comparison to Prior Work

vs. OpenThoughts: the AM dataset is larger (1.4M entries) and covers broader domains beyond math
vs. DeepSeek-R1-Distill series: AM models use strictly verified data (sandbox/math-check) resulting in higher accuracy on benchmarks
vs. Standard SFT: Uses a rigorous rejection sampling pipeline based on reward models and ground-truth verification

Limitations

Potential factual inaccuracies in model-generated responses despite verification
Lack of thorough filtering for harmful instructions or responses
Nested relationships among data sources may lead to inaccuracy issues from original sources
No statistical significance tests reported for benchmark improvements

Reproducibility

Dataset: https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-Distilled-1.4M

Dataset is publicly available on Hugging Face. Evaluation system prompts are provided in Table 1. Verification prompts (difficulty, category, correctness) are in Appendix B. Training hyperparameters (LR, batch size) are not explicitly reported.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on reasoning benchmarks with long context generation

Benchmarks:

AIME 2024 (Competition Mathematics)
MATH-500 (Mathematics)
GPQA-Diamond (Graduate-Level Science QA)
LiveCodeBench (Code Generation, 2024-08 to 2025-01)

Metrics:

Pass@1 Accuracy
Statistical methodology: Not explicitly reported in the paper
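
For reference, Pass@1 can be computed as in the sketch below. The summary does not say how many generations per problem were sampled, so the function (a hypothetical helper, not the paper's evaluation code) accepts one or more correctness flags per problem and averages them; with a single generation per problem it reduces to plain accuracy.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Mean per-problem fraction of correct generations, as a percentage.

    `results` maps a problem id to a non-empty list of correctness flags,
    one flag per sampled generation.
    """
    if not results:
        raise ValueError("no problems to score")
    per_problem = [sum(flags) / len(flags) for flags in results.values()]
    return 100.0 * sum(per_problem) / len(per_problem)


# Two problems, two generations each: (0.5 + 1.0) / 2 = 0.75
score = pass_at_1({"p1": [True, False], "p2": [True, True]})
```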

Key Results

All values are Pass@1 accuracy (%); Δ is in percentage points.

32B Model Comparisons: AM-Distill-Qwen-32B (This Paper) vs. DeepSeek-R1-Distill-Qwen-32B (Baseline) shows improvements on every benchmark.

Benchmark	Baseline	This Paper	Δ
AIME 2024	72.6	72.7	+0.1
MATH-500	94.3	96.2	+1.9
GPQA-Diamond	62.1	64.3	+2.2
LiveCodeBench	57.2	59.1	+1.9

72B Model Comparisons: AM-Distill-Qwen-72B (This Paper) vs. DeepSeek-R1-Distill-Llama-70B (Baseline) shows larger gains, particularly in math competitions.

Benchmark	Baseline	This Paper	Δ
AIME 2024	70.0	76.5	+6.5
MATH-500	94.5	97.0	+2.5
LiveCodeBench	57.5	59.7	+2.2

Main Takeaways

Strict data quality control (sandbox execution, math verification) is credited with letting the AM models outperform baselines trained on potentially noisier distilled data
The AM-Distill-Qwen-72B model shows the largest gain, +6.5 percentage points on AIME 2024 over the 70B baseline, though the two models start from different base models (Qwen2.5-72B vs. a Llama-based 70B)
Combining open-source curated data with fresh distillations from DeepSeek-R1 provides a robust recipe for training reasoning models
Performance improvements are consistent across diverse domains (Math, Code, Science), validating the dataset's broad applicability

📚 Prerequisite Knowledge

Prerequisites

Knowledge Distillation
Supervised Fine-Tuning (SFT)
Chain-of-Thought (CoT) prompting
Reinforcement Learning from Human Feedback (RLHF) concepts (Reward Models)

Key Terms

SFT: Supervised Fine-Tuning—training a pre-trained model on a specific dataset to adapt it to a task

Distillation: The process of training a smaller 'student' model to mimic the outputs of a larger 'teacher' model (here, DeepSeek-R1)

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer

Pass@1: An evaluation metric measuring the percentage of problems where the model's first generated answer is correct

DeepSeek-R1: A strong open-source reasoning model used as the 'teacher' for generating the reasoning traces in this dataset

Reward Model: A model trained to predict the quality of a response, used here to filter out low-quality reasoning traces

Sandbox: An isolated computing environment used to safely execute and verify generated code against test cases
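
To illustrate the sandbox idea, the sketch below runs a candidate solution together with its test cases in a separate Python interpreter with a timeout. This is an assumption for illustration only: `passes_tests` is a hypothetical helper, and a plain subprocess is not a real security boundary, so a production pipeline would use containers or a similarly isolated runtime.

```python
import os
import subprocess
import sys
import tempfile


def passes_tests(solution_code: str, test_code: str, timeout_s: float = 5.0) -> bool:
    """Run solution + tests in a separate interpreter; True only on a clean exit."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "candidate.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(solution_code + "\n\n" + test_code + "\n")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", path],  # -I: isolated mode, ignores user site and env vars
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Infinite loops or very slow solutions count as failures.
            return False
        return proc.returncode == 0


# Usage: a failing assertion makes the child process exit with a non-zero code.
ok = passes_tests("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")
```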