OpenThoughts: Data Recipes for Reasoning Models

📝 Paper Summary

Synthetic Data Generation, Reasoning Models, Supervised Fine-Tuning (SFT)

OpenThoughts systematically ablates data curation steps—sourcing, mixing, filtering, and teacher selection—to build a state-of-the-art open reasoning dataset that enables small models to outperform larger distilled baselines.

Core Problem

Training strong reasoning models relies on high-quality SFT data, but the optimal recipes for curating such data (question selection, filtering, teacher choice) remain proprietary and unexplored in open research.

Why it matters:

Current open-source efforts often rely on heuristics or a single teacher model (DeepSeek-R1) without verifying which data strategies actually yield better downstream performance.
Exploring the design space of data curation is prohibitively expensive for most researchers due to high inference costs for teacher models.
Lack of public knowledge prevents the community from reproducing frontier reasoning capabilities like those of o3 or DeepSeek-R1.

Concrete Example: When filtering questions for a reasoning dataset, standard methods like embedding distance or fastText classification often fail to select the best samples. The paper shows these methods are outperformed by simple heuristics like selecting questions that elicit long responses from an LLM.

Key Novelty

Systematic Ablation of Reasoning Data Pipelines (OpenThoughts)

Conduct over 1,000 controlled experiments to isolate the impact of each data curation step: question sourcing, mixing strategies, question filtering, answer deduplication, and teacher selection.
Establish empirical findings that contradict common intuition, such as 'diversity (mixing many sources) hurts performance' and 'higher benchmark scores do not make a better teacher'.
Scale the best-performing pipeline configuration to create OpenThoughts3-1.2M, a massive open-source reasoning dataset.

Architecture

The OpenThoughts3 data curation pipeline, illustrating the sequence of steps from sourcing to final dataset creation.

Evaluation Highlights

OpenThinker3-7B achieves 53% on AIME 2025, outperforming DeepSeek-R1-Distill-Qwen-7B by 15.3 percentage points.
On GPQA Diamond, OpenThinker3-7B reaches 54%, surpassing DeepSeek-R1-Distill-Qwen-7B by 20.5 percentage points.
Sampling 16 answers per question from the teacher increases dataset size and effectiveness more than mixing diverse question sources.

Breakthrough Assessment

9/10

Provides a comprehensive, empirical recipe for reasoning data curation that significantly advances the state of open-source models, outperforming major distilled baselines by large margins.

⚙️ Technical Details

Problem Definition

Setting: Supervised Fine-Tuning (SFT) of language models on reasoning tasks (math, code, science)

Inputs: Questions q from various sources (synthetic, human-written)

Outputs: Reasoning traces (Chain-of-Thought) t and final answers a

Pipeline Flow

Question Sourcing (Select top sources per domain)
Question Mixing (Combine top 1-2 sources)
Question Filtering (Select best questions via difficulty/length)
Answer Generation (Sample multiple answers from Teacher)
Deduplication (Exact or Fuzzy matching)
Fine-tuning (Train Student Model)
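
The deduplication step in the flow above can be illustrated with a minimal sketch. It assumes deduplication runs over a list of strings, that exact matching means equality after lowercasing and whitespace normalization, and that fuzzy matching means a character-level similarity ratio above a threshold. The function names, the normalization, and the 0.9 threshold are illustrative assumptions, not the paper's implementation.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def dedup_exact(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized string."""
    seen: set[str] = set()
    kept: list[str] = []
    for text in texts:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


def dedup_fuzzy(texts: list[str], threshold: float = 0.9) -> list[str]:
    """Drop a string if it is at least `threshold` similar to one already kept.

    Quadratic in the number of strings, so only suitable for small examples.
    """
    kept: list[str] = []
    for text in texts:
        candidate = normalize(text)
        is_duplicate = any(
            SequenceMatcher(None, candidate, normalize(other)).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(text)
    return kept


questions = [
    "Find the sum of the first 100 positive integers.",
    "find the sum of the  first 100 positive integers.",  # exact dup after normalization
    "Find the sum of the first 100 positive integers",    # near dup (missing period)
    "Prove that the square root of 2 is irrational.",
]
exact_kept = dedup_exact(questions)
fuzzy_kept = dedup_fuzzy(questions)
```

Exact matching removes only the second string, while fuzzy matching also removes the third. A pipeline at the scale of the final dataset would need an approximate method rather than this pairwise comparison.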

System Modules

Question Sourcing

Identify high-quality seed questions for math, code, and science

Model or implementation: Various (GPT-4o-mini for synthetic generation)

Question Filter

Select the highest-quality subset of questions to label

Model or implementation: GPT-4o-mini (for difficulty scoring) or Teacher (for response length)
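
The response-length variant of this filter can be sketched as follows. The sketch assumes each question already has one draft response from an LLM and that length is measured in whitespace-separated tokens. Both choices, and the function name `select_by_response_length`, are illustrative assumptions rather than details from the paper.

```python
def select_by_response_length(
    questions: list[str], responses: list[str], k: int
) -> list[str]:
    """Return the k questions whose draft responses are longest.

    Length is counted in whitespace-separated tokens (an assumption).
    """
    if len(questions) != len(responses):
        raise ValueError("questions and responses must have the same length")
    ranked = sorted(
        zip(questions, responses),
        key=lambda pair: len(pair[1].split()),
        reverse=True,
    )
    return [question for question, _ in ranked[:k]]


# Toy data: the second question elicits the longest response.
toy_questions = ["What is 1 + 1?", "Count the lattice paths.", "Name a prime."]
toy_responses = [
    "The answer is 2.",
    "First consider the grid. Each path has n right moves and n up moves, so ...",
    "2",
]
selected = select_by_response_length(toy_questions, toy_responses, k=2)
```

The intuition is that questions eliciting long responses tend to require more reasoning steps, which makes response length a cheap proxy for difficulty.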

Teacher

Generate reasoning traces and answers for filtered questions

Model or implementation: QwQ-32B (selected over DeepSeek-R1)

Novel Architectural Elements

Pipeline-based ablation methodology: Treating the data curation process as a modular system where each component (sourcing, filtering, teacher) is independently optimized via controlled experiments.

Modeling

Base Model: Qwen2.5-7B-Instruct

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Minimize negative log-likelihood of the target tokens (reasoning trace + answer).

Formally: Standard cross-entropy loss on next-token prediction.
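
Written out with the symbols from the problem definition (question q, reasoning trace t, final answer a), the objective takes the usual form below. This is the standard SFT loss, not an equation quoted from the paper. Restricting the loss to response tokens, with the question used only as context, is an assumption.

```latex
\mathcal{L}(\theta) = -\sum_{i=1}^{|y|} \log p_\theta\left(y_i \mid q,\, y_{<i}\right),
\qquad y = (t, a)
```

Here y is the concatenation of the trace and the answer, and p_θ is the student model's next-token distribution.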

Training Data:

OpenThoughts3-1.2M dataset
1.2M examples across math, code, and science
Generated using QwQ-32B teacher with best pipeline settings

Key Hyperparameters:

global_batch_size: 128
learning_rate: 2e-5
num_epochs: 1 (for final model), 3 (for ablation experiments)
max_sequence_length: 16384
lr_scheduler: cosine
warmup_ratio: 0.03
weight_decay: 0.0
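
To make the schedule-related hyperparameters concrete, here is a small sketch of the learning rate over one epoch. It assumes exactly 1,200,000 examples, one example per sequence (no packing), linear warmup, and cosine decay to zero. The summary does not state these details, so the step counts are illustrative only.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    peak_lr: float = 2e-5,
    warmup_ratio: float = 0.03,
) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


num_examples = 1_200_000      # assumed exact size of the dataset
global_batch_size = 128
total_steps = num_examples // global_batch_size    # optimizer steps in one epoch
warmup_steps = int(total_steps * 0.03)

print(total_steps, warmup_steps)
print(lr_at_step(warmup_steps, total_steps))       # learning rate at end of warmup
```

Under these assumptions one epoch is 9,375 optimizer steps, of which the first 281 are warmup.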

Compute: Not reported in the paper

Comparison to Prior Work

vs. DeepSeek-R1-Distill: OpenThoughts uses a systematically optimized data pipeline rather than just distilling one source, resulting in higher performance (e.g., +15.3 percentage points on AIME 2025).
vs. OpenR1/SkyT1: OpenThoughts employs extensive ablation (1,000+ experiments) to select optimal sourcing and filtering strategies, whereas others often rely on heuristics.
vs. Nemotron-Nano-8B: OpenThinker3-7B achieves higher average performance (54.5 vs 52.4) across 12 reasoning tasks.
vs. DeepMath-103K [not cited in paper]: DeepMath focuses on math specifically, while OpenThoughts targets a broader multi-domain (math, code, science) capability.

Limitations

Experiments rely on a fixed budget (31,600 samples) for ablations, which might not fully predict scaling behaviors at larger data sizes.
The pipeline is optimized for SFT; suitability for Reinforcement Learning (RL) post-training is not explored.
Teacher model selection is limited to a few candidates (DeepSeek-R1, QwQ-32B, Phi-4); other proprietary models were not tested.

Reproducibility

Code: https://openthoughts.ai

All datasets (OpenThoughts-114K, OpenThoughts2-1M, OpenThoughts3-1.2M) and models (OpenThinker3-7B) are publicly available at openthoughts.ai. The paper details the exact pipeline steps and hyperparameters for SFT.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on reasoning benchmarks.

Benchmarks:

AIME 2025 (High-school math competition)
LiveCodeBench (06/24-01/25) (Code generation from recent contests)
GPQA Diamond (Graduate-level science QA)
MATH500 (Mathematics)
CodeElo (Competitive programming)

Metrics:

Accuracy (Pass@1)
Statistical methodology: Not explicitly reported in the paper
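
As a reference for the metric, a minimal sketch of Pass@1 accuracy follows. It assumes each question may be scored on one or more sampled generations, with per-question correctness averaged first and then averaged across questions. The number of samples per question used in the paper is not given in this summary.

```python
def pass_at_1(correct_by_question: list[list[bool]]) -> float:
    """Mean over questions of the fraction of correct samples per question."""
    if not correct_by_question:
        raise ValueError("need at least one question")
    per_question = [sum(samples) / len(samples) for samples in correct_by_question]
    return sum(per_question) / len(per_question)


# Toy grading results: two questions, two sampled generations each.
toy_results = [[True, False], [True, True]]
toy_score = pass_at_1(toy_results)
```

With a single sample per question this reduces to plain accuracy.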

Key Results

Main comparison shows OpenThinker3-7B significantly outperforming the direct competitor (DeepSeek-R1-Distill-Qwen-7B) and other open models on key reasoning benchmarks.
Benchmark	Metric	Baseline	This Paper	Δ
AIME 2025	Accuracy	37.7	53.0	+15.3
LiveCodeBench 06/24-01/25	Accuracy	33.8	51.0	+17.2
GPQA Diamond	Accuracy	33.5	54.0	+20.5
Average (12 tasks)	Accuracy	42.1	54.5	+12.4

Ablation studies reveal optimal strategies for question mixing and teacher selection.
Benchmark	Metric	Baseline	This Paper	Δ
Code Average	Accuracy	21.6	26.9	+5.3
Math Average	Accuracy	49.6	52.2	+2.6

Main Takeaways

Teachers with higher benchmark scores do not always produce better students; QwQ-32B proved a better teacher than DeepSeek-R1 for data generation.
Data quality trumps diversity: mixing just 1-2 top question sources outperforms mixing 16 diverse sources.
Simple filtering heuristics like 'longest response' or 'LLM-rated difficulty' outperform complex embedding-based filters.
Sampling multiple answers (16x) per question is a highly effective way to scale dataset utility, more so than adding more unique questions.
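
The multiple-answer takeaway can be sketched as a simple loop that queries the teacher several times per question. The `teacher` callable below is a hypothetical stand-in for a sampled call to a model such as QwQ-32B, and the record layout is an assumption. Only the 16 samples per question comes from the summary.

```python
import random
from typing import Callable


def build_sft_examples(
    questions: list[str],
    teacher: Callable[[str], str],
    n_samples: int = 16,
) -> list[dict[str, str]]:
    """Create n_samples (question, response) records per question."""
    examples: list[dict[str, str]] = []
    for question in questions:
        for _ in range(n_samples):
            examples.append({"question": question, "response": teacher(question)})
    return examples


def fake_teacher(question: str) -> str:
    """Hypothetical stand-in: a real teacher would sample with temperature > 0."""
    return f"reasoning variant {random.randint(0, 10**6)} for: {question}"


sft_examples = build_sft_examples(["Q1", "Q2", "Q3"], fake_teacher)
```

Three questions yield 48 training records here, which shows how sampling multiplies dataset size without sourcing new questions.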

📚 Prerequisite Knowledge

Prerequisites

Supervised Fine-Tuning (SFT)
Knowledge Distillation
Chain-of-Thought (CoT) prompting

Key Terms

SFT: Supervised Fine-Tuning—training a pre-trained model on specific input-output pairs to adapt it to a task.

Distillation: Training a smaller student model to mimic the outputs of a larger, more capable teacher model.

Chain-of-Thought: A reasoning technique where models generate intermediate steps before producing a final answer.

DeepSeek-R1: A strong reasoning model used as a teacher for generating reasoning traces.

QwQ-32B: A reasoning model from the Qwen team, found in this paper to be a superior teacher despite lower benchmark scores.

AIME: American Invitational Mathematics Examination—a challenging math benchmark.

GPQA: Graduate-Level Google-Proof Q&A—a difficult science and reasoning benchmark.

fastText: A library for efficient text classification and representation learning.

LiveCodeBench: A benchmark for evaluating code generation capabilities, specifically on contest problems.