NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions

📝 Paper Summary

Synthetic Data Generation, Reasoning Benchmarks, Knowledge Distillation

NaturalReasoning is a dataset of 2.8 million synthetic reasoning questions generated from pretraining corpora that enables more sample-efficient model training than existing math-focused datasets due to its diversity and complexity.

Core Problem

Existing reasoning datasets are limited to narrow domains (mostly math/coding) with short, easy-to-verify solutions, failing to cover broader, open-ended reasoning tasks required for general intelligence.

Why it matters:

Scaling reasoning beyond math and coding is hindered by a lack of diverse, high-quality questions
Current synthetic datasets (e.g., MetaMathQA) are derived from existing benchmarks, limiting novelty
Simple data scaling with low-quality or narrow data (like WebInstruct) yields diminishing returns or performance plateaus

Concrete Example: While OpenMathInstruct-2 trains models effectively for pure math (MATH benchmark), its narrow focus causes performance to plateau or fluctuate on broader scientific reasoning tasks like GPQA and MMLU-Pro, whereas NaturalReasoning improves across all three.

Key Novelty

NaturalReasoning (Backtranslated Reasoning Data)

Uses LLMs to annotate raw pretraining documents for 'reasoning traces' rather than just extracting existing questions
Synthesizes completely new, self-contained questions based on high-reasoning documents (backtranslation), creating novel problems not present in the source text
Generates reference answers and teacher responses (via Llama-3-70B-Instruct) to support both supervised fine-tuning and reinforcement learning

Architecture

The data creation pipeline for NaturalReasoning

Evaluation Highlights

With only 1.5M samples, a Llama-3.1-8B model trained on NaturalReasoning outperforms the official Llama-3.1-8B-Instruct (trained on much more data) across averaged reasoning benchmarks.
93% of NaturalReasoning questions are rated 'high quality' by judge models, surpassing the next best dataset (OpenMathInstruct-2), which scored 79%.
Has the longest median response length (434 words) among the compared datasets (e.g., 46 words for OpenMathInstruct-2), suggesting higher reasoning complexity.

Breakthrough Assessment

8/10

Significantly diversifies reasoning data beyond math/code. The method of synthesizing questions *about* complex documents rather than extracting them is a scalable, high-quality data recipe.

⚙️ Technical Details

Problem Definition

Setting: Synthetic dataset generation and Supervised Fine-Tuning (SFT) for reasoning

Inputs: Raw documents d from pretraining corpora (DCLM-baseline, FineMath)

Outputs: Synthetic reasoning question q and reference answer a

Pipeline Flow

Document Selection: LLM annotates pretraining docs for reasoning depth
Question Synthesis: LLM generates challenging question q based on high-scoring docs
Answer Verification: LLM verifies if answer a is derivable from doc d
Teacher Generation: Strong model (Llama-3-70B-Instruct) generates detailed response
Filtration: Decontamination against benchmarks and deduplication
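
The five steps above can be sketched as a single loop. This is a minimal illustration, not the paper's implementation: every callable (score_reasoning, synthesize, is_derivable, teacher), the min_score threshold, and the exact-match filtration on normalized question text are assumptions introduced here. The sketch records the verification result as a flag rather than dropping unverified answers, since the summary does not say that unverifiable answers are discarded.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set, Tuple


@dataclass
class Example:
    question: str
    reference_answer: str
    answer_verified: bool
    teacher_response: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplistic matching key)."""
    return " ".join(text.lower().split())


def build_dataset(
    docs: Iterable[str],
    score_reasoning: Callable[[str], float],       # hypothetical LLM annotator
    synthesize: Callable[[str], Tuple[str, str]],  # hypothetical: doc -> (question, answer)
    is_derivable: Callable[[str, str], bool],      # hypothetical verifier: (doc, answer)
    teacher: Callable[[str], str],                 # hypothetical teacher model call
    benchmark_questions: Set[str],
    min_score: float = 0.5,                        # assumed threshold, not from the paper
) -> List[Example]:
    blocked = {normalize(q) for q in benchmark_questions}
    seen: Set[str] = set()
    dataset: List[Example] = []
    for doc in docs:
        # 1. Document selection: keep only documents with enough reasoning depth.
        if score_reasoning(doc) < min_score:
            continue
        # 2. Question synthesis: a new, self-contained question plus reference answer.
        question, answer = synthesize(doc)
        # 3. Answer verification: is the answer derivable from the document?
        verified = is_derivable(doc, answer)
        # 4. Teacher generation: detailed response used for distillation.
        response = teacher(question)
        # 5. Filtration: decontaminate against benchmarks, then deduplicate.
        key = normalize(question)
        if key in blocked or key in seen:
            continue
        seen.add(key)
        dataset.append(Example(question, answer, verified, response))
    return dataset
```

In practice the filtration step would likely use fuzzier matching (for example n-gram overlap) than the exact normalized match shown here, and would run before the teacher call to save compute.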

System Modules

Reasoning Annotator (Data Generation)

Identify documents containing sophisticated reasoning traces

Model or implementation: LLM (specific model not explicitly named for this step, likely Llama-3 class)

Question Synthesizer (Data Generation)

Compose self-contained reasoning questions based on document content

Model or implementation: LLM

Teacher Model (Data Generation)

Generate high-quality Chain-of-Thought responses for distillation

Model or implementation: Llama-3-70B-Instruct

Novel Architectural Elements

Pipeline emphasizes synthesizing *novel* questions grounded in document logic rather than extracting existing questions (unlike WebInstruct)

Modeling

Base Model: Llama-3.1-8B-Base and Qwen2.5-7B

Training Method: Supervised Fine-Tuning (SFT)

Adaptation: Full fine-tuning

Training Data:

NaturalReasoning (2.8M pairs)
Comparisons: OpenMathInstruct-2, WebInstruct, NuminaMath, MetaMathQA

Key Hyperparameters:

learning_rate: 5e-6
batch_size: 128
epochs: 3
schedule: cosine (final LR is 1% of peak)
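
The cosine schedule with a floor at 1% of the peak can be written out explicitly. The sketch below assumes no warmup phase and per-step decay over the whole run, neither of which is stated in the summary; the function name cosine_lr and its parameters are illustrative.

```python
import math


def cosine_lr(step: int, total_steps: int,
              peak_lr: float = 5e-6, final_ratio: float = 0.01) -> float:
    """Cosine decay from peak_lr down to final_ratio * peak_lr (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    final_lr = peak_lr * final_ratio
    # cos goes from 1 to -1, so the bracket goes from 2 to 0.
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```

With the listed values, the rate starts at 5e-6 and ends at 5e-8.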

Compute: Not reported in the paper

Comparison to Prior Work

vs. OpenMathInstruct-2: NaturalReasoning covers diverse domains (Law, Econ, STEM) vs. pure Math, preventing plateauing on general reasoning benchmarks
vs. WebInstruct: Synthesizes novel questions via backtranslation vs. extracting existing Q-A pairs from crawled pages
vs. OpenThoughts [not cited in paper]: NaturalReasoning uses document-grounded synthesis from pre-training data vs. OpenThoughts which often relies on distilling from stronger models on existing prompts

Limitations

Evaluated primarily on math/science benchmarks (MATH, GPQA), less focus on creative writing or humanities
Reference answers generated by LLMs may contain noise (though 81.68% are verifiable against source)
Teacher model (Llama-3-70B-Instruct) places an upper bound on the distilled reasoning capability

Reproducibility

Dataset: https://huggingface.co/datasets/facebook/natural_reasoning

Dataset released at HuggingFace. Training uses fairseq2 recipes. Specific prompts for data generation provided in Appendix I. Teacher model is open-weight Llama-3-70B-Instruct.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation using greedy decoding

Benchmarks:

MATH (Mathematics problems)
GPQA (Graduate-level science reasoning)
MMLU-Pro (Diverse multi-task language understanding)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper
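
As a rough illustration of the accuracy metric, the sketch below scores predictions against references by exact match after light normalization. The normalization rule (strip and lowercase) is an assumption; the summary does not describe how answers are extracted or compared for each benchmark.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions that match their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        p.strip().lower() == r.strip().lower()  # assumed matching rule
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)
```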

Key Results

Data efficiency scaling trends show NaturalReasoning achieving better performance with fewer samples than the baselines.

Benchmark	Metric	Baseline	This Paper	Δ
GPQA	Accuracy	25.37	30.0	+4.63
MATH	Accuracy	59.25	51.0	-8.25
Quality Check	High Quality %	79	93	+14
Response Length Proxy	Median Word Count	46	434	+388
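
The Δ column is simply "This Paper" minus "Baseline". A short check of the arithmetic, using only the values in the rows above (the variable and function names are illustrative):

```python
# (benchmark, baseline, this paper) copied from the results table.
ROWS = [
    ("GPQA", 25.37, 30.0),
    ("MATH", 59.25, 51.0),
    ("Quality Check", 79, 93),
    ("Response Length Proxy", 46, 434),
]


def deltas(rows):
    """Return {benchmark: this_paper - baseline}, rounded to two decimals."""
    return {name: round(ours - base, 2) for name, base, ours in rows}
```

Note that the MATH delta is negative: the baseline is ahead on that benchmark.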

Experiment Figures

Scaling trends (Accuracy vs Data Size) on MATH, GPQA, and MMLU-Pro for Llama-3.1-8B trained on different datasets

UMAP visualization of question embeddings comparing NaturalReasoning and WebInstruct

Main Takeaways

NaturalReasoning is more sample-efficient: a Llama-3.1-8B model trained on 1.5M samples outperforms the official instruction-tuned Llama-3.1-8B-Instruct on average across benchmarks.
Diversity matters: While OpenMathInstruct-2 wins on MATH, it plateaus on GPQA/MMLU-Pro. NaturalReasoning improves across all, showing better generalization.
Complexity proxy: The dataset elicits much longer chain-of-thought responses (median 434 words) than competitors, suggesting deeper reasoning requirements.
Topic coverage: Clustering analysis shows NaturalReasoning covers Law, Physics, and CS densely, whereas WebInstruct is skewed heavily toward Math.
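
The response-length proxy mentioned above is a median word count over responses. A minimal sketch, assuming words are whitespace-separated tokens (the summary does not specify the tokenization used):

```python
from statistics import median
from typing import Iterable


def median_word_count(responses: Iterable[str]) -> float:
    """Median number of whitespace-separated words per response."""
    return median(len(r.split()) for r in responses)
```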

📚 Prerequisite Knowledge

Prerequisites

Knowledge Distillation
Supervised Fine-Tuning (SFT)
Reinforcement Learning with Verifiable Rewards (RLVR)
Backtranslation for data augmentation

Key Terms

SFT: Supervised Fine-Tuning—training a pre-trained model on a specific dataset of labeled examples to adapt it to a downstream task

Backtranslation: In this context, generating a question that would result in the source document content as the answer, effectively reversing the generation process to create training data

Knowledge Distillation: Transferring knowledge from a large, capable 'teacher' model (e.g., Llama-3-70B) to a smaller 'student' model (e.g., Llama-3.1-8B)

Chain-of-Thought: A prompting technique where the model generates intermediate reasoning steps before producing the final answer

RLVR: Reinforcement Learning with Verifiable Rewards—using objective correctness checks (like matching a reference answer) to guide model training

GPQA: A challenging QA benchmark requiring graduate-level reasoning in biology, physics, and chemistry

MMLU-Pro: An enhanced version of the MMLU benchmark designed to be more difficult and robust, covering diverse subjects

Greedy decoding: A generation strategy where the model always selects the highest-probability token at each step, ensuring deterministic output

Deduplication: Removing duplicate or near-duplicate entries from a dataset to prevent overfitting and ensure diversity

Self-training: A method where a model generates its own training data or rewards to improve its performance without external human labels
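
The greedy decoding entry above can be illustrated with a toy loop. This sketch assumes a hypothetical next_token_scores callable that maps the tokens generated so far to one score per vocabulary entry; it stands in for a real model's forward pass.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token_scores: Callable[[List[int]], Sequence[float]],  # hypothetical model call
    prompt: List[int],
    eos_id: int,
    max_new_tokens: int = 32,
) -> List[int]:
    """Append the highest-scoring token at each step until EOS or the length cap."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        # argmax; ties resolve to the lowest token id, so output is deterministic.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```

Because no sampling is involved, repeated runs with the same model and prompt produce the same output, which is why the evaluation setup uses it.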