MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

📝 Paper Summary

Scientific Reasoning | Post-training Datasets | Instruction Tuning

The paper introduces TextbookReasoning and MegaScience, two large-scale, high-quality scientific reasoning datasets curated from textbooks and optimized public data, which enable base models to outperform official instruct models.

Core Problem

Existing open-source scientific reasoning resources are underdeveloped compared to math/coding, suffering from unreliable benchmarks, weak decontamination, low-quality web-scraped answers, and superficial distillation that promotes overthinking.

Why it matters:

Current benchmarks often use multiple-choice formats that inflate performance scores without reflecting true computational reasoning ability
Existing decontamination (n-gram overlap) is fragile, leading to significant benchmark leakage in post-training datasets
Distilling long Chain-of-Thought data from models like DeepSeek-R1 often leads to 'overthinking' and inefficiently long responses for simple scientific queries

Concrete Example: Models trained on existing multiple-choice datasets (e.g., Nemotron-Science) exhibit inflated performance on multiple-choice evaluations but struggle significantly with computational tasks, showing a disconnect between benchmark scores and actual reasoning.

Key Novelty

Dual-Source Curation with Difficulty-Based Selection

Extracts 650k questions directly from 12k university textbooks (TextbookReasoning) using a dual-standard pipeline to ensure truthful, expert-written reference answers rather than relying solely on potentially hallucinatory LLM generations
Constructs a massive mixture (MegaScience) by applying specific selection strategies—keeping all textbook data while filtering public datasets via difficulty scoring and response length—to maximize training efficiency

Architecture

[Figure: the data curation pipeline for TextbookReasoning, from PDF collection to the final decontaminated dataset.]

Evaluation Highlights

MegaScience-trained Qwen2.5-7B outperforms the official Qwen2.5-7B-Instruct by +5.2 average score across 15 scientific benchmarks
MegaScience-trained Llama-3.1-8B surpasses the official Llama-3.1-8B-Instruct by +5.1 average score
Models trained on MegaScience generate markedly shorter responses (721 tokens on average) than distillation baselines like NaturalReasoning (1,155 tokens) while achieving better performance

Breakthrough Assessment

8/10

Significant contribution to the under-served scientific domain. The move away from pure web-scraping to textbook extraction and the rigorous decontamination pipeline addresses major reliability issues in current science AI.

⚙️ Technical Details

Problem Definition

Setting: Supervised Fine-Tuning (SFT) for Scientific Reasoning

Inputs: Scientific reasoning questions q (physics, biology, chemistry, etc.)

Outputs: Reasoning steps and final answer a
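
The summary does not state the training loss explicitly. As a sketch in assumed notation (the symbols \theta, \mathcal{D}, and a_t are introduced here for illustration), the standard SFT objective for this setting minimizes the negative log-likelihood of the response tokens given the question:

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(q,\,a)\sim\mathcal{D}}
    \left[ \sum_{t=1}^{|a|} \log p_\theta\!\left(a_t \mid q,\, a_{<t}\right) \right]
```

Here q is the question, a is the reasoning steps plus final answer, a_t is its t-th token, and \mathcal{D} is the training set (TextbookReasoning or MegaScience).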

Pipeline Flow

Textbook Processing: PDF → Text → Q-A Extraction → Refinement
Public Data Processing: Selection (Length/Difficulty) → Solution Annotation
Merging & Filtering: Deduplication → Decontamination → Final Dataset
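
The deduplication step in the flow above can be illustrated with a minimal MinHash plus banding sketch. This is not the authors' implementation: the shingle size, number of hash functions, band count, and similarity threshold are all assumed values chosen for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(text: str, n: int = 3) -> set:
    """Word n-grams of the lowercased, whitespace-normalized text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash(items: set, num_perm: int = 64) -> list:
    """One minimum per salted hash function; equal minima estimate Jaccard similarity."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for s in items
        ))
    return signature


def deduplicate(texts, num_perm: int = 64, bands: int = 16, threshold: float = 0.8):
    """Keep the first text of every near-duplicate group (all parameters are assumptions)."""
    rows = num_perm // bands
    buckets = defaultdict(list)  # (band index, band values) -> indices of kept texts
    signatures = {}
    kept = []
    for idx, text in enumerate(texts):
        sig = minhash(shingles(text), num_perm)
        keys = [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]
        # Only texts sharing at least one band are compared in full.
        candidates = {j for key in keys for j in buckets[key]}
        is_duplicate = any(
            sum(x == y for x, y in zip(sig, signatures[j])) / num_perm >= threshold
            for j in candidates
        )
        if is_duplicate:
            continue
        kept.append(text)
        signatures[idx] = sig
        for key in keys:
            buckets[key].append(idx)
    return kept


questions = [
    "What is the SI unit of force in mechanics?",
    "what is the SI unit of  force in mechanics?",  # differs only in case and spacing
    "Explain why the sky appears blue.",
]
unique_questions = deduplicate(questions)
```

Banding keeps the comparison cost low: two questions are compared in full only when at least one band of their signatures matches exactly.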

System Modules

Extractor (Textbook Processing)

Extract Question-Answer pairs from textbook chunks using dual criteria (high/low standard)

Model or implementation: Llama3.3-70B-Instruct

Refiner (Textbook Processing)

Improve Q-A pairs by adding reasoning steps and fixing missing information based on source text

Model or implementation: DeepSeek-V3

Decontaminator

Remove questions that overlap with 15 downstream benchmarks

Model or implementation: BGE-large-en-v1.5 (retrieval) + Llama3.3-70B-Instruct (verification)
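
The retrieve-then-verify structure of this module can be sketched as follows. The bag-of-words embedder and the similarity-threshold judge are toy stand-ins so the example runs without model downloads; in the paper the retriever is BGE-large-en-v1.5 and the verifier is Llama3.3-70B-Instruct. The `top_k` value and all function names are assumptions.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def toy_embed(text: str) -> Dict[str, float]:
    """Stand-in embedder (bag of lowercase words), NOT the real BGE model."""
    return dict(Counter(text.lower().split()))


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def decontaminate(
    train_questions: List[str],
    benchmark_questions: List[str],
    is_paraphrase: Callable[[str, str], bool],
    embed: Callable[[str], Dict[str, float]] = toy_embed,
    top_k: int = 5,  # assumed retrieval depth
) -> List[str]:
    """Drop a training question if the verifier flags any of its top-k retrieved neighbours."""
    bench = [(b, embed(b)) for b in benchmark_questions]
    kept = []
    for question in train_questions:
        q_vec = embed(question)
        neighbours = sorted(bench, key=lambda bv: cosine(q_vec, bv[1]), reverse=True)[:top_k]
        if any(is_paraphrase(question, b) for b, _ in neighbours):
            continue
        kept.append(question)
    return kept


def toy_judge(a: str, b: str) -> bool:
    """Stand-in for the LLM verifier: flags pairs with very high lexical overlap."""
    return cosine(toy_embed(a), toy_embed(b)) > 0.8


train = [
    "What is the boiling point of water at sea level?",
    "State Newton's second law of motion.",
]
benchmarks = [
    "What is the boiling point of water at sea level?",
    "Name the powerhouse of the cell.",
]
clean = decontaminate(train, benchmarks, toy_judge)
```

The verifier only sees what the retriever returns, so a paraphrase that falls outside the top-k neighbours is never checked.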

Selector

Select high-quality subsets from public datasets (NaturalReasoning, Nemotron-Science)

Model or implementation: Qwen2.5-32B-Instruct (as judge for difficulty)
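
A minimal sketch of the two selection strategies follows. The summary names difficulty scoring and response length as the criteria but does not give the details, so the 0-10 judge scale, the score band, the preference for longer responses, and the field names `judge_scores` and `response` are all assumptions.

```python
from statistics import mean
from typing import Dict, List


def difficulty_select(items: List[Dict], min_score: float = 1.0, max_score: float = 9.0) -> List[Dict]:
    """Keep items whose mean judge score falls inside an assumed band.

    `judge_scores` holds hypothetical 0-10 ratings of sampled model answers.
    A very high mean suggests the question is too easy; a very low mean
    suggests it is too hard or its reference answer is noisy.
    """
    return [it for it in items if min_score <= mean(it["judge_scores"]) <= max_score]


def length_select(items: List[Dict], keep: int) -> List[Dict]:
    """Keep the `keep` items with the longest responses (whitespace word count)."""
    return sorted(items, key=lambda it: len(it["response"].split()), reverse=True)[:keep]


scored = [
    {"question": "easy", "judge_scores": [10, 10, 9.5]},
    {"question": "medium", "judge_scores": [5, 6, 4]},
    {"question": "noisy", "judge_scores": [0, 0, 1]},
]
selected = difficulty_select(scored)

answered = [
    {"response": "step one step two answer"},
    {"response": "answer"},
    {"response": "step answer"},
]
longest = length_select(answered, keep=2)
```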

Novel Architectural Elements

Dual-standard extraction strategy for textbooks (extracting both high-complexity and general questions to ensure coverage)
LLM-based decontamination pipeline using retrieval + zero-shot paraphrase detection rather than just n-gram overlap
Hybrid data selection strategy: retaining all textbook data while applying difficulty/random selection to public web data

Modeling

Base Models: Llama-3.1-8B, Qwen2.5-7B, Qwen2.5-14B, Qwen2.5-32B, Qwen2.5-72B, Qwen2.5-Math-7B, Qwen2.5-Math-72B

Training Method: Supervised Fine-Tuning (SFT)

Adaptation: Full fine-tuning

Trainable Parameters: All parameters

Training Data:

TextbookReasoning: 651k questions from 12.8k textbooks
MegaScience: 1.25M instances (TextbookReasoning + subsets of NaturalReasoning and Nemotron-Science)

Key Hyperparameters:

global_batch_size: 128
learning_rate: 2e-5 (for 7B/8B models), 1e-5 (for larger models)
lr_scheduler: cosine
warmup_ratio: 0.03
sequence_length: 8192 (for 7B/8B), 4096 (for others)
epochs: 3
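
To make the schedule concrete, the sketch below evaluates a cosine learning-rate schedule with linear warmup using the listed values for the 7B/8B setting. The exact warmup and decay formula used by the authors' training framework is not given in the summary, so `lr_at` is an assumed, common formulation. The step count assumes one sample per sequence (no packing) on the 1.25M-instance MegaScience mixture.

```python
import math

CONFIG = {
    "global_batch_size": 128,
    "learning_rate": 2e-5,   # 7B/8B models; 1e-5 for larger models
    "lr_scheduler": "cosine",
    "warmup_ratio": 0.03,
    "sequence_length": 8192,  # 7B/8B models; 4096 for others
    "epochs": 3,
}


def lr_at(step: int, total_steps: int, peak_lr: float, warmup_ratio: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed formulation)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Assumption: 1.25M instances, one instance per sequence, no packing.
steps_per_epoch = math.ceil(1_250_000 / CONFIG["global_batch_size"])
total_steps = steps_per_epoch * CONFIG["epochs"]
warmup_steps = int(total_steps * CONFIG["warmup_ratio"])
peak = lr_at(warmup_steps, total_steps, CONFIG["learning_rate"], CONFIG["warmup_ratio"])
final = lr_at(total_steps, total_steps, CONFIG["learning_rate"], CONFIG["warmup_ratio"])
```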

Compute: 8 × H800 GPUs

Comparison to Prior Work

vs. NaturalReasoning: MegaScience uses textbook data as a core component rather than just web data, ensuring higher truthfulness.
vs. Nemotron-Science: MegaScience avoids the multiple-choice format for training to prevent performance inflation and improve actual computational reasoning.
vs. Distillation-only approaches: MegaScience creates shorter, more concise responses (721 tokens) compared to raw distillation from reasoning models (often >1k tokens) while maintaining higher performance.

Limitations

The evaluation framework relies on specific answer extraction strategies which might still miss some valid formats.
The decontamination process, while rigorous, relies on the retrieval quality of BGE-large; if the retriever misses a paraphrase, the LLM verifier won't see it.
Requires significant compute (H800s) for the data curation pipeline (processing 12k textbooks).
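
The first limitation can be illustrated with a small rule-based extractor. The two patterns below are hypothetical examples of answer formats, not the rules used by the paper's evaluation framework; the point is that any fixed pattern list returns nothing for a valid answer written in an unanticipated form.

```python
import re
from typing import Optional

# Hypothetical answer formats, tried in order of priority.
PATTERNS = [
    re.compile(r"\\boxed\{([^{}]+)\}"),
    re.compile(r"(?i)the (?:final )?answer is[:\s]*(.+?)\.?\s*$", re.MULTILINE),
]


def extract_answer(response: str) -> Optional[str]:
    """Return the last match of the first pattern that matches, or None."""
    for pattern in PATTERNS:
        matches = pattern.findall(response)
        if matches:
            return matches[-1].strip()
    return None


boxed = extract_answer(r"Adding the terms gives \boxed{42}")
phrase = extract_answer("So the answer is 9.8 m/s^2.")
missed = extract_answer("Approximately 3 moles of gas are produced.")
```

The third response contains a valid answer but matches neither pattern, so it would be scored as unanswered.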

Reproducibility

Code: https://github.com/GAIR-NLP/lm-open-science-evaluation

The code is publicly available at the repository above. The data curation code, evaluation system, datasets, and seven trained models are released. Prompt templates for extraction and refinement are provided in the paper's appendix.

📊 Experiments & Results

Evaluation Setup

Zero-shot and few-shot evaluation across diverse scientific disciplines

Benchmarks:

MMLU (General knowledge, STEM subset)
GPQA (Graduate-level science QA)
SciBench (Complex scientific computation)
MATH (Mathematics problems)
PubMedQA (Biomedical QA)

Metrics:

Accuracy (Acc)
Statistical methodology: Not explicitly reported in the paper

Key Results

MegaScience consistently improves performance over official Instruct models across Llama and Qwen families on the average of 15 science benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
Average (15 benchmarks)	Acc	58.1	63.2	+5.1
Average (15 benchmarks)	Acc	63.2	68.4	+5.2
Average (15 benchmarks)	Acc	73.2	78.2	+5.0
SciBench	Acc	13.6	57.7	+44.1

Ablation studies show TextbookReasoning alone is a strong contender, often beating other single sources.

Benchmark	Metric	Baseline	This Paper	Δ
Average (15 benchmarks)	Acc	61.6	67.0	+5.4
Average (15 benchmarks)	Acc	56.4	67.0	+10.6
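
As a quick consistency check, the reported Δ column can be recomputed from the Baseline and This Paper columns. The row values are copied from the table; the helper name `check_deltas` is introduced here for illustration.

```python
# (benchmark, baseline, this paper, reported delta), copied from the table
ROWS = [
    ("Average (15 benchmarks)", 58.1, 63.2, 5.1),
    ("Average (15 benchmarks)", 63.2, 68.4, 5.2),
    ("Average (15 benchmarks)", 73.2, 78.2, 5.0),
    ("SciBench", 13.6, 57.7, 44.1),
    ("Average (15 benchmarks)", 61.6, 67.0, 5.4),
    ("Average (15 benchmarks)", 56.4, 67.0, 10.6),
]


def check_deltas(rows):
    """Return the rows whose reported delta differs from (ours - baseline) at one decimal."""
    return [
        (name, baseline, ours, reported)
        for name, baseline, ours, reported in rows
        if round(ours - baseline, 1) != reported
    ]


mismatches = check_deltas(ROWS)
```

All six rows are internally consistent, so `mismatches` is empty.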

Main Takeaways

Textbooks provide higher information density and reliability than web-scraped data, leading to better post-training performance.
The 'Difficulty Selection' method works best for filtering noisy web datasets (Nemotron), while 'Random Selection' suffices for higher-quality pools (NaturalReasoning).
Training on MegaScience yields shorter, more concise responses (avg 721 tokens) compared to distillation baselines (avg 1155 tokens) while achieving higher accuracy, indicating more efficient reasoning.
Scaling benefit: MegaScience shows effectiveness not just for small models but also for larger models (72B), maintaining significant gains over official instruct versions.
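
The response-length comparison above can be expressed as a relative reduction. This is simple arithmetic on the two reported averages; the function name is illustrative.

```python
def relative_reduction(baseline_tokens: float, new_tokens: float) -> float:
    """Fraction by which the new average length is shorter than the baseline."""
    return (baseline_tokens - new_tokens) / baseline_tokens


# Reported averages: 1,155 tokens (distillation baseline) vs 721 tokens (MegaScience).
reduction = relative_reduction(1155, 721)
print(f"{reduction:.1%}")  # prints 37.6%
```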

📚 Prerequisite Knowledge

Prerequisites

Instruction Tuning / Supervised Fine-Tuning (SFT)
Chain of Thought (CoT) prompting
Data decontamination techniques
LLM distillation

Key Terms

TextbookReasoning: A new dataset of 650k reasoning questions extracted from 12k university-level textbooks with truthful reference answers

MegaScience: A composite dataset of 1.25M instances combining TextbookReasoning with filtered subsets of public datasets (NaturalReasoning, Nemotron-Science)

SFT: Supervised Fine-Tuning—training a pre-trained base model on labeled examples to follow instructions

Decontamination: The process of removing training data that overlaps with test benchmarks so that evaluation scores are not inflated by leakage; this paper uses embedding-based retrieval followed by LLM verification

CoT: Chain of Thought—prompting models to generate intermediate reasoning steps before the final answer

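DeepSeek-R1: A strong reasoning model whose long Chain-of-Thought outputs are often distilled into post-training datasets; the paper cites such distillation as a source of overthinking, and its own refinement step uses DeepSeek-V3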

Locality-sensitive min-hashing: A technique used for deduplicating text data by efficiently estimating the similarity between sets

Pass@1: An evaluation metric measuring the percentage of problems where the model's first generated answer is correct
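
As a minimal illustration of the Pass@1 definition above (the function name and input format are assumptions), the metric is the share of problems whose first generated answer was judged correct:

```python
from typing import Sequence


def pass_at_1(first_attempt_correct: Sequence[bool]) -> float:
    """Percentage of problems whose first generated answer is correct."""
    if not first_attempt_correct:
        raise ValueError("need at least one problem")
    return 100.0 * sum(first_attempt_correct) / len(first_attempt_correct)


score = pass_at_1([True, False, True, True])  # 3 of 4 problems solved on the first try
```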