Evaluation Setup
Pretrained, SFT, and RL-trained models are evaluated on downstream reasoning benchmarks.
Benchmarks:
- Math Competitions (Mathematical Reasoning)
- Scientific QA (Science Reasoning)
- Code (Software Engineering)
- General Reasoning (Broad reasoning tasks)
Metrics:
- Accuracy
- Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Expert-level benchmarks (Average) | Accuracy Gain | 0.0 | 19.0 | +19.0 |
| Reasoning Tasks (Pretraining Phase) | Accuracy Gain | 0.0 | 11.0 | +11.0 |
| Reasoning Tasks (SFT Phase) | Accuracy Gain | 0.0 | 15.0 | +15.0 |
| Downstream Accuracy | Accuracy Gain | 0.0 | 4.0 | +4.0 |
| Mathematical Reasoning | Accuracy Change | 0.0 | -5.0 | -5.0 |
Main Takeaways
- Front-loading reasoning data into pretraining is essential; SFT alone cannot 'catch up' to a model pretrained with reasoning foundations.
- Asymmetric principle: pretraining benefits most from diversity and scale, while SFT benefits most from quality and complexity.
- Naively scaling SFT data ('more is better') is harmful; quality filters are critical in the post-training stage.
- High-quality data in pretraining has a 'latent' effect: its value is fully realized only after alignment (SFT).
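The quality-filtering takeaway can be sketched as a toy curation pass over SFT examples. Note this is a minimal illustration, not the paper's method: the `complexity_score` heuristic (word count plus a crude step count) and the `keep_fraction` threshold are invented here for demonstration.

```python
# Toy sketch of quality-over-quantity filtering for SFT data.
# The scoring heuristic and threshold are illustrative assumptions,
# not the filtering criteria used in the paper.

def complexity_score(example: dict) -> float:
    """Crude proxy for reasoning complexity: favor longer, multi-step responses."""
    response = example["response"]
    steps = response.count("\n") + 1           # rough count of reasoning steps
    return 0.1 * len(response.split()) + steps

def filter_sft_data(examples: list[dict], keep_fraction: float = 0.5) -> list[dict]:
    """Keep only the top `keep_fraction` of examples ranked by complexity score."""
    ranked = sorted(examples, key=complexity_score, reverse=True)
    cutoff = max(1, int(len(ranked) * keep_fraction))
    return ranked[:cutoff]

data = [
    {"response": "42"},  # low quality: bare answer, no reasoning
    {"response": "Step 1: expand the square.\nStep 2: simplify.\nAnswer: 42"},
]
filtered = filter_sft_data(data, keep_fraction=0.5)
print(len(filtered))  # 1 -- only the multi-step example survives
```

In this framing, the opposite policy applies at pretraining time: the same filter would be counterproductive there, since that stage benefits from diversity and scale rather than aggressive pruning.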