Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data

Syeda Nahida Akter, Shrimai Prabhumoye, Eric Nyberg, M. Patwary, M. Shoeybi, Yejin Choi, Bryan Catanzaro
NVIDIA
arXiv.org (2025)
Pretraining · Reasoning · RL

📝 Paper Summary

LLM · Pretraining · Reasoning · Data scaling · Data curriculum
Injecting reasoning data during pretraining creates a foundational advantage that later supervised fine-tuning cannot replicate, with pretraining benefiting most from diversity and fine-tuning from high quality.
Core Problem
Current LLM development often treats reasoning as a specialized skill added during post-training (SFT/RL), but it is unclear if this late injection is optimal compared to incorporating reasoning data during pretraining.
Why it matters:
  • The community focuses on post-training due to the high cost of pretraining experiments, leaving a knowledge gap about early-stage data synergy
  • If pretraining establishes a ceiling on reasoning capability, then optimizing only post-training data recipes creates fundamentally limited models
  • Blindly scaling SFT data without a strong pretrained reasoning foundation might be inefficient or even detrimental to model performance
Concrete Example: A model pretrained only on general data (Common Crawl) might fail to solve complex math problems even after intensive fine-tuning, whereas a model exposed to diverse reasoning patterns during pretraining can unlock significantly higher performance with the same fine-tuning.
Key Novelty
Asymmetric Data Allocation Strategy
  • Demonstrates that pretraining and SFT have different optimal data compositions: pretraining requires scale and diversity (broad exposure), while SFT requires high quality and complexity (precise refinement)
  • Identifies a 'latent effect' where high-quality pretraining data yields minimal immediate gains but significantly boosts the effectiveness of subsequent alignment/SFT
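The asymmetric recipe above can be sketched as a toy data-selection policy. This is a minimal illustration, not the paper's actual pipeline: the field names, the round-robin diversity heuristic, and the quality threshold are all assumptions made for the sketch.

```python
from collections import defaultdict

def build_pretraining_mix(samples, budget):
    """Diversity-first selection for pretraining: round-robin across
    reasoning sources so the mixture covers as many domains as possible
    (breadth over per-sample quality)."""
    by_source = defaultdict(list)
    for s in samples:
        by_source[s["source"]].append(s)
    pools = list(by_source.values())
    mix, i = [], 0
    while len(mix) < budget and any(pools):
        pool = pools[i % len(pools)]
        if pool:
            mix.append(pool.pop())
        i += 1
    return mix

def build_sft_set(samples, quality_threshold=0.8):
    """Quality-first selection for SFT: keep only high-quality examples,
    even if the surviving set covers fewer sources."""
    return [s for s in samples if s["quality"] >= quality_threshold]

# Hypothetical pool: each sample tagged with a source domain and a quality score.
samples = [
    {"source": "math",  "quality": 0.9},
    {"source": "math",  "quality": 0.5},
    {"source": "code",  "quality": 0.95},
    {"source": "logic", "quality": 0.4},
]
mix = build_pretraining_mix(samples, budget=3)  # one sample per source
sft = build_sft_set(samples)                    # only the 0.9 and 0.95 samples
```

The point of the contrast: `build_pretraining_mix` happily admits the low-quality logic sample to widen coverage, while `build_sft_set` discards it, mirroring the paper's finding that scale/diversity pays off early and quality/complexity pays off late.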
Evaluation Highlights
  • +19% average gain on expert-level benchmarks when reasoning data is front-loaded into pretraining compared to adding it only during post-training
  • Pretraining with diverse reasoning data yields +11% gain, while SFT on high-quality data yields +15%, confirming an asymmetric optimal strategy
  • Naive scaling of mixed-quality SFT data degrades mathematical reasoning by -5% on average, whereas high-quality data consistently improves it
Breakthrough Assessment
9/10
Provides the first systematic, controlled study of reasoning data allocation across full pretraining and post-training, challenging the dominant industry paradigm of 'general pretraining + reasoning finetuning' and offering a clear new recipe.