
The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data

Christina Baek, Ricardo Pio Monti, David Schwab, Amro Abbas, Rishabh Adiga, Cody Blakeney, Maximilian Böther, Paul Burstein, Aldo Gael Carranza, Alvin Deng, Parth Doshi, Vineeth Dorna, Alex Fang, Tony Jiang, Siddharth Joshi, Brett W. Larsen, Jason Chan Lee, Katherine L. Mentzer, Luke Merrick, Haakon Mongstad, Fan Pan, Anshuman Suri, Darren Teh, Jason Telanoff, Jack Urbanek, Zhengping Wang, Josh Wills, Haoli Yin, Aditi Raghunathan, J. Zico Kolter, et al.
DatologyAI
arXiv (2026)
Pretraining Reasoning Benchmark

📝 Paper Summary

Domain Adaptation · Pretraining Strategies · Continual Learning / Forgetting
Interleaving domain-specific data throughout pretraining (Specialized Pretraining) yields better performance and less overfitting than the standard practice of reserving domain data exclusively for finetuning.
Core Problem
Standard domain adaptation treats pretraining and finetuning as disjoint phases, reserving specialized data for finetuning. This often leads to rapid overfitting on the small domain dataset and catastrophic forgetting of general knowledge.
Why it matters:
  • Organizations often rely on finetuning for proprietary data (legal, medical), assuming it is the most efficient path, but this may yield suboptimal models compared to early data integration.
  • Finetuning on small corpora requires aggressive updates that degrade general capabilities, while pretraining models from scratch is often viewed as too expensive.
  • Current scaling laws do not account for the trade-off between repeating domain data during pretraining and reserving it for finetuning.
Concrete Example: A 1B model trained with standard pretraining (Web data) followed by finetuning on 'ProofPile' (math) overfits rapidly after ~5 epochs. In contrast, a model that sees ProofPile mixed into pretraining (SPT) sustains performance improvement for far longer and matches the performance of a 3B standard model.
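The interleaving strategy contrasted above can be sketched as a data-mixing step. This is a minimal illustration of the idea, not the paper's actual pipeline: the ~2% mixing fraction and the ~50x repetition cap come from the summary, while the function name, sampling scheme, and all other details are assumptions.

```python
import math
import random

def build_spt_mixture(web_docs, domain_docs, domain_frac=0.02,
                      max_repeats=50, total=100_000, seed=0):
    """Illustrative SPT mixture: interleave a small fraction of domain
    data into the general pretraining stream, repeating the small domain
    corpus as needed but capping repetitions to limit overfitting."""
    rng = random.Random(seed)
    n_domain = int(total * domain_frac)
    # Repeat the domain corpus just enough to fill its quota, capped at max_repeats.
    repeats = min(max_repeats, math.ceil(n_domain / len(domain_docs)))
    domain_part = (domain_docs * repeats)[:n_domain]
    # Fill the rest of the stream with general web data, then shuffle so the
    # domain documents are spread throughout pretraining rather than saved for the end.
    mixture = rng.choices(web_docs, k=total - len(domain_part)) + domain_part
    rng.shuffle(mixture)
    return mixture
```

With the default 2% fraction, a 100k-document stream would contain about 2k domain documents drawn from repeated passes over the small domain corpus.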
Key Novelty
Specialized Pretraining (SPT)
  • Mix a small fraction of domain-specific data (e.g., 2%) into the general pretraining corpus from the start, repeating it as necessary (up to ~50x), rather than saving it for finetuning.
  • Derives 'overfitting scaling laws' that model test loss as the sum of a learning term (a power law in tokens seen) and an overfitting term (a gap that grows with the number of data repetitions), allowing prediction of optimal data mixing ratios.
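The summary describes the overfitting scaling law only qualitatively. A minimal functional form consistent with that description (irreducible loss, plus a power-law learning term, plus a repetition-dependent overfitting gap) might look like the following; all constants and the exact shape of the gap term are illustrative assumptions, not the paper's fitted law.

```python
def spt_test_loss(tokens, repetitions, *,
                  A=10.0, alpha=0.3, E=1.5, B=0.05, beta=1.2):
    """Illustrative test-loss model: E is the irreducible loss, the
    learning term falls as a power law in tokens seen, and the
    overfitting gap grows with repetitions of the domain data.
    All constants are made up for illustration."""
    learning = A * tokens ** (-alpha)       # power-law improvement with data
    overfit_gap = B * repetitions ** beta   # grows as domain data is repeated
    return E + learning + overfit_gap
```

Under a form like this, the optimal mixing ratio balances the two terms: more repetitions of domain data improve the learning term for the domain but inflate the overfitting gap, which is what the paper's scaling laws are said to trade off.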
Evaluation Highlights
  • On the 'ProofPile' domain, a 1B parameter SPT model outperforms a 3B parameter standard model, closing more than 100% of the 1B-to-3B performance gap.
  • SPT reduces the pretraining tokens needed to reach a specific domain loss by up to 1.75x compared to standard pretraining (on MusicPile).
  • Improves downstream accuracy by up to 6 percentage points on MATH and 4 percentage points on MusicTheoryBench compared to the finetuning-only baseline.
Breakthrough Assessment
8/10
Challenges the standard industry practice of 'pretrain then finetune' for domain adaptation. Provides actionable scaling laws and demonstrates that smaller, specialized models can beat larger general models.