Improving Pretraining Data Using Perplexity Correlations

Tristan Thrush, Christopher Potts, Tatsunori Hashimoto
Department of Computer Science, Stanford University
arXiv (2024)
Pretraining Benchmark

📝 Paper Summary

Pretraining Data Selection Data-Centric AI Scaling Laws
High-quality pretraining data can be selected without training proxy models by identifying text domains where lower perplexity in existing public LLMs correlates with better downstream benchmark scores.
Core Problem
Selecting optimal pretraining data typically requires training many expensive proxy models to evaluate different data mixtures, while cheaper heuristic methods often underperform.
Why it matters:
  • Training runs for data selection are prohibitively expensive (costing millions for large scale), limiting who can perform data research
  • Current lightweight methods (like deduplication or simple classifiers) do not consistently match the performance of hand-curated datasets
  • As datasets grow to 240T+ tokens, identifying the high-value subsets is critical for model performance
Concrete Example: A researcher wants to train a model for science QA. Traditional methods would require training multiple small models on different web dumps to see which works best. This approach instead uses existing models (like Mistral and Llama) to observe that when a model finds arxiv.org predictable (low loss), it tends to score well on SciQ, so it selects arxiv.org without training anything new.
Key Novelty
Perplexity Correlation-based Data Selection
  • Treats the population of existing open-weight LLMs as a statistical instrument rather than training new proxy models
  • Calculates the correlation between a model's test loss on a specific data domain (e.g., BBC, arXiv) and its score on a target benchmark
  • Selects data domains that have the strongest negative correlation (lower loss = higher benchmark score) to construct a new training distribution
Evaluation Highlights
  • Outperforms DSIR (a popular n-gram-based data selection method) on all 8 benchmarks in controlled 160M-parameter experiments
  • Matches the performance of the best hand-engineered classifier from DataComp-LM (OH-2.5 + ELI5 fastText) without any parameter tuning or human curation
  • Validated at the 1.4B-parameter scale on an aggregate of 22 benchmarks, with performance gains that grow with model scale
Breakthrough Assessment
8/10
Offers a radically cheaper alternative to traditional data selection by leveraging the 'sunk cost' of existing open models. Theoretically grounded and empirically competitive with heavy hand-tuned baselines.