Improving Pretraining Data Using Perplexity Correlations

Tristan Thrush, Christopher Potts, Tatsunori Hashimoto
Department of Computer Science, Stanford University
arXiv (2024)
Pretraining Benchmark

📝 Paper Summary

Pretraining Data Selection Data-Centric AI Scaling Laws
High-quality pretraining data can be selected without training proxy models by identifying text domains where lower perplexity in existing public LLMs correlates with better downstream benchmark scores.
Core Problem
Selecting optimal pretraining data typically requires training many expensive proxy models to evaluate different data mixtures, while cheaper heuristic methods often underperform.
Why it matters:
  • Training runs for data selection are prohibitively expensive (costing millions for large scale), limiting who can perform data research
  • Current lightweight methods (like deduplication or simple classifiers) do not consistently match the performance of hand-curated datasets
  • As datasets grow to 240T+ tokens, identifying the high-value subsets is critical for model performance
Concrete Example: A researcher wants to train a model for science QA. Traditional methods would require training multiple small models on different web dumps to see which works best. This approach instead uses existing models (like Mistral and Llama) to observe that when a model finds arxiv.org predictable (low loss), it tends to score well on SciQ, so it selects arxiv.org without training anything new.
Key Novelty
Perplexity Correlation-based Data Selection
  • Treats the population of existing open-weight LLMs as a statistical instrument rather than training new proxy models
  • Calculates the correlation between a model's test loss on a specific data domain (e.g., BBC, arXiv) and its score on a target benchmark
  • Selects data domains that have the strongest negative correlation (lower loss = higher benchmark score) to construct a new training distribution
Evaluation Highlights
  • Outperforms DSIR (a popular n-gram-based data selection method) on all 8 benchmarks in controlled 160M-parameter experiments
  • Matches the performance of the best hand-engineered classifier from DataComp-LM (OH-2.5 + ELI5 fastText) without any parameter tuning or human curation
  • Validated at the 1.4B-parameter scale on an aggregate of 22 benchmarks, with performance gains that grow with model scale
Breakthrough Assessment
8/10
Offers a radically cheaper alternative to traditional data selection by leveraging the 'sunk cost' of existing open models. Theoretically grounded and empirically competitive with heavy hand-tuned baselines.