Data, Data Everywhere: A Guide for Pretraining Dataset Construction

📝 Paper Summary

Pretraining dataset creation, Data curation and filtering, Data mixing and sampling

This paper systematically ablates the pretraining data pipeline—curation, selection, and sampling—and demonstrates that leveraging granular data attributes like quality and domain significantly improves downstream model performance.

Core Problem

Leading language model developers do not disclose their pretraining data construction methods, leaving the community without actionable guidelines on how to curate, select, and sample data effectively.

Why it matters:

Pretraining datasets are the primary driver of recent LM capabilities (e.g., GPT-4, Gemini), yet recipes remain trade secrets
Ineffective data filtering (e.g., aggressive toxicity removal) can inadvertently discard high-quality text, degrading model performance
Standard data-sampling heuristics often balance diverse domains less effectively than systematic approaches

Concrete Example: When filtering for toxicity, a naive classifier might remove news articles about 'sensitive subjects' (e.g., war, protests) because they contain toxic words. This paper shows these documents are often high quality, so removing them degrades performance. Using 'Low Toxicity, High Quality' documents as the target set for data selection retains these valuable documents.

Key Novelty

Systematic Pretraining Pipeline Ablation & Attribute-Aware Construction

Conducts the first large-scale ablation study across the entire pretraining pipeline (curation, selection, sampling) rather than just one component
Analyzes 90+ Common Crawl snapshots to categorize web data by domain, quality, and speech type, revealing that technical domains are scarce while news/blogs dominate
Proposes using these attribute labels to create fine-grained sampling buckets, improving model accuracy over standard source-based sampling

Architecture

The complete pretraining dataset development pipeline

Evaluation Highlights

+1.29 average accuracy improvement on English benchmarks using UniMax sampling compared to preference-based weighting
+1.07 accuracy improvement using fine-grained Quality-based sampling buckets compared to a baseline without attribute information
Prioritizing older documents during deduplication outperforms random selection by +0.51 points

Breakthrough Assessment

8/10

Provides a rare, comprehensive empirical guide to pretraining data construction with actionable recipes, addressing a major transparency gap in the field.

⚙️ Technical Details

Problem Definition

Setting: Self-supervised language model pretraining on large-scale text corpora

Inputs: Raw heterogeneous text sources (Web crawl, code, books, papers)

Outputs: A curated, filtered, and sampled pretraining dataset {x_i} used to minimize autoregressive loss

Pipeline Flow

Data Curation (Deduplication & Quality Filtering)
Data Selection (DSIR)
Data Sampling (Weight assignment)

System Modules

Data Curation

Remove ill-formed, duplicate, and low-quality documents

Model or implementation: KenLM (n-gram model) + MinHash LSH (deduplication)
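The deduplication half of this module can be illustrated with a minimal MinHash sketch. It uses only the Python standard library and estimates Jaccard similarity between word 3-gram sets. The function names, the shingle size, and the number of hash functions are assumptions for illustration, and the LSH banding step that makes this scale to a full crawl is omitted.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams in a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(text, num_perm=64):
    """Keep the minimum hash value of the shingle set under each salted hash."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Two documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates. The threshold itself is a tuning choice not specified here.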

Data Selection

Select a subset of data that matches a high-quality target distribution

Model or implementation: DSIR (Data Selection with Importance Resampling)
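As a rough illustration of how DSIR-style selection can work, the sketch below hashes unigrams and bigrams into buckets, fits a smoothed bag-of-n-grams model to the target and raw sets, scores each raw document by its log importance weight, and then resamples with Gumbel top-k. The bucket count, function names, and smoothing are assumptions. This is a toy under stated assumptions, not the paper's implementation.

```python
import math
import random
import zlib
from collections import Counter

NUM_BUCKETS = 1 << 16  # hypothetical feature dimension


def hashed_counts(text):
    """Count unigrams and bigrams after hashing them into fixed buckets."""
    words = text.lower().split()
    grams = words + [a + " " + b for a, b in zip(words, words[1:])]
    return Counter(zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in grams)


def fit_log_prob(docs):
    """Bag-of-hashed-n-grams model with add-one smoothing over all buckets."""
    totals = Counter()
    for doc in docs:
        totals.update(hashed_counts(doc))
    norm = sum(totals.values()) + NUM_BUCKETS
    return lambda bucket: math.log((totals[bucket] + 1) / norm)


def log_importance_weight(text, target_lp, raw_lp):
    """log p_target(x) - log p_raw(x) under the two bag-of-n-grams models."""
    return sum(c * (target_lp(b) - raw_lp(b)) for b, c in hashed_counts(text).items())


def select(raw_docs, target_docs, k, seed=0):
    """Sample k raw documents without replacement via Gumbel top-k."""
    rng = random.Random(seed)
    target_lp, raw_lp = fit_log_prob(target_docs), fit_log_prob(raw_docs)
    keyed = []
    for doc in raw_docs:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        gumbel = -math.log(-math.log(u))
        keyed.append((log_importance_weight(doc, target_lp, raw_lp) + gumbel, doc))
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in keyed[:k]]
```

Documents that resemble the target set receive higher weights and are therefore more likely to survive resampling, while the Gumbel noise keeps the selection stochastic rather than a hard cutoff.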

Data Sampling

Assign sampling weights to different data sources/buckets for training

Model or implementation: UniMax / Alpha Sampling
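A minimal sketch of UniMax-style weight assignment follows. It assumes the commonly described procedure: visit sources from smallest to largest, give each an equal share of the remaining token budget, and cap that share at max_epochs passes over the source. The function name, the budget, and the token counts in the usage lines are hypothetical.

```python
def unimax_weights(token_counts, budget, max_epochs):
    """Return sampling weights that split a token budget as evenly as the epoch cap allows."""
    remaining = budget
    allocation = {}
    ordered = sorted(token_counts, key=token_counts.get)  # smallest source first
    for i, name in enumerate(ordered):
        fair_share = remaining / (len(ordered) - i)
        cap = max_epochs * token_counts[name]
        allocation[name] = min(fair_share, cap)
        remaining -= allocation[name]
    total = sum(allocation.values())
    return {name: tokens / total for name, tokens in allocation.items()}


# Hypothetical sources (token counts) and training budget.
weights = unimax_weights({"small": 10, "medium": 100, "large": 1000},
                         budget=300, max_epochs=1)
```

With a cap of one epoch, the small source cannot absorb its full fair share, so its unused budget is redistributed to the larger sources.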

Novel Architectural Elements

Integration of attribute classifiers (Quality, Domain, Toxicity) directly into the sampling stage to create fine-grained mixing buckets
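One way to picture attribute-based buckets: each document carries classifier labels, and token counts are aggregated per (source, attribute label) pair instead of per source. The resulting buckets can then be passed to whichever sampling method is in use. The field names and label values below are hypothetical; the source does not specify the label schema.

```python
from collections import defaultdict


def bucket_token_counts(docs, attribute):
    """Aggregate token counts per (source, attribute label) bucket."""
    counts = defaultdict(int)
    for doc in docs:
        counts[(doc["source"], doc[attribute])] += doc["num_tokens"]
    return dict(counts)


# Hypothetical labeled documents.
docs = [
    {"source": "web", "quality": "high", "num_tokens": 100},
    {"source": "web", "quality": "low", "num_tokens": 50},
    {"source": "news", "quality": "high", "num_tokens": 10},
]
buckets = bucket_token_counts(docs, "quality")
```

Splitting a single source such as web crawl into several quality buckets lets the sampler weight high-quality web text differently from low-quality web text.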

Modeling

Base Model: Decoder-only transformer LMs (2B and 8B parameters)

Trainable Parameters: 2B or 8B parameters

Training Data:

English: Web crawl (889B), Misc (109B), News (94B), etc.
Multilingual: Web crawl (540B), Parallel corpora (56B)
Code: The Stack v1.2 (212B)

Key Hyperparameters:

alpha_sampling_value: 0.3 (general) or 1.3 (code)
unimax_epochs: 1 (best for English)
dsir_selection_rate: 95%
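To make the alpha_sampling_value setting concrete, here is a minimal sketch of alpha sampling as defined under Key Terms (sampling probability proportional to dataset size raised to the power alpha). The token counts reuse three of the English figures listed above, in billions. The function name is hypothetical, and the paper's exact formulation may differ.

```python
def alpha_weights(sizes, alpha):
    """Sampling weight of each source is proportional to size ** alpha."""
    scaled = {name: size ** alpha for name, size in sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


# Token counts in billions, taken from the English sources listed above.
english = {"web_crawl": 889, "misc": 109, "news": 94}
smoothed = alpha_weights(english, alpha=0.3)
proportional = alpha_weights(english, alpha=1.0)
```

Under this definition, alpha below 1 shrinks the share of the largest source relative to proportional sampling, alpha equal to 1 reproduces proportional sampling, and alpha above 1 concentrates weight on the largest source.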

Comparison to Prior Work

vs. DoReMi: Finds DoReMi fails to produce competitive weights for 8B models (often skewing to single sources), whereas UniMax/Alpha are robust
vs. RefinedWeb: Extends analysis beyond curation to selection and sampling; analyzes 90+ snapshots vs RefinedWeb's smaller scope
vs. Moore-Lewis [not cited in paper]: Uses DSIR (a modern approximation) and shows source-level application is superior to corpus-level

Limitations

Study limited to specific model scales (2B/8B) and may not fully extrapolate to frontier models (100B+)
DoReMi implementation underperformed significantly, possibly due to proxy model limitations
Did not evaluate synthetic data sources
Does not disclose the exact training data for the attribute classifiers (DeBERTaV3)

Reproducibility

Data sources are publicly described (Common Crawl, The Stack). Specific classifier weights for attributes are not linked. Code for the pipeline is not provided.

📊 Experiments & Results

Evaluation Setup

Zero-shot accuracy on downstream benchmarks after pretraining

Benchmarks:

LM-Evaluation Harness (general English tasks: PIQA, HellaSwag, Winogrande, etc.)
MMLU (Multi-task Language Understanding)
HumanEval (Python Code Generation)
TyDiQA-GoldP (Multilingual Question Answering)

Metrics:

Accuracy (Zero-shot)
Pass@1
Statistical methodology: Not explicitly reported in the paper
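For reference, Pass@1 is the probability that a single generated program passes a problem's unit tests. A widely used way to estimate pass@k from n samples per problem, of which c pass, is 1 - C(n-c, k) / C(n, k). The source does not state how Pass@1 was computed, so the sketch below shows that standard estimator as an assumption and is not the paper's evaluation code.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k given n samples for a problem, c of which pass."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With one sample per problem and k = 1, this reduces to the fraction of problems solved.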

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Curation ablations show that filtering and deduplication strategies significantly impact performance.
LM-Eval (Avg)	Accuracy	57.18	59.50	+2.32
LM-Eval (Avg)	Accuracy	59.96	60.47	+0.51
Sampling method comparisons reveal UniMax as superior for English, while Alpha sampling favors Code.
LM-Eval (Avg)	Accuracy	65.85	67.14	+1.29
HumanEval	Pass@1	20.12	20.73	+0.61
Attribute-based interventions demonstrate that using fine-grained data attributes improves sampling and selection.
LM-Eval (Avg)	Accuracy	56.81	57.88	+1.07
LM-Eval (Avg)	Accuracy	54.90	55.63	+0.73

Experiment Figures

Distribution of document types in web crawl (Common Crawl)

Distribution of content domains in web crawl

Main Takeaways

Deduplication should prioritize older documents; preserving older data yields better downstream accuracy than favoring recent data
UniMax is the most robust sampling method for English and Multilingual data, preventing overfitting to high-resource domains
Web crawl is dominated by Homepages, News, and Blogs; technical domains (Law, Science) are rare and must be upsampled or augmented
Data attributes (Quality, Domain) allow for finer-grained sampling buckets that significantly outperform broad source-level sampling
Learned sampling methods like DoReMi can be unstable and perform worse than heuristics like UniMax or Alpha sampling in practice
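The first takeaway can be sketched as a tie-breaking rule inside deduplication: once duplicates are grouped into clusters, keep the copy with the earliest crawl date. The field names (cluster, crawl_date, text) are hypothetical, and how clusters are formed is left to the deduplication step.

```python
from datetime import date


def keep_oldest(docs):
    """From each duplicate cluster, keep the document with the earliest crawl date."""
    oldest = {}
    for doc in docs:
        best = oldest.get(doc["cluster"])
        if best is None or doc["crawl_date"] < best["crawl_date"]:
            oldest[doc["cluster"]] = doc
    return list(oldest.values())


# Hypothetical duplicate clusters spanning several crawl snapshots.
docs = [
    {"cluster": 1, "crawl_date": date(2021, 5, 1), "text": "copy from 2021"},
    {"cluster": 1, "crawl_date": date(2016, 2, 1), "text": "copy from 2016"},
    {"cluster": 2, "crawl_date": date(2019, 8, 1), "text": "unique page"},
]
kept = keep_oldest(docs)
```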

📚 Prerequisite Knowledge

Prerequisites

Understanding of the pretraining pipeline (deduplication, tokenization, training)
Familiarity with n-gram language models for filtering
Basic knowledge of sampling strategies (alpha sampling)

Key Terms

Common Crawl (CC): A massive, open repository of web crawl data, commonly used as a primary source for LLM pretraining

DSIR: Data Selection with Importance Resampling, a method that selects data from a raw corpus so that the selected subset matches the distribution of a high-quality target corpus

KenLM: A library for efficient n-gram language modeling, often used to filter low-quality text based on perplexity

UniMax: A sampling strategy that caps the number of epochs (repetitions) for any data source to ensure fair representation of low-resource domains

Alpha sampling: A heuristic sampling method where the probability of sampling a dataset is proportional to its size raised to the power of alpha (values of alpha below 1 smooth the distribution toward uniform)

DoReMi: Domain Reweighting with Minimax Optimization—a method that uses a proxy model to learn optimal data sampling weights

Perplexity: A measurement of how well a probability model predicts a sample; high perplexity in filtering often indicates low-quality or unnatural text

DeBERTaV3: A transformer-based model used here as a classifier to label data attributes (toxicity, quality, domain) across the corpus
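The perplexity definition above can be made concrete with a toy example. The sketch below uses an add-one-smoothed unigram model as a stand-in for a KenLM n-gram model, computes perplexity as the exponential of the average negative log-probability per word, and flags text above a threshold. The training sentences, function names, and any threshold are hypothetical.

```python
import math
from collections import Counter


def train_unigram(corpus):
    """Fit an add-one-smoothed unigram model and return its probability function."""
    counts = Counter(word for line in corpus for word in line.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen words
    return lambda word: (counts[word] + 1) / (total + vocab_size)


def perplexity(text, prob):
    """exp of the average negative log-probability per word."""
    words = text.lower().split()
    if not words:
        return float("inf")
    log_prob = sum(math.log(prob(word)) for word in words)
    return math.exp(-log_prob / len(words))


def is_low_quality(text, prob, threshold):
    """Flag text whose perplexity exceeds a chosen threshold."""
    return perplexity(text, prob) > threshold
```

Text resembling the reference corpus scores a lower perplexity than text made of unseen tokens, which is what makes perplexity usable as a filtering signal.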