Evaluation Setup
Zero-shot or few-shot evaluation on standard NLP benchmarks after pre-training from scratch
Benchmarks:
- SST-2 (Sentiment Analysis)
- MMLU (Multi-task NLU)
- CNN/DailyMail (Summarization)
- SQuAD v1 (Reading Comprehension / QA)
Metrics:
- Accuracy
- ROUGE-1/2/L
- UniEval (Coherence, Consistency, Fluency, Relevance)
- F1 Score
- Statistical methodology: Not explicitly reported in the paper
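To make the SQuAD-style metric concrete, here is a minimal sketch of token-level F1 between a predicted answer and a reference. This is an illustrative simplification (whitespace tokenization, lowercasing only); the official SQuAD script additionally strips punctuation and articles.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 in the spirit of SQuAD evaluation (simplified:
    lowercase + whitespace tokenization only)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the cat sat", "the cat")` gives precision 2/3 and recall 1, hence F1 = 0.8.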
Key Results
Comparison of contamination types (Original vs. Text-only vs. Ground-Truth) shows that ground-truth contamination generally provides larger gains.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| CNN/DailyMail | ROUGE-L | 16.94 | 23.99 | +7.05 |
| SQuAD | F1 | 18.39 | 47.24 | +28.85 |
| SST-2 | Accuracy | 50.92 | 59.98 | +9.06 |
| MMLU | Accuracy | 25.96 | 26.39 | +0.43 |
Scaling experiments with GPT-2-large confirm that the trends hold for larger models.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| CNN/DailyMail | ROUGE-L | 18.89 | 28.53 | +9.64 |
| MMLU | Accuracy | 26.96 | 28.91 | +1.95 |
Main Takeaways
- Ground-truth contamination (prompts+answers) significantly boosts performance on generation tasks (SQuAD, CNN/DM) compared to text-only contamination.
- Repetition of contamination has an inverted-U effect: moderate repetition (~5-10x) improves performance, but excessive repetition (20x+) degrades it below baseline.
- Current n-gram based contamination definitions (PaLM, Llama 2) are insufficient; filtering data based on them does not consistently impact performance, suggesting high false positives.
- Fluency (UniEval) correlates more with training data size/repetitions than with contamination type, unlike correctness metrics (ROUGE, F1).
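The n-gram based contamination definitions mentioned above can be sketched as follows. This is a simplified illustration in the spirit of the PaLM-style rule (flag an eval example if a large fraction of its n-grams appear in the training data); the n-gram size, threshold, and whitespace tokenization here are illustrative assumptions, not the exact definitions from those papers.

```python
def ngrams(tokens, n=8):
    """All distinct n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(example: str, train_ngrams: set, n=8, threshold=0.7) -> bool:
    """Flag an eval example if >= threshold of its n-grams occur in the
    training data. Simplified sketch; real pipelines differ in
    tokenization, n, and threshold."""
    example_ngrams = ngrams(example.split(), n)
    if not example_ngrams:  # shorter than n tokens: nothing to match
        return False
    overlap = len(example_ngrams & train_ngrams) / len(example_ngrams)
    return overlap >= threshold
```

The takeaway about high false positives corresponds to this check firing on examples whose surface n-grams overlap the corpus even though the answer itself was never seen.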