
Investigating Data Contamination for Pre-training Language Models

M Jiang, KZ Liu, M Zhong, R Schaeffer, S Ouyang…
University of Illinois Urbana-Champaign, Stanford University
arXiv, January 2024
Tags: Pre-training, Benchmark

📝 Paper Summary

LLM pre-training evaluation methodology
By pre-training GPT-2 models from scratch with controlled data leakage, this paper shows that ground-truth contamination significantly inflates benchmark performance and that the effect of repeated leakage follows a surprising U-shaped curve.
Core Problem
Current understanding of data contamination relies on post-hoc n-gram filtering of evaluation sets, which fails to capture the actual impact of contamination during pre-training, especially regarding ground-truth leakage.
Why it matters:
  • Capabilities of LLMs may be overestimated if performance gains are driven by memorizing leaked evaluation data rather than genuine generalization
  • Existing n-gram definitions (like those in PaLM or Llama 2) produce high rates of false positives and false negatives, making claims of contamination-free evaluation unreliable
  • The impact of 'ground truth' leakage (prompts + answers) versus simple text leakage is largely unexplored
Concrete Example: A model might memorize the specific prompt and answer for a SQuAD question seen during pre-training, artificially boosting its F1 score on that benchmark without actually understanding the passage.
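The n-gram definitions criticized above reduce to a simple overlap check between an evaluation example and pre-training documents. A minimal sketch follows; the function names, the whitespace tokenization, and the `n=8` / `threshold=0.7` values are illustrative assumptions, not the exact rules used by PaLM or Llama 2:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(eval_text, pretrain_text, n=8, threshold=0.7):
    """Flag an eval example as contaminated if a large fraction of its
    n-grams also appear in a pre-training document. Whitespace splitting
    is used for simplicity; real pipelines use the model tokenizer."""
    eval_ngrams = ngrams(eval_text.split(), n)
    if not eval_ngrams:
        return False
    train_ngrams = ngrams(pretrain_text.split(), n)
    overlap = len(eval_ngrams & train_ngrams) / len(eval_ngrams)
    return overlap >= threshold
```

Because the check is purely lexical, a paraphrased leak slips through (false negative) while a common boilerplate phrase can trip it (false positive), which is exactly the weakness the paper probes.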
Key Novelty
Controlled Pre-training with Intentional Contamination
  • Pre-trains GPT-2 models from scratch on a clean corpus with deliberately injected evaluation data to measure the exact causal impact of contamination
  • Distinguishes between 'text contamination' (input text only) and 'ground-truth contamination' (input + prompt + answer) to isolate the effect of label leakage
  • Investigates the 'contamination factor' (repetition count), discovering a non-linear relationship between leakage frequency and model performance
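The injection setup in the bullets above can be sketched in a few lines. This is a hypothetical reconstruction, not the paper's actual pipeline: the field names, the prompt template, and the `inject` helper are all assumptions made for illustration.

```python
def text_contamination(example):
    """'Text contamination': inject only the raw evaluation input."""
    return example["input"]

def ground_truth_contamination(
    example,
    template="{input}\nQuestion: {prompt}\nAnswer: {answer}",
):
    """'Ground-truth contamination': also inject the task prompt and
    the gold answer, leaking the label into pre-training."""
    return template.format(**example)

def inject(corpus, examples, factor=1, mode=text_contamination):
    """Append each contaminated example `factor` times -- `factor` is
    the paper's 'contamination factor' (repetition count)."""
    return corpus + [mode(ex) for ex in examples for _ in range(factor)]
```

Varying `factor` while holding everything else fixed is what exposes the non-linear (U-shaped) relationship between repetition count and downstream scores.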
Evaluation Highlights
  • Ground-truth contamination lifts GPT-2-small's ROUGE-L on CNN/DailyMail to 23.99, versus 16.94 for the original uncontaminated model
  • Repeated contamination shows a U-shaped trend: performance on SST-2 drops from ~65% at 5 repetitions to ~50% at 20 repetitions, contradicting the assumption that more leakage always equals better scores
  • Standard n-gram filtering removes up to 30% of data labeled 'contaminated' without significantly changing performance, implying many flagged examples are false positives and the definitions are inaccurate
Breakthrough Assessment
7/10
Provides critical empirical evidence challenging standard contamination assumptions (U-shaped repetition curve) and highlights the severe impact of ground-truth leakage, though limited by smaller model scale (GPT-2).