
The Impact of Post-training on Data Contamination

Muhammed Yusuf Kocyigit, Caglar Yildirim
Boston University, Northeastern University
arXiv (2026)
Pretraining, RL, Reasoning, Benchmark

📝 Paper Summary

Data Contamination, LLM Evaluation, Model Memorization vs Generalization
Post-training revives dormant data contamination from pre-training; supervised fine-tuning causes simple memorization, while reinforcement learning translates leaked data into broader, more generalizable capabilities.
Core Problem
Evaluations assume strict separation between training and test data, but recent studies reveal pervasive pre-training data contamination whose downstream impact after modern post-training remains poorly understood.
Why it matters:
  • Most contamination analyses examine models immediately after pre-training, even though deployed models go on to undergo SFT (Supervised Fine-Tuning) or RL (Reinforcement Learning)
  • Post-training paradigms inject strong task-specific signals that can materially reshape representations, potentially amplifying, exploiting, or erasing dormant pre-training leakage
  • Without life-cycle evaluations, researchers risk misrepresenting the true real-world impact of contamination and deploying ineffective mitigation strategies
Concrete Example: If a model sees GSM8K (a math benchmark) test questions during pre-training, continued pre-training on clean data masks the leakage. When the model later undergoes SFT on GSM8K training data, however, it 'remembers' the leaked test set, so its evaluation scores rise above those of a clean model without any real improvement in underlying math reasoning.
Key Novelty
End-to-End Life-Cycle Contamination Audit
  • Injects benchmark test sets into early pre-training and continues training on a large clean corpus to accurately mimic real-world latent contamination
  • Applies clean SFT and GRPO (Group Relative Policy Optimization) to contaminated checkpoints to observe how different optimization objectives interact with leaked data
  • Compares performance gains on contaminated benchmarks versus uncontaminated counterparts to distinguish pure memorization from genuine generalization
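As background for the SFT-versus-GRPO comparison above: GRPO scores each sampled answer relative to the other samples drawn for the same prompt, rather than imitating a reference answer token by token. The sketch below shows that group-relative advantage in its commonly described form. The function name, the use of the population standard deviation, and the `eps` term are illustrative assumptions, not details taken from this paper's training setup.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each completion's reward against its own sampling group.

    `rewards` holds one scalar reward per completion sampled for a single
    prompt. `eps` (an assumed stabilizer) avoids division by zero when
    every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one math prompt,
# with reward 1.0 when the final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers receive a positive advantage and incorrect ones a negative advantage. A group whose answers are all correct (or all wrong) yields zero advantage, so it contributes no learning signal.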
Evaluation Highlights
  • Post-training resurrects hidden contamination signals, inflating contaminated benchmark scores by up to 4 points compared to clean baselines
  • SFT inflates scores only on contaminated tasks such as GSM8K, which points to purely local memorization
  • GRPO improves performance on both contaminated tasks and uncontaminated ones (e.g., GSMPlus), suggesting that it translates leaked data into more generalizable capabilities
  • As model scale increases (up to 4B parameters), SFT models show greater relative over-estimation, whereas larger GRPO models appear to use their added capacity to reduce it
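The highlights above compare a contaminated model against a clean baseline, in absolute points and in relative terms. A minimal sketch of that arithmetic follows, assuming over-estimation is measured as the score gap between the two models on the same benchmark. The function name and every number are hypothetical placeholders, not results from the paper.

```python
def overestimation(contaminated_score, clean_score):
    """Return (absolute, relative) over-estimation for one benchmark.

    absolute: score gap in points between the contaminated and clean model.
    relative: that gap as a fraction of the clean model's score.
    """
    absolute = contaminated_score - clean_score
    relative = absolute / clean_score if clean_score else float("nan")
    return absolute, relative


# Placeholder accuracies (percent) for illustration only.
scores = {
    "leaked benchmark": {"contaminated": 52.0, "clean": 50.0},
    "unleaked counterpart": {"contaminated": 30.0, "clean": 30.0},
}

gaps = {
    name: overestimation(s["contaminated"], s["clean"])
    for name, s in scores.items()
}
```

Under this reading, a gap that appears only on the leaked benchmark is consistent with memorization, while gains on both the leaked benchmark and its unleaked counterpart are consistent with generalization.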
Breakthrough Assessment
8/10
Provides critical empirical evidence that contamination must be evaluated post-training, successfully isolating the divergent effects of SFT and RL on memorization and generalization.