The Impact of Post-training on Data Contamination

📝 Paper Summary

Data Contamination, LLM Evaluation, Model Memorization vs. Generalization

Post-training revives dormant data contamination from pre-training; supervised fine-tuning causes simple memorization, while reinforcement learning translates leaked data into broader, more generalizable capabilities.

Core Problem

Evaluations assume strict separation between training and test data, but recent studies reveal pervasive pre-training data contamination whose downstream impact after modern post-training remains poorly understood.

Why it matters:

Most contamination analyses focus exclusively on models immediately after pre-training, ignoring that deployed models undergo SFT (Supervised Fine-Tuning) or RL (Reinforcement Learning)
Post-training paradigms inject strong task-specific signals that can materially reshape representations, potentially amplifying, exploiting, or erasing dormant pre-training leakage
Without life-cycle evaluations, researchers risk misrepresenting the true real-world impact of contamination and deploying ineffective mitigation strategies

Concrete Example: If a model is exposed to GSM8K (a math benchmark) test questions during pre-training, continued pre-training on clean data masks this leakage. However, when the model later undergoes SFT on GSM8K training data, it 'remembers' the leaked test set. Its evaluation scores are then inflated relative to a clean model, even though its underlying math reasoning has not improved.

Key Novelty

End-to-End Life-Cycle Contamination Audit

Injects benchmark test sets into early pre-training and continues training on a large clean corpus to accurately mimic real-world latent contamination
Applies clean SFT and GRPO (Group Relative Policy Optimization) to contaminated checkpoints to observe how different optimization objectives interact with leaked data
Compares performance gains on contaminated benchmarks versus uncontaminated counterparts to distinguish pure memorization from genuine generalization

Evaluation Highlights

Post-training resurrects hidden contamination signals, inflating contaminated benchmark scores by up to 4 points compared to clean baselines
SFT (Supervised Fine-Tuning) inflates scores strictly on contaminated tasks like GSM8K, exposing purely local memorization
GRPO (Group Relative Policy Optimization) improves performance on both contaminated tasks and uncontaminated tasks (e.g., GSMPlus), indicating better translation of leaked data into generalizable capabilities
As model scale increases (up to 4B parameters), SFT models exhibit greater relative over-estimation, whereas larger GRPO models use their added capacity to generalize from the leaked data, which dilutes over-estimation

Breakthrough Assessment

8/10

Provides critical empirical evidence that contamination must be evaluated post-training, successfully isolating the divergent effects of SFT and RL on memorization and generalization.

⚙️ Technical Details

Problem Definition

Setting: Evaluating the impact of data contamination across the LLM training life-cycle, comparing clean vs. contaminated models after both pre-training and post-training stages

Inputs: Clean vs. contaminated pre-trained checkpoints, post-training datasets (SFT/RL)

Outputs: Performance gap on contaminated vs. uncontaminated benchmarks to measure over-estimation
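
The output quantity above can be read as a difference of differences: the gain of the contaminated model over the clean one on the leaked benchmark, compared with the same gain on an uncontaminated counterpart. The sketch below illustrates that reading only. The `Scores` class, the `contamination_gap` function, and the `over_estimation` formula are assumptions made for illustration, not the paper's definitions, and the numbers are placeholders rather than reported results.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Benchmark scores (in percent) for one model."""
    contaminated: float    # e.g. GSM8K, whose test set was leaked in pre-training
    uncontaminated: float  # e.g. GSMPlus, which was never injected


def contamination_gap(contaminated_model: Scores, clean_model: Scores) -> dict:
    """Gap (contaminated model minus clean model) on each benchmark type."""
    gap_leaked = contaminated_model.contaminated - clean_model.contaminated
    gap_held_out = contaminated_model.uncontaminated - clean_model.uncontaminated
    return {
        "gap_on_contaminated": gap_leaked,
        "gap_on_uncontaminated": gap_held_out,
        # Assumption: the share of the gain that does not transfer to the
        # uncontaminated benchmark is treated as over-estimation.
        "over_estimation": gap_leaked - gap_held_out,
    }


# Placeholder numbers for illustration only; these are not results from the paper.
result = contamination_gap(Scores(52.0, 40.0), Scores(49.0, 39.5))
print(result)
```

Under this reading, a large gap on the contaminated benchmark with a near-zero gap on the uncontaminated one points to memorization, while gaps that rise together point to transfer.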

Pipeline Flow

Extended Pre-training (Contaminated/Clean) → Post-training (SFT/GRPO) → Evaluation

System Modules

Extended Pre-training (Training)

Train base models on 25B tokens with or without test set contamination injected in the early stages

Model or implementation: Qwen2.5 (0.5B, 1.5B) or Gemma3 (1B, 4B)

Post-training (Training)

Apply task-specific fine-tuning or alignment using strictly clean training splits

Model or implementation: SFT or GRPO applied to pre-trained checkpoints
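
For readers unfamiliar with GRPO, the core idea is that each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below shows that normalization step together with a toy rule-based reward. Both functions are hypothetical illustrations: the summary does not specify the paper's reward function or training code, and the sketch omits the policy-gradient update, clipping, and any KL term.

```python
import re
from statistics import mean, pstdev


def rule_based_reward(response: str, gold_answer: str) -> float:
    """Hypothetical reward: 1.0 if the last number in the response equals the gold answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return 1.0 if numbers and float(numbers[-1]) == float(gold_answer) else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response in the group gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt whose gold answer is 18 (toy data).
responses = ["... so the answer is 18", "I think it is 20", "The total is 18.", "No idea"]
rewards = [rule_based_reward(r, "18") for r in responses]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Correct responses receive positive advantages and incorrect ones negative advantages, so the signal depends on the model's own samples rather than on imitating a fixed reference answer as in SFT.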

Evaluation

Measure model performance to calculate the contamination-induced generalization gap

Model or implementation: LM Evaluation Harness + math-verify library

Modeling

Base Model: Qwen2.5 (0.5B, 1.5B) and Gemma3 (1B, 4B)

Training Method: Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO)

Training Data:

25B token pre-training mixture: FineWeb-Edu (web text), CodeParrots (code), OpenMath-Instruct (math)
Contamination: 5 copies of GSM8K and MBPP test sets injected into first 2B tokens
Clean SFT/GRPO performed on corresponding GSM8K and MBPP training sets
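
The contamination recipe above can be sketched as mixing repeated copies of the leaked examples into the early part of the data stream and leaving the rest clean. This is a minimal sketch under stated assumptions: the function name `inject_contamination`, the uniform shuffling inside the window, and the whitespace token count are all hypothetical stand-ins, since the summary does not describe how the paper positions the copies or counts tokens.

```python
import random


def inject_contamination(clean_docs, leaked_docs, copies=5, window_tokens=2_000_000_000,
                         count_tokens=lambda doc: len(doc.split()), seed=0):
    """Shuffle `copies` of each leaked doc into the clean prefix covering ~`window_tokens`."""
    rng = random.Random(seed)
    # Find how many clean documents fall inside the early-training window.
    seen, cutoff = 0, 0
    for cutoff, doc in enumerate(clean_docs, start=1):
        seen += count_tokens(doc)
        if seen >= window_tokens:
            break
    # Assumption: leaked copies are spread uniformly at random within the window.
    early = list(clean_docs[:cutoff]) + list(leaked_docs) * copies
    rng.shuffle(early)
    # Everything after the window stays clean, mimicking continued clean pre-training.
    return early + list(clean_docs[cutoff:])


# Toy example with a 30-token window instead of 2B tokens.
clean = [f"clean document {i}" for i in range(100)]  # 3 whitespace tokens each
leaked = ["leaked test question A", "leaked test question B"]
stream = inject_contamination(clean, leaked, copies=5, window_tokens=30)
```

In the toy run, all ten leaked copies land among the first twenty documents, and the remaining ninety documents are untouched clean data in their original order.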

Compute: Not reported in the paper

Comparison to Prior Work

vs. Pre-training studies: Follows contamination through modern post-training (SFT/RL) rather than evaluating models only immediately after pre-training
vs. Post-training memorization metrics: Scales analysis to complex generative reasoning tasks (math, coding) and larger models (up to 4B parameters) using RL/GRPO, rather than small BERT models on classification SFT

Limitations

Restricted to relatively small model sizes (up to 4B parameters) and two open-weights families
Simulates one specific contamination scenario (5 copies injected early in pre-training), which may not reflect real-world multi-pass or paraphrased leakage
Uses synthetic rule-based rewards for RL; human-annotated preference signals might yield different alignment behaviors

Reproducibility

The paper relies on public datasets (FineWeb-Edu, CodeParrots, OpenMath-Instruct, GSM8K, MBPP, GSMPlus, HumanEval) and open-weights models (Qwen2.5, Gemma3). However, code repositories, exact training hyperparameters, and final model weights are not provided in the text. Evaluation relies on the standard LM Evaluation Harness.

📊 Experiments & Results

Evaluation Setup

Comparing performance gaps between clean and contaminated models on math and coding benchmarks across different training life-cycle stages

Benchmarks:

GSM8K (Contaminated math reasoning benchmark)
MBPP (Contaminated Python coding benchmark)
GSMPlus (Uncontaminated math reasoning benchmark)
HumanEval (Uncontaminated Python coding benchmark)

Metrics:

Accuracy (Math)
Pass@1 (Coding)
Statistical methodology: Standard error propagation assuming independent estimates for the (Contaminated–Clean) difference, yielding 95% confidence intervals
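
The statistical methodology above corresponds to the textbook rule for the difference of two independent estimates: the standard errors add in quadrature, and a 95% interval spans roughly 1.96 standard errors on each side. The sketch below shows that calculation. The function name `difference_with_ci` and the input values are illustrative assumptions; in practice the per-model standard errors would come from the evaluation tool's output.

```python
import math


def difference_with_ci(score_contaminated: float, se_contaminated: float,
                       score_clean: float, se_clean: float, z: float = 1.96):
    """Return the (contaminated - clean) gap and its confidence interval.

    Assumes the two score estimates are independent, so their variances add.
    z = 1.96 gives an approximate 95% interval under a normal approximation.
    """
    gap = score_contaminated - score_clean
    se_diff = math.sqrt(se_contaminated ** 2 + se_clean ** 2)
    return gap, (gap - z * se_diff, gap + z * se_diff)


# Placeholder scores and standard errors for illustration; not results from the paper.
gap, (lo, hi) = difference_with_ci(0.52, 0.02, 0.49, 0.02)
print(f"gap={gap:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

A gap whose interval includes zero cannot be distinguished from no contamination effect at this confidence level, which matters when the reported gaps are only a few points.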

Experiment Figures

Contaminated vs. clean model performance on GSM8K during the pre-training process for Qwen2.5-1.5B

Performance movement mapped across contaminated vs. uncontaminated benchmarks comparing Base, SFT, and GRPO models

The impact of model scale on the contamination-gap difference across different training recipes (Base, SFT, GRPO)

Main Takeaways

Continued pre-training on clean data successfully masks the advantage of a contaminated model, driving the apparent performance gap close to zero (confirming prior work).
Post-training resurrects hidden contamination, yielding a performance gap of more than 2 points (up to 4 points) in favor of the contaminated model.
SFT (Supervised Fine-Tuning) uncovers pre-training contamination more strongly than GRPO (Group Relative Policy Optimization) for most models, but its gains are purely local to the contaminated benchmark, signifying pure over-estimation.
GRPO improves performance on both contaminated and uncontaminated benchmarks, suggesting it successfully extracts generalizable reasoning patterns from the leaked data.
As model scale increases, SFT models suffer from progressively larger over-estimation, while larger GRPO models convert leakage into broader generalization, thereby diluting the relative over-estimation.

📚 Prerequisite Knowledge

Prerequisites

Familiarity with LLM pre-training and post-training paradigms
Understanding of data contamination and benchmark leakage

Key Terms

SFT: Supervised Fine-Tuning—training a model on labeled instruction-response pairs to teach it specific formats or tasks

RLHF: Reinforcement Learning from Human Feedback—a technique to align models using human preferences

GRPO: Group Relative Policy Optimization—a reinforcement learning method that optimizes policies using relative advantages within a group of sampled responses

Data contamination: Direct or near-duplicate overlap between benchmark evaluation examples and the corpora used during model training

GSM8K: A popular dataset of grade-school math word problems used to evaluate LLM reasoning

MBPP: Mostly Basic Python Problems—a benchmark for evaluating basic Python coding capabilities

GSMPlus: An uncontaminated math benchmark created from GSM8K via adversarial edits, used to measure true generalization

HumanEval: A high-quality Python coding benchmark used as an uncontaminated counterpart to MBPP

Base model: An LLM that has only undergone pre-training on raw text, without task-specific fine-tuning or alignment

Over-estimation: When a model scores higher on a benchmark due to memorizing leaked test data rather than possessing the underlying capability