Evaluation Setup
Zero-shot or few-shot evaluation on standard NLP benchmarks after pre-training from scratch
Benchmarks:
- SST-2 (Sentiment Analysis)
- MMLU (Multi-task NLU)
- CNN/DailyMail (Summarization)
- SQuAD v1 (Reading Comprehension / QA)
Metrics:
- Accuracy
- ROUGE-1/2/L
- UniEval (Coherence, Consistency, Fluency, Relevance)
- F1 Score
- Statistical methodology: Not explicitly reported in the paper
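To make the SQuAD-style metric concrete, here is a minimal sketch of token-level F1 between a predicted answer and a reference. This is an illustrative simplification (whitespace tokenization, lowercasing only); the official SQuAD script additionally strips punctuation and articles.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 in the spirit of SQuAD evaluation (simplified:
    lowercase + whitespace tokenization only)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the cat sat", "the cat")` gives precision 2/3 and recall 1, hence F1 = 0.8.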
Key Results
Comparison of contamination types (Original vs. Text-only vs. Ground-Truth) shows that ground-truth contamination generally provides larger gains.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| CNN/DailyMail | ROUGE-L | 16.94 | 23.99 | +7.05 |
| SQuAD | F1 | 18.39 | 47.24 | +28.85 |
| SST-2 | Accuracy | 50.92 | 59.98 | +9.06 |
| MMLU | Accuracy | 25.96 | 26.39 | +0.43 |
Scaling experiments with GPT-2-large confirm that the trends hold for larger models.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| CNN/DailyMail | ROUGE-L | 18.89 | 28.53 | +9.64 |
| MMLU | Accuracy | 26.96 | 28.91 | +1.95 |
Main Takeaways
- Ground-truth contamination (prompts+answers) significantly boosts performance on generation tasks (SQuAD, CNN/DM) compared to text-only contamination.
- Repetition of contamination has an inverted-U effect: moderate repetition (~5-10x) improves performance, but excessive repetition (20x+) degrades it below baseline.
- Current n-gram based contamination definitions (PaLM, Llama 2) are insufficient; filtering data based on them does not consistently impact performance, suggesting high false positives.
- Fluency (UniEval) correlates more with training data size/repetitions than with contamination type, unlike correctness metrics (ROUGE, F1).
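The n-gram based contamination definitions mentioned above can be sketched as follows. This is a simplified illustration in the spirit of the PaLM-style rule (flag an eval example if a large fraction of its n-grams appear in the training data); the n-gram size, threshold, and whitespace tokenization here are illustrative assumptions, not the exact definitions from those papers.

```python
def ngrams(tokens, n=8):
    """All distinct n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(example: str, train_ngrams: set, n=8, threshold=0.7) -> bool:
    """Flag an eval example if >= threshold of its n-grams occur in the
    training data. Simplified sketch; real pipelines differ in
    tokenization, n, and threshold."""
    example_ngrams = ngrams(example.split(), n)
    if not example_ngrams:  # shorter than n tokens: nothing to match
        return False
    overlap = len(example_ngrams & train_ngrams) / len(example_ngrams)
    return overlap >= threshold
```

The takeaway about high false positives corresponds to this check firing on examples whose surface n-grams overlap the corpus even though the answer itself was never seen.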