Evaluation Setup
Few-shot (4-shot) closed-book QA, measuring Exact Match (EM) accuracy as a function of the number of pre-training documents relevant to each question
Benchmarks:
- TriviaQA (Open-domain Factoid QA)
- Natural Questions (Open-domain Factoid QA)
Metrics:
- Exact Match (EM) Accuracy
- Relevant Document Count (Independent Variable)
- Statistical methodology: Not explicitly reported in the paper
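As an illustration of how the two metrics above fit together, the following is a minimal sketch of EM scoring grouped by relevant document count. The normalization steps (lowercasing, stripping punctuation and English articles) follow a common QA convention and are an assumption here, not something this summary confirms for the paper. The names `normalize_answer`, `exact_match`, and `em_by_count_bucket` are hypothetical helpers.

```python
import re
import string
from collections import defaultdict


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumed normalization; the paper's exact rules are not given in this summary.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_by_count_bucket(examples) -> dict:
    """Group questions by order of magnitude of relevant document count.

    `examples` is an iterable of (prediction, gold_answers, relevant_doc_count).
    Bucket k holds counts in [10^k, 10^(k+1)); a count of 0 goes to bucket -1.
    Returns {bucket: EM accuracy as a fraction}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for prediction, gold_answers, count in examples:
        # The number of digits minus one is floor(log10(count)) for positive integers.
        bucket = len(str(count)) - 1 if count > 0 else -1
        totals[bucket] += 1
        hits[bucket] += int(exact_match(prediction, gold_answers))
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}
```

Bucketing by powers of ten keeps the independent variable on a logarithmic scale, which is the scale on which the reported relationship is described.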
Key Results
Correlation analysis shows a strong log-linear relationship between the number of relevant documents in pre-training data and QA accuracy across multiple model families.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| TriviaQA | EM Accuracy | 25.0 | 55.0 | +30.0 |
| Natural Questions | Model Size (parameters) | 10^11 | 10^18 | +9.999999 × 10^17 |
| TriviaQA | Accuracy Drop | 0.14 | 0.02 | -0.12 |
| Natural Questions | EM Accuracy | 0.05 | 0.28 | +0.23 |
Main Takeaways
- Strong log-linear relationship: QA accuracy rises roughly linearly with the logarithm of the number of pre-training documents relevant to a question.
- Causal link confirmed: Removing relevant documents during training directly degrades performance on associated questions.
- Scaling is inefficient for the long tail: Learning rare facts through scaling alone would require prohibitively large models (e.g., 10^18 parameters).
- Retrieval is the solution: Retrieval-augmented models largely mitigate the dependence on pre-training frequency, maintaining high accuracy even for rare facts.
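
To make the log-linear takeaway concrete, here is a minimal sketch that fits accuracy ≈ intercept + slope · log10(count) by ordinary least squares and inverts the fit to estimate how many relevant documents a target accuracy would need. The data points are synthetic, chosen to lie exactly on a line for illustration; they are not results from the paper. The function names `fit_log_linear` and `count_needed` are hypothetical.

```python
import math


def fit_log_linear(counts, accuracies):
    """Least-squares fit of accuracy = intercept + slope * log10(count).

    Returns (slope, intercept). Counts must be positive.
    """
    xs = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def count_needed(target_accuracy, slope, intercept):
    """Invert the fitted line: document count at which it reaches target_accuracy."""
    return 10 ** ((target_accuracy - intercept) / slope)


# Synthetic illustration only: accuracy gains 0.10 per tenfold increase in count.
counts = [1, 10, 100, 1000, 10000]
accuracies = [0.05, 0.15, 0.25, 0.35, 0.45]

slope, intercept = fit_log_linear(counts, accuracies)
docs_for_quarter = count_needed(0.25, slope, intercept)
```

Under a fit of this shape, each fixed gain in accuracy costs a multiplicative increase in relevant documents, which is why rare facts are hard to learn from pre-training frequency alone.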