Evaluation Setup
Inject synthetic facts into the pretraining stream and monitor the log-probability evolution of target spans
Benchmarks:
- Fictional Knowledge Probes (Cloze-style completion) [New]
Metrics:
- Log Probability (of target span)
- Effectivity (immediate learning magnitude)
- Retainability (fraction of learning retained over time)
- Statistical methodology: IQR-based outlier detection (factor 1.5) applied to metric distributions
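The metrics above can be sketched in code. The exact formulas are not spelled out in these notes, so the definitions below (span log-prob as a sum of token log-probs, Effectivity as the immediate log-prob jump after injection, Retainability as the retained fraction of that jump) are illustrative assumptions, not the paper's reference implementation:

```python
def span_log_prob(token_logprobs):
    """Log-probability of a target span = sum of its per-token log-probs."""
    return sum(token_logprobs)

def effectivity(logprob_before, logprob_after_injection):
    """Immediate learning magnitude: log-prob jump right after the fact is seen."""
    return logprob_after_injection - logprob_before

def retainability(logprob_before, logprob_after_injection, logprob_at_t):
    """Fraction of the immediate gain still present t steps after injection."""
    gain = logprob_after_injection - logprob_before
    return (logprob_at_t - logprob_before) / gain if gain else float("nan")

def iqr_filter(values, factor=1.5):
    """Drop outliers outside [Q1 - factor*IQR, Q3 + factor*IQR].

    Quartiles here are crude index-based picks (no interpolation), which is
    enough for a sketch.
    """
    xs = sorted(values)
    n = len(xs)
    q1, q3 = xs[n // 4], xs[(3 * n) // 4]
    lo, hi = q1 - factor * (q3 - q1), q3 + factor * (q3 - q1)
    return [v for v in values if lo <= v <= hi]
```

For example, a span whose log-prob moves from -10 to -6 on injection and sits at -8 after t steps has an Effectivity of 4 nats and a Retainability of 0.5.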
Key Results

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Fictional Knowledge | Effectivity (vs. pretraining stage) | Qualitatively similar to late stage | Qualitatively similar to late stage | Insignificant difference |
| Fictional Knowledge | Effectivity (vs. model size) | Lower magnitude | Higher magnitude (OLMo-7B) | Positive |
| Fictional Knowledge | Retainability trend | N/A | Power-law fit | N/A |
| Fictional Knowledge | Retainability (vs. batch size) | Faster forgetting rate | Slower forgetting rate (batch size 2048) | Positive retention |

Notes:
- Pretraining stage does not significantly affect the immediate ability to acquire knowledge (Effectivity), but model size does.
- Forgetting follows a power-law relationship, and larger batch sizes reduce the rate of forgetting.
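The power-law retainability trend can be illustrated with a small fit: a curve r(t) = a * t^(-b) becomes a straight line in log-log space, so ordinary linear regression recovers the exponent. The data points below are hypothetical, not taken from the paper:

```python
import math

# Illustrative retention measurements (fraction of initial gain retained
# after a given number of training steps since injection). Made-up values.
steps = [10, 100, 1000, 10000]
retention = [0.80, 0.50, 0.30, 0.19]

# Linear regression in log-log space: log r = log a - b * log t.
xs = [math.log(t) for t in steps]
ys = [math.log(r) for r in retention]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = -slope                     # forgetting exponent; larger b = faster forgetting
a = math.exp(my - slope * mx)  # intercept back in linear space

print(f"fit: r(t) ~ {a:.2f} * t^(-{b:.2f})")
```

Under this framing, the batch-size result in the table corresponds to a smaller fitted exponent b for the larger batch size.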
Main Takeaways
- Acquisition happens via small probability bumps that are diluted by subsequent updates; knowledge is only 'learned' if the accumulation outpaces the power-law forgetting.
- Data duplication accelerates forgetting of specific instances compared to deduplicated data streams.
- Larger batch sizes improve knowledge retention, suggesting a trade-off between compute efficiency and knowledge stability.
- The 'Long-tail' problem is explained by the fact that rare concepts appear too infrequently to overcome the power-law forgetting dilution.
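The accumulation-vs-dilution mechanism in the takeaways can be sketched with a toy simulation (my assumption of the mechanism, not the paper's code): each exposure adds a fixed log-prob bump that then decays as a power law, and a fact "sticks" only if exposures arrive often enough that accumulation outpaces the decay:

```python
def accumulated_gain(exposure_steps, horizon, bump=1.0, b=0.5):
    """Total surviving log-prob gain at `horizon`, given exposure times.

    Each exposure at step s contributes bump * dt^(-b), where dt is the
    number of steps it has had to decay. bump and b are arbitrary toy values.
    """
    total = 0.0
    for s in exposure_steps:
        if s <= horizon:
            dt = horizon - s + 1
            total += bump * dt ** (-b)
    return total

# A common concept seen every 10 steps vs. a long-tail concept seen twice.
frequent = accumulated_gain(range(0, 1000, 10), horizon=1000)
rare = accumulated_gain([0, 500], horizon=1000)
```

With these toy parameters the frequent concept retains far more accumulated gain than the rare one, which is the long-tail failure mode the last bullet describes.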