| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (points) |
|---|---|---|---|---|
| Main comparison on Llama-2 7B showing the impact of different training strategies on absorbing knowledge from Wiki2023-film documents. | ||||
| Wiki2023-film-test-QA | EM Accuracy | 30.3 | 48.1 | +17.8 |
| Wiki2023-film-test-QA | EM Accuracy | 27.2 | 48.1 | +20.9 |
| Main comparison on Llama-2 70B showing consistent scaling of the PIT method. | ||||
| Wiki2023-film-test-QA | EM Accuracy | 46.4 | 62.7 | +16.3 |
| Cross-domain experiments showing generalization when training on one domain and testing on another. | ||||
| Wiki2023-film-test-QA | EM Accuracy | 30.3 | 38.8 | +8.5 |
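
The Δ column appears to be the difference between the "This Paper" score and the baseline score. The following is a minimal sketch that recomputes it from the values in the table, assuming that definition. The row labels and the `delta` helper are hypothetical names introduced only for this illustration; they do not come from the paper.

```python
from decimal import Decimal

# Rows copied from the table above: (label, baseline, this paper, reported delta).
# The labels are illustrative only; the table does not name the baselines.
ROWS = [
    ("Llama-2 7B, row 1", "30.3", "48.1", "+17.8"),
    ("Llama-2 7B, row 2", "27.2", "48.1", "+20.9"),
    ("Llama-2 70B", "46.4", "62.7", "+16.3"),
    ("Cross-domain", "30.3", "38.8", "+8.5"),
]


def delta(baseline: str, ours: str) -> str:
    """Return the signed difference ours - baseline as a string.

    Decimal is used instead of float so that one-decimal scores
    subtract exactly, with no binary rounding noise.
    """
    return f"{Decimal(ours) - Decimal(baseline):+}"


def check_rows(rows=ROWS) -> list[bool]:
    """Compare each recomputed delta with the reported one."""
    return [delta(base, ours) == reported for _, base, ours, reported in rows]


if __name__ == "__main__":
    for (label, base, ours, reported), ok in zip(ROWS, check_rows()):
        status = "matches" if ok else "differs"
        print(f"{label}: recomputed {delta(base, ours)}, reported {reported} ({status})")
```

Under this assumption, every reported Δ agrees with the recomputed difference, so the column is internally consistent with the two score columns.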