| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Main results comparing RetroLLM against baselines on standard open-domain QA datasets, using a Llama-2-7B backbone. | ||||
| Natural Questions | EM | 47.7 | 51.15 | +3.45 |
| PopQA | EM | 32.0 | 49.16 | +17.16 |
| TriviaQA | EM | 51.3 | 69.15 | +17.85 |
| Performance on multi-hop reasoning datasets, where retrieving the correct chain of evidence is critical. | ||||
| 2WikiMultihopQA | F1 | 33.7 | 41.14 | +7.44 |
| Out-of-domain generalization results: the model is trained on Natural Questions (NQ) and tested on other datasets. | ||||
| TriviaQA (Out-of-domain) | EM | 53.2 | 53.48 | +0.28 |
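
The Δ column appears to be the plain difference between the "This Paper" and "Baseline" scores. The sketch below recomputes it from the table values as a sanity check. It assumes Δ is an absolute difference in metric points, not a relative gain, and the names `ROWS`, `delta`, and `DELTAS` are illustrative only. `Decimal` is used so that the two-decimal scores subtract exactly, without binary floating-point rounding noise.

```python
from decimal import Decimal

# (benchmark, metric, baseline, this paper), copied from the table as strings
# so that Decimal parses them exactly.
ROWS = [
    ("Natural Questions", "EM", "47.7", "51.15"),
    ("PopQA", "EM", "32.0", "49.16"),
    ("TriviaQA", "EM", "51.3", "69.15"),
    ("2WikiMultihopQA", "F1", "33.7", "41.14"),
    ("TriviaQA (Out-of-domain)", "EM", "53.2", "53.48"),
]


def delta(baseline: str, ours: str) -> Decimal:
    """Absolute improvement in metric points (assumed definition of the Δ column)."""
    return Decimal(ours) - Decimal(baseline)


# Map each benchmark name to its recomputed Δ.
DELTAS = {name: delta(baseline, ours) for name, _metric, baseline, ours in ROWS}

for name, d in DELTAS.items():
    # The "+" format flag prints an explicit sign, matching the table's style.
    print(f"{name}: {d:+}")
```

Under this assumption, each recomputed value matches the Δ reported in the corresponding row.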