| Benchmark | Metric | Baseline | This Paper | Δ (This Paper − Baseline) |
|---|---|---|---|---|
| Main comparison on Natural Questions (NQ) showing RECITE outperforms direct prompting across multiple models. | | | | |
| Natural Questions (NQ) | EM | 28.98 | 31.34 | +2.36 |
| Natural Questions (NQ) | EM | 31.45 | 35.84 | +4.39 |
| Results on TriviaQA showing consistent improvements, particularly for Codex. | | | | |
| TriviaQA | EM | 81.84 | 83.50 | +1.66 |
| Multi-hop reasoning results on HotpotQA, comparing against Chain-of-Thought (CoT). | | | | |
| HotpotQA | EM | 20.51 | 26.46 | +5.95 |
| HotpotQA | EM | 23.73 | 26.46 | +2.73 |
| Natural Questions (NQ) | EM | 31.34 | 33.23 | +1.89 |
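
The Δ column can be re-derived from the two score columns. The following is a minimal sketch, assuming Δ is simply the reported EM minus the baseline EM in percentage points. The names `ROWS` and `em_delta` are illustrative and not from the paper, and the values are copied from the table rows in order. `Decimal` is used so the two-decimal scores subtract exactly.

```python
from decimal import Decimal

# (benchmark, baseline EM, reported EM), copied from the table rows in order.
# Scores are kept as strings so Decimal parses them without float rounding.
ROWS = [
    ("Natural Questions (NQ)", "28.98", "31.34"),
    ("Natural Questions (NQ)", "31.45", "35.84"),
    ("TriviaQA", "81.84", "83.50"),
    ("HotpotQA", "20.51", "26.46"),
    ("HotpotQA", "23.73", "26.46"),
    ("Natural Questions (NQ)", "31.34", "33.23"),
]


def em_delta(baseline: str, reported: str) -> Decimal:
    """Return reported minus baseline, in EM percentage points."""
    return Decimal(reported) - Decimal(baseline)


# One delta per table row, in the same order as ROWS.
deltas = [em_delta(baseline, reported) for _, baseline, reported in ROWS]

for (name, baseline, reported), delta in zip(ROWS, deltas):
    print(f"{name}: {baseline} -> {reported} ({delta:+.2f})")
```

Under this assumption, every recomputed value matches the Δ printed in the corresponding row.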