| Benchmark | Metric | Baseline | IRCoT | Δ (IRCoT − Baseline) |
|---|---|---|---|---|
| IRCoT significantly improves retrieval recall compared to one-step retrieval across all four datasets. | ||||
| HotpotQA | Recall | 40.9 | 62.3 | +21.4 |
| 2WikiMultihopQA | Recall | 39.4 | 51.8 | +12.4 |
| IRCoT leads to substantial gains in downstream QA F1 scores compared to one-step retrieval baselines. | ||||
| HotpotQA | F1 | 46.3 | 59.2 | +12.9 |
| MuSiQue | F1 | 24.2 | 39.5 | +15.3 |
| Smaller models using IRCoT can outperform much larger models using standard retrieval. | ||||
| HotpotQA | F1 | 46.3 | 49.6 | +3.3 |
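
The Δ column is the method's score minus the baseline score. As a quick sanity check, the sketch below recomputes each difference from the values in the table. The names `ROWS`, `delta`, and `mismatched_rows` are illustrative and not taken from the paper; rounding to one decimal place is assumed because that is the precision the table reports.

```python
# Rows copied from the table: (benchmark, metric, baseline, score, reported_delta).
# The tuple layout and all names here are illustrative assumptions.
ROWS = [
    ("HotpotQA", "Recall", 40.9, 62.3, 21.4),
    ("2WikiMultihopQA", "Recall", 39.4, 51.8, 12.4),
    ("HotpotQA", "F1", 46.3, 59.2, 12.9),
    ("MuSiQue", "F1", 24.2, 39.5, 15.3),
    ("HotpotQA", "F1", 46.3, 49.6, 3.3),
]


def delta(baseline: float, score: float) -> float:
    """Absolute improvement in points, rounded to the table's one-decimal precision."""
    return round(score - baseline, 1)


def mismatched_rows(rows):
    """Return the rows whose reported delta differs from score - baseline."""
    return [row for row in rows if delta(row[2], row[3]) != row[4]]


if __name__ == "__main__":
    for benchmark, metric, baseline, score, reported in ROWS:
        print(f"{benchmark} {metric}: {delta(baseline, score):+.1f} (reported {reported:+.1f})")
    print("mismatches:", mismatched_rows(ROWS))
```

An empty mismatch list means every reported Δ agrees with the two scores next to it. These are absolute point differences, not relative improvements.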