| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Single-step retrieval results show MS-RAG outperforming the baselines, most strongly on HotpotQA (up to +18.6 Recall@2). | ||||
| HotpotQA | Recall@2 | 59.0 | 77.6 | +18.6 |
| HotpotQA | Recall@2 | 69.5 | 79.4 | +9.9 |
| Average (3 datasets) | Recall@2 | 57.2 | 66.2 | +9.0 |
| Multi-step retrieval results (using IRCoT) indicate that MS-RAG also provides better documents for reasoning chains. | ||||
| Average (3 datasets) | Recall@2 | 66.2 | 68.2 | +2.0 |
| Average (3 datasets) | Recall@2 | 53.7 | 71.8 | +18.1 |
| QA quality evaluation shows MS-RAG generates more correct and comprehensive answers than GraphRAG. | ||||
| Mixed datasets (100 samples) | Correctness | 27.7 | 72.3 | +44.6 |
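
The Δ column is the absolute difference between the "This Paper" and "Baseline" columns. As a minimal sketch for readers who want to re-check the arithmetic, the snippet below recomputes each difference from the values in the table. The names `ROWS`, `check_deltas`, and `relative_gain` are hypothetical helpers introduced here for illustration, and the section labels are shorthand for the caption rows. The relative gain is an extra view that the table itself does not report.

```python
# Values copied from the table above:
# (section, benchmark, metric, baseline, this_paper, reported_delta)
ROWS = [
    ("single-step", "HotpotQA", "Recall@2", 59.0, 77.6, 18.6),
    ("single-step", "HotpotQA", "Recall@2", 69.5, 79.4, 9.9),
    ("single-step", "Average (3 datasets)", "Recall@2", 57.2, 66.2, 9.0),
    ("multi-step", "Average (3 datasets)", "Recall@2", 66.2, 68.2, 2.0),
    ("multi-step", "Average (3 datasets)", "Recall@2", 53.7, 71.8, 18.1),
    ("qa-quality", "Mixed datasets (100 samples)", "Correctness", 27.7, 72.3, 44.6),
]


def check_deltas(rows, tol=0.05):
    """Return the rows whose reported delta differs from this_paper - baseline."""
    mismatches = []
    for row in rows:
        *_, baseline, ours, reported = row
        # Tolerance absorbs floating-point noise at one-decimal precision.
        if abs((ours - baseline) - reported) > tol:
            mismatches.append(row)
    return mismatches


def relative_gain(baseline, ours):
    """Improvement over the baseline, as a percentage of the baseline."""
    return 100.0 * (ours - baseline) / baseline


if __name__ == "__main__":
    print("rows with inconsistent deltas:", check_deltas(ROWS))
    for section, benchmark, metric, baseline, ours, _ in ROWS:
        gain = relative_gain(baseline, ours)
        print(f"{section:12s} {benchmark:30s} {metric:12s} {gain:6.1f}% relative")
```

Rows that share a benchmark and metric label are kept as separate entries, exactly as they appear in the table.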