| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Open-RAG outperforms the baseline on multi-hop reasoning tasks, supporting the effectiveness of its MoE architecture for complex queries. | ||||
| HotpotQA | Accuracy | 31.9 | 37.5 | +5.6 |
| 2WikiMultiHopQA | Accuracy | 27.4 | 29.7 | +2.3 |
| On single-hop short-form QA, Open-RAG matches or beats proprietary models. | ||||
| PopQA | Accuracy | 48.2 | 57.8 | +9.6 |
| PubHealth | Accuracy | 69.0 | 73.2 | +4.2 |
| On long-form generation, Open-RAG maintains factuality, improving FactScore over the baseline. | ||||
| Bio (Biography Generation) | FactScore | 70.2 | 73.9 | +3.7 |
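
The Δ column appears to be the plain difference "This Paper" minus "Baseline". The following is a minimal sketch, under that assumption, that recomputes each difference from the table's scores and compares it with the reported value. The names `RESULTS` and `recompute_delta` are illustrative and do not come from the paper.

```python
# Rows copied from the table above:
# (benchmark, metric, baseline, this_paper, reported_delta)
RESULTS = [
    ("HotpotQA", "Accuracy", 31.9, 37.5, 5.6),
    ("2WikiMultiHopQA", "Accuracy", 27.4, 29.7, 2.3),
    ("PopQA", "Accuracy", 48.2, 57.8, 9.6),
    ("PubHealth", "Accuracy", 69.0, 73.2, 4.2),
    ("Bio (Biography Generation)", "FactScore", 70.2, 73.9, 3.7),
]


def recompute_delta(baseline: float, ours: float) -> float:
    """Return ours - baseline, rounded to one decimal like the table."""
    # Assumption: the delta is an absolute difference, not a relative gain.
    return round(ours - baseline, 1)


for name, metric, baseline, ours, reported in RESULTS:
    delta = recompute_delta(baseline, ours)
    status = "ok" if delta == reported else "MISMATCH"
    print(f"{name:<28} {metric:<9} {delta:+.1f} (reported {reported:+.1f}) {status}")
```

Rounding to one decimal place before comparing avoids spurious mismatches from floating-point subtraction.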