| Benchmark | Metric | Baseline | This Paper | Δ (This Paper − Baseline) |
|---|---|---|---|---|
| Main results: MA-RAG (8B) outperforms same-size and larger baselines on open-domain QA benchmarks. | | | | |
| Natural Questions (NQ) | EM | 48.2 | 52.5 | +4.3 |
| HotpotQA | EM | 41.6 | 51.1 | +9.5 |
| 2WikimQA | EM | 43.3 | 46.4 | +3.1 |
| State-of-the-art results with larger or stronger backbone models (Llama3-70B and GPT-4o-mini). | | | | |
| Natural Questions (NQ) | EM | 53.6 | 59.5 | +5.9 |
| HotpotQA | EM | 50.3 | 52.1 | +1.8 |
| Generalization to the medical domain without fine-tuning. | | | | |
| MedMCQA | Accuracy | 54.6 | 60.2 | +5.6 |