| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Main comparison of CoRAG against baselines across multiple QA datasets. Note that CoRAG is trained *only* on PopQA. | | | | |
| PopQA | Accuracy | 66.2 | 71.2 | +5.0 |
| TriviaQA | Accuracy | 78.4 | 81.0 | +2.6 |
| Natural Questions (NQ) | Accuracy | 59.3 | 72.4 | +13.1 |
| 2WikiMultiHopQA | Accuracy | 42.7 | 58.2 | +15.5 |
| Ablation studies isolating the impact of joint training versus training each component individually. | | | | |
| PopQA | Accuracy | 63.5 | 71.2 | +7.7 |
| PopQA | Accuracy | 51.8 | 71.2 | +19.4 |
| Cross-domain generalization results on code and table tasks. | | | | |
| HumanEval+ | Pass@1 | 56.1 | 62.2 | +6.1 |