| Benchmark | Metric | Baseline | This Paper | Δ (This Paper − Baseline) |
|---|---|---|---|---|
| Main comparison on Open-Domain QA tasks showing improvements over SFT and baselines. | | | | |
| Natural Questions | EM | 46.22 | 49.76 | +3.54 |
| TriviaQA | EM | 69.09 | 71.94 | +2.85 |
| Natural Questions | EM | 48.23 | 49.76 | +1.53 |
| Results on Fact Verification tasks demonstrating robustness. | | | | |
| PubHealth | Accuracy | 65.35 | 73.18 | +7.83 |
| Ablation study analyzing the contribution of optimizing each module. | | | | |
| Natural Questions | EM | 45.04 | 49.76 | +4.72 |
| Natural Questions | EM | 49.56 | 49.76 | +0.20 |
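
The Δ column appears to be the absolute difference in points between the "This Paper" score and the baseline score. The sketch below recomputes it for every row under that assumption. The names `rows` and `delta` are illustrative and do not come from the paper; the numbers are copied from the table as-is.

```python
# Rows copied from the table: (benchmark, metric, baseline, this_paper, reported_delta).
# Assumption: delta = this_paper - baseline, in absolute points.
rows = [
    ("Natural Questions", "EM", 46.22, 49.76, 3.54),
    ("TriviaQA", "EM", 69.09, 71.94, 2.85),
    ("Natural Questions", "EM", 48.23, 49.76, 1.53),
    ("PubHealth", "Accuracy", 65.35, 73.18, 7.83),
    ("Natural Questions", "EM", 45.04, 49.76, 4.72),
    ("Natural Questions", "EM", 49.56, 49.76, 0.20),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute improvement in points, rounded to two decimals like the table."""
    return round(ours - baseline, 2)


def check_rows(table) -> list:
    """Return the rows whose reported delta disagrees with the recomputed one."""
    mismatches = []
    for benchmark, metric, baseline, ours, reported in table:
        if abs(delta(baseline, ours) - reported) > 1e-9:
            mismatches.append((benchmark, metric, baseline, ours, reported))
    return mismatches


if __name__ == "__main__":
    bad = check_rows(rows)
    print("all deltas consistent" if not bad else f"mismatches: {bad}")
```

An empty result from `check_rows(rows)` means every reported Δ matches the recomputed difference to two decimal places.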