| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| QPaug significantly outperforms baseline methods (Chain-of-Thoughts, Rerank, Self-verification, SuRE) across all datasets using Contriever and GPT-3.5. | ||||
| Natural Questions (NQ) | F1 | 40.4 | 44.6 | +4.2 |
| HotpotQA | F1 | 33.6 | 45.1 | +11.5 |
| 2WikiMultihopQA | F1 | 32.6 | 35.5 | +2.9 |
| Ablation on Passage Generation (Pgen) shows that adding a self-generated passage consistently improves F1 scores, especially on multi-hop datasets where retrieval is difficult. | ||||
| 2WikiMultihopQA | F1 | 36.5 | 47.8 | +11.3 |
| Ablation on Question Augmentation (Qaug) shows substantial improvements in retrieval recall. | ||||
| HotpotQA | Recall@10 | 47.47 | 62.08 | +14.61 |
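
The Δ column is the plain difference "This Paper" minus "Baseline", in the same units as the metric (absolute points, not a relative percentage). A minimal sketch, assuming the rows are transcribed by hand from the table above, recomputes that column and derives a relative gain for comparison. The helper names `delta`, `relative_gain`, and `check_rows` are hypothetical and are not taken from the paper or its code.

```python
from decimal import Decimal

# Rows copied from the table above:
# (benchmark, metric, baseline, this_paper, reported_delta)
ROWS = [
    ("Natural Questions (NQ)", "F1", "40.4", "44.6", "+4.2"),
    ("HotpotQA", "F1", "33.6", "45.1", "+11.5"),
    ("2WikiMultihopQA", "F1", "32.6", "35.5", "+2.9"),
    ("2WikiMultihopQA (Pgen ablation)", "F1", "36.5", "47.8", "+11.3"),
    ("HotpotQA (Qaug ablation)", "Recall@10", "47.47", "62.08", "+14.61"),
]


def delta(baseline: str, ours: str) -> Decimal:
    """Absolute improvement in metric points (Decimal avoids float rounding)."""
    return Decimal(ours) - Decimal(baseline)


def relative_gain(baseline: str, ours: str) -> float:
    """Improvement as a percentage of the baseline value."""
    return float(delta(baseline, ours) / Decimal(baseline) * 100)


def check_rows(rows) -> list[bool]:
    """True for each row whose reported delta matches the recomputed one."""
    return [delta(b, o) == Decimal(d) for _, _, b, o, d in rows]


if __name__ == "__main__":
    for name, metric, b, o, d in ROWS:
        print(f"{name} [{metric}]: delta={delta(b, o)} "
              f"(reported {d}), relative={relative_gain(b, o):.1f}%")
```

Using `Decimal` on the string values keeps the comparison exact; subtracting the same numbers as floats can produce results such as `4.200000000000003`, which would fail an equality check against the reported value.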