Impact of Question Prompting on PopQA-TP accuracy and coherence across multiple models.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| PopQA-TP | EM (Exact Match) | 40.29 | 46.95 | +6.66 |
| PopQA-TP | Coherence | 53.21 | 81.21 | +28.00 |
| PopQA-TP | EM (Exact Match) | 54.20 | 62.73 | +8.53 |

Comparison of retrieval sources (Questions vs Paragraphs) on Open Domain QA (Mixtral).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Open Domain QA (Avg across NQ, Quora, PAQ, TriviaQA) | Correctness | 0.72 | 0.74 | +0.02 |

Comparison of Support Question generation methods (Retrieval vs Generation).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Open Domain QA | Correctness | 78.0 | 79.3 | +1.3 |
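
The Δ column above appears to be the plain difference This Paper minus Baseline. A minimal sketch for re-deriving it is given below. The numbers are copied from the rows above, while the names `ROWS` and `check_deltas` are hypothetical helpers introduced only for this illustration. Note that the Correctness rows are not reported on the same scale (0.72 vs 78.0), so deltas are only comparable within a row.

```python
import math

# (benchmark, metric, baseline, this_paper, reported_delta), copied from the rows above.
ROWS = [
    ("PopQA-TP", "EM (Exact Match)", 40.29, 46.95, 6.66),
    ("PopQA-TP", "Coherence", 53.21, 81.21, 28.00),
    ("PopQA-TP", "EM (Exact Match)", 54.20, 62.73, 8.53),
    ("Open Domain QA (Avg across NQ, Quora, PAQ, TriviaQA)", "Correctness", 0.72, 0.74, 0.02),
    ("Open Domain QA", "Correctness", 78.0, 79.3, 1.3),
]


def check_deltas(rows, tol=1e-9):
    """Return the rows whose reported delta differs from this_paper - baseline."""
    mismatches = []
    for benchmark, metric, baseline, this_paper, reported in rows:
        computed = this_paper - baseline
        # Compare with a tolerance to absorb floating-point rounding.
        if not math.isclose(computed, reported, abs_tol=tol):
            mismatches.append((benchmark, metric, computed, reported))
    return mismatches


if __name__ == "__main__":
    # An empty list means every reported delta matches the recomputed one.
    print(check_deltas(ROWS))
```

Under the assumption that Δ is an absolute difference (not a relative gain), an empty result from `check_deltas(ROWS)` indicates the column is internally consistent.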