| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---:|---:|---:|
| ATM consistently outperforms baselines on standard QA benchmarks when facing fabricated documents. | | | | |
| Natural Questions | Exact Match (EM) | 45.02 | 51.17 | +6.15 |
| TriviaQA | Exact Match (EM) | 53.68 | 55.05 | +1.37 |
| WebQuestions | Exact Match (EM) | 24.95 | 29.77 | +4.82 |
| PopQA | F1 Score | 46.22 | 47.96 | +1.74 |
| Ablation study confirms the necessity of both fabrication generation and list permutation. | | | | |
| Natural Questions | Exact Match (EM) | 48.25 | 51.17 | +2.92 |
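
The Δ column appears to be the absolute difference between the "This Paper" and "Baseline" scores, in the same units as the metric. The short sketch below recomputes it from the rows of the table as a consistency check. The names `ROWS`, `delta`, and `check_rows` are illustrative only, and the "(ablation)" label on the last row is added here just to tell the two Natural Questions rows apart.

```python
# Rows copied from the table: (benchmark, metric, baseline, this_paper, reported_delta).
# The "(ablation)" suffix is an illustrative label, not a name used in the table.
ROWS = [
    ("Natural Questions", "EM", 45.02, 51.17, 6.15),
    ("TriviaQA", "EM", 53.68, 55.05, 1.37),
    ("WebQuestions", "EM", 24.95, 29.77, 4.82),
    ("PopQA", "F1", 46.22, 47.96, 1.74),
    ("Natural Questions (ablation)", "EM", 48.25, 51.17, 2.92),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute improvement, rounded to the table's two decimal places."""
    return round(ours - baseline, 2)


def check_rows(rows):
    """Return (benchmark, recomputed_delta, matches_reported) for each row."""
    results = []
    for name, _metric, baseline, ours, reported in rows:
        recomputed = delta(baseline, ours)
        # Compare with a tolerance below the table's precision to avoid
        # spurious mismatches from floating-point representation.
        results.append((name, recomputed, abs(recomputed - reported) < 0.005))
    return results


if __name__ == "__main__":
    for name, recomputed, ok in check_rows(ROWS):
        status = "matches" if ok else "differs from"
        print(f"{name}: recomputed delta {recomputed:+.2f} {status} the reported value")
```

Under this reading, every reported Δ agrees with the recomputed difference to two decimal places.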