DATER significantly outperforms baselines on TabFact, surpassing even human performance when combined with fine-tuned models.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| TabFact | Accuracy | 92.1 | 93.0 | +0.9 |
| TabFact | Accuracy | 72.6 | 85.6 | +13.0 |
| TabFact | Accuracy | 85.1 | 85.6 | +0.5 |

DATER achieves state-of-the-art results on WikiTableQuestions, showing strong generalization to complex questions.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| WikiTableQuestions | Accuracy | 61.9 | 65.9 | +4.0 |
| WikiTableQuestions | Accuracy | 47.6 | 65.9 | +18.3 |

Ablation studies confirm that both evidence and question decomposition are critical.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| WikiTableQuestions | Accuracy | 61.4 | 65.9 | +4.5 |
| TabFact | Accuracy | 81.8 | 85.6 | +3.8 |
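
As a quick sanity check on the Δ columns, a minimal Python sketch (an illustration, not part of the paper's code) recomputing each improvement as `this_paper - baseline`:

```python
# Each row is (benchmark, baseline, this_paper), taken from the tables above.
rows = [
    ("TabFact", 92.1, 93.0),
    ("TabFact", 72.6, 85.6),
    ("TabFact", 85.1, 85.6),
    ("WikiTableQuestions", 61.9, 65.9),
    ("WikiTableQuestions", 47.6, 65.9),
    ("WikiTableQuestions", 61.4, 65.9),
    ("TabFact", 81.8, 85.6),
]

# Recompute Δ, rounding to one decimal to match the reported precision.
deltas = [round(new - old, 1) for _, old, new in rows]
print(deltas)  # [0.9, 13.0, 0.5, 4.0, 18.3, 4.5, 3.8]
```

All seven recomputed deltas match the reported Δ values.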