| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Overall accuracy comparison showing the dominance of closed-source models and code-based prompting. | | | | |
| DataBench_lite | Average Accuracy | 33.1 | 63.0 | +29.9 |
| DataBench_lite | Average Accuracy | 33.4 | 63.0 | +29.6 |
| Performance breakdown by answer type highlights specific weaknesses in list processing. | | | | |
| DataBench_lite | Accuracy (List[Number]) | 1.6 | 56.5 | +54.9 |
| DataBench_lite | Accuracy (Boolean) | 50.0 | 52.7 | +2.7 |
| Complexity analysis showing performance drop when reasoning over multiple columns. | | | | |
| DataBench_lite | Accuracy (Multiple Columns) | 67.0 | 57.4 | -9.6 |