| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Main results on Spider datasets show QDecomp+InterCOL outperforms both Standard prompting and other CoT styles (Chain-of-Thought, Least-to-Most). | | | | |
| Spider Dev | Test-suite execution accuracy (TS) | 63.2 | 68.4 | +5.2 |
| Spider Dev | Test-suite execution accuracy (TS) | 66.0 | 68.4 | +2.4 |
| Spider Dev | Test-suite execution accuracy (TS) | 56.8 | 68.4 | +11.6 |
| Spider Realistic | Test-suite execution accuracy (TS) | 51.0 | 56.5 | +5.5 |
| Robustness check using 'Extra Hard' (G3) examples for in-context learning. | | | | |
| Spider Dev | Test-suite execution accuracy (TS) | 58.2 | 68.8 | +10.6 |