| Benchmark | Metric | Baseline | Tecton (this paper) | Δ |
|---|---|---|---|---|
| In-distribution: Tecton outperforms the baselines on GSM8K-XL and FuncQA, with the largest gain on the multi-hop FuncQA-MH split, where accuracy rises from 10.1 to 20.6. | ||||
| GSM8K-XL | Accuracy | 47.9 | 55.1 | +7.2 |
| GSM8K-XL | Accuracy | 52.8 | 55.1 | +2.3 |
| FuncQA-MH | Accuracy | 10.1 | 20.6 | +10.5 |
| Out-of-distribution (OOD): on three unseen datasets, Tecton beats baselines trained on the same data by 7.7 to 9.7 accuracy points. | ||||
| ASDiv-XL | Accuracy | 51.1 | 59.2 | +8.1 |
| MAWPS-XL | Accuracy | 44.6 | 52.3 | +7.7 |
| SVAMP-XL | Accuracy | 45.0 | 54.7 | +9.7 |
| Ablation: results on FuncQA-MH indicate that bias calibration and dynamic retrieval both contribute to the gains. | ||||
| FuncQA-MH | Accuracy | 13.2 | 20.6 | +7.4 |