| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (points) |
|---|---|---|---|---|
| Main results on the GPT-2 backbone: SIM-CoT outperforms both the implicit baselines and the explicit CoT baseline. | ||||
| GSM8k-Aug | Accuracy | 44.3 | 52.5 | +8.2 |
| GSM8k-Aug | Accuracy | 50.4 | 52.5 | +2.1 |
| Scaling to larger LLaMA models: consistent improvements over CODI, the state-of-the-art implicit method. | ||||
| GSM8k-Aug | Accuracy | Not reported in the paper | Not reported in the paper | +3.4 |
| GSM8k-Aug | Accuracy | Not reported in the paper | Not reported in the paper | +3.0 |
| Out-of-domain generalization results (averaged across OOD datasets). | ||||
| OOD Average (SVAMP, GSM-Hard, MultiArith) | Accuracy | Not reported in the paper | Not reported in the paper | +4.3 |
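
The Δ column appears to be an absolute difference in accuracy points (This Paper minus Baseline), not a relative improvement. The short sketch below checks that reading against the two rows that report both values. The helper name `delta_points` is hypothetical and chosen only for illustration. The rows that list only a Δ cannot be checked this way.

```python
# Rows copied from the GPT-2 block of the table above:
# (baseline accuracy, this paper's accuracy, reported delta).
rows = [
    (44.3, 52.5, 8.2),
    (50.4, 52.5, 2.1),
]


def delta_points(baseline: float, ours: float) -> float:
    """Absolute gain in accuracy points, rounded to one decimal place."""
    # Rounding absorbs floating-point noise such as 8.200000000000003.
    return round(ours - baseline, 1)


for baseline, ours, reported in rows:
    # Assumption: the reported delta is ours - baseline in absolute points.
    assert delta_points(baseline, ours) == reported
```

Both assertions hold, so the reported deltas are consistent with an absolute-points interpretation.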