| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (pp) |
|---|---|---|---|---|
| Pattern-CoT consistently outperforms baselines on LLaMA-2-7B across various reasoning tasks. | ||||
| MultiArith | Accuracy | 90.8 | 94.7 | +3.9 |
| SVAMP | Accuracy | 63.7 | 69.7 | +6.0 |
| Coin-Flip | Accuracy | 46.4 | 53.4 | +7.0 |
| GSM8K | Accuracy | 39.4 | 40.9 | +1.5 |
| Ablation study on demonstration subsets shows that covering the full set of operations is crucial. | ||||
| GSM8K | Accuracy | 38.5 | 40.9 | +2.4 |
| MultiArith | Accuracy | 89.3 | 95.5 | +6.2 |
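
The Δ column appears to be the difference between the "This Paper" and "Baseline" accuracies, in percentage points. A minimal sketch that recomputes it from the values in the table follows. The group labels `main` and `ablation` and the helper name `recompute_deltas` are illustrative assumptions, not names used by the paper.

```python
# Values copied from the table: (group, benchmark, baseline, this_paper, reported_delta).
# "main" and "ablation" are hypothetical labels for the two row groups.
ROWS = [
    ("main", "MultiArith", 90.8, 94.7, 3.9),
    ("main", "SVAMP", 63.7, 69.7, 6.0),
    ("main", "Coin-Flip", 46.4, 53.4, 7.0),
    ("main", "GSM8K", 39.4, 40.9, 1.5),
    ("ablation", "GSM8K", 38.5, 40.9, 2.4),
    ("ablation", "MultiArith", 89.3, 95.5, 6.2),
]


def recompute_deltas(rows):
    """Return (group, benchmark, recomputed_delta, reported_delta) per row.

    The difference is rounded to one decimal place to match the table's
    precision and to absorb floating-point noise.
    """
    return [
        (group, bench, round(ours - baseline, 1), reported)
        for group, bench, baseline, ours, reported in rows
    ]


if __name__ == "__main__":
    for group, bench, delta, reported in recompute_deltas(ROWS):
        status = "ok" if delta == reported else "MISMATCH"
        print(f"{group:8s} {bench:10s} {delta:+.1f} (reported {reported:+.1f}) {status}")
```

Rounding to one decimal place is a choice made here so that the comparison against the reported values is exact; it is not something the table itself specifies.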