| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Zero-shot comparison on arithmetic reasoning datasets (baseline: Zero-shot-CoT). PS+ outperforms Zero-shot-CoT on all four datasets listed. | ||||
| GSM8K | Accuracy | 56.4 | 59.3 | +2.9 |
| MultiArith | Accuracy | 83.8 | 91.8 | +8.0 |
| SVAMP | Accuracy | 69.9 | 75.7 | +5.8 |
| AQuA | Accuracy | 38.9 | 46.0 | +7.1 |
| Comparison against few-shot methods (baseline: few-shot Manual-CoT). PS+ is within 0.9 points of Manual-CoT on average despite being zero-shot. | ||||
| Average (6 Math Datasets) | Accuracy | 77.6 | 76.7 | -0.9 |
| Commonsense and symbolic reasoning results. | ||||
| CommonsenseQA | Accuracy | 65.2 | 71.9 | +6.7 |
| Last Letter | Accuracy | 64.8 | 75.2 | +10.4 |
| Ablation with Self-Consistency (SC) decoding. PS+ keeps its advantage over the baseline when SC is applied. | ||||
| GSM8K | Accuracy (w/ SC) | 70.7 | 73.7 | +3.0 |
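
The Δ column can be sanity-checked with a short script. This is a minimal sketch, assuming Δ is the absolute difference in accuracy points (This Paper minus Baseline) rather than a relative improvement. The names `ROWS`, `delta`, and `find_mismatches` are illustrative and do not come from the paper.

```python
# Rows copied from the table above: (benchmark, baseline, this_paper, reported_delta).
# Assumption: delta is measured in absolute accuracy points.
ROWS = [
    ("GSM8K", 56.4, 59.3, 2.9),
    ("MultiArith", 83.8, 91.8, 8.0),
    ("SVAMP", 69.9, 75.7, 5.8),
    ("AQuA", 38.9, 46.0, 7.1),
    ("Average (6 Math Datasets)", 77.6, 76.7, -0.9),
    ("CommonsenseQA", 65.2, 71.9, 6.7),
    ("Last Letter", 64.8, 75.2, 10.4),
    ("GSM8K (w/ SC)", 70.7, 73.7, 3.0),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute difference in accuracy points, rounded to one decimal."""
    return round(ours - baseline, 1)


def find_mismatches(rows):
    """Return the benchmarks whose reported delta differs from the recomputed one."""
    return [name for name, base, ours, reported in rows
            if delta(base, ours) != reported]


mismatches = find_mismatches(ROWS)
print(mismatches)  # an empty list means every reported delta is consistent
```

Rounding to one decimal avoids spurious mismatches from floating-point subtraction. The only negative entry is the few-shot comparison, where the baseline column holds the Manual-CoT average.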