| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (pp) |
|---|---|---|---|---|
| Zero-shot PS+ consistently outperforms standard Zero-shot-CoT across arithmetic, commonsense (CSQA), and symbolic (Last Letter) reasoning datasets. | ||||
| Average (6 Math Datasets) | Accuracy | 70.4 | 76.7 | +6.3 |
| GSM8K | Accuracy | 56.4 | 59.3 | +2.9 |
| CommonsenseQA (CSQA) | Accuracy | 65.2 | 71.9 | +6.7 |
| Last Letter | Accuracy | 64.8 | 75.2 | +10.4 |
| PS+ prompting outperforms Program-of-Thought (PoT) and performs comparably to Few-shot methods on arithmetic tasks. | ||||
| Average (6 Math Datasets), vs. PoT | Accuracy | 73.5 | 76.7 | +3.2 |
| Average (6 Math Datasets), vs. Few-shot | Accuracy | 77.6 | 76.7 | -0.9 |
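
The Δ column can be read as the absolute difference between the two accuracy columns, in percentage points. The snippet below is a minimal sketch that recomputes it from the values in the table; the `rows` list and the `delta` helper are illustrative names, not part of the paper, and the rounding to one decimal is an assumption made to match the table's precision.

```python
# (label, baseline accuracy, this-paper accuracy), copied from the table rows in order.
# The labels are illustrative; the last two rows belong to the second comparison group.
rows = [
    ("Average (6 Math Datasets), first group", 70.4, 76.7),
    ("GSM8K", 56.4, 59.3),
    ("CommonsenseQA (CSQA)", 65.2, 71.9),
    ("Last Letter", 64.8, 75.2),
    ("Average (6 Math Datasets), second group, row 1", 73.5, 76.7),
    ("Average (6 Math Datasets), second group, row 2", 77.6, 76.7),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute accuracy difference in percentage points, rounded to one decimal."""
    return round(ours - baseline, 1)


deltas = [delta(baseline, ours) for _, baseline, ours in rows]

for (label, _, _), d in zip(rows, deltas):
    # The explicit sign makes gains and regressions easy to tell apart.
    print(f"{label}: {d:+.1f}")
```

A negative value, as in the last row, means the baseline scored higher than the method reported in the "This Paper" column.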