| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Main results comparing Contrastive CoT against standard prompting and conventional Chain-of-Thought (CoT), using GPT-3.5-Turbo. | | | | |
| GSM8K | Accuracy | 69.2 | 79.0 | +9.8 |
| Bamboogle | Accuracy | 40.8 | 56.8 | +16.0 |
| StrategyQA | Accuracy | 55.8 | 66.2 | +10.4 |
| SVAMP | Accuracy | 67.2 | 81.6 | +14.4 |
| Results when combining the prompting methods with Self-Consistency (SC), a decoding strategy that takes a majority vote over multiple sampled outputs. | | | | |
| GSM8K | Accuracy | 71.0 | 86.2 | +15.2 |
| Bamboogle | Accuracy | 40.8 | 58.4 | +17.6 |
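
The majority vote behind Self-Consistency can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `majority_vote` and the sample answers are hypothetical, and the sketch assumes the final answer has already been extracted from each sampled output as a string.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the sampled outputs.

    Assumption: `answers` holds one final answer string per sampled
    reasoning chain. Ties go to the answer encountered first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) returns a list with one (answer, count) pair.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical final answers from five sampled reasoning chains.
sampled = ["18", "18", "20", "18", "17"]
print(majority_vote(sampled))
```

In practice the answers usually need normalizing before the vote (for example, stripping units or whitespace) so that equivalent answers are counted together.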