Evaluation Setup
Zero-shot reasoning on arithmetic, commonsense, and symbolic tasks
Benchmarks:
- MultiArith (Arithmetic Reasoning)
- GSM8K (Arithmetic Reasoning)
- AddSub (Arithmetic Reasoning)
- AQuA (Arithmetic Reasoning)
- SVAMP (Arithmetic Reasoning)
- SingleEq (Arithmetic Reasoning)
- CommonsenseQA (Commonsense Reasoning)
- StrategyQA (Commonsense Reasoning)
- Last Letter (Symbolic Reasoning)
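The Last Letter task above is a symbolic-reasoning benchmark where the answer is the concatenation of the last letter of each word in a name. A minimal sketch of the ground-truth function (the helper name is illustrative, not from the paper):

```python
def last_letter_answer(name: str) -> str:
    """Ground truth for Last Letter Concatenation:
    join the final letter of every whitespace-separated word."""
    return "".join(word[-1] for word in name.split())

print(last_letter_answer("Elon Musk"))  # -> "nk"
```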
Metrics:
- Accuracy
- Statistical methodology: Not explicitly reported in the paper
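All benchmarks are scored by exact-match accuracy over final answers. A minimal sketch of the metric, assuming final answers have already been extracted from model outputs (this helper is illustrative, not the paper's evaluation code):

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answer."""
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(predictions)

print(exact_match_accuracy(["78", "12", "5"], ["78", "12", "6"]))  # 2 of 3 correct
```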
Key Results
Arithmetic Reasoning: EoT consistently outperforms standard Zero-shot CoT and the enhanced Plan-and-Solve (PS+) prompting across diverse math benchmarks using GPT-3.5-turbo.

| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|------|
| GSM8K | Accuracy | 75.1 | 78.3 | +3.2 |
| SVAMP | Accuracy | 77.8 | 83.1 | +5.3 |
| AQuA | Accuracy | 50.0 | 55.5 | +5.5 |

Comparison with Few-Shot: Surprisingly, Zero-shot EoT outperforms Few-shot Manual-CoT (which uses 8 examples) on arithmetic tasks, suggesting dynamic prompting can replace static demonstrations.

| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|-------|
| GSM8K | Accuracy | 56.5 | 78.3 | +21.8 |
| SingleEq | Accuracy | 92.3 | 96.5 | +4.2 |

Commonsense & Symbolic Reasoning: EoT shows improvements on logic and commonsense tasks, though gains on StrategyQA are more modest.

| Benchmark | Metric | Baseline | This Paper | Δ |
|-------------|--------|----------|------------|------|
| Last Letter | Accuracy | 78.8 | 83.8 | +5.0 |
| CommonsenseQA | Accuracy | 72.6 | 75.8 | +3.2 |
Main Takeaways
- Zero-shot EoT consistently outperforms static Zero-shot CoT and PS+ prompting across the tested datasets
- Dynamic prompt evolution allows zero-shot methods to surpass few-shot methods (Manual-CoT) in arithmetic reasoning, eliminating the need for manual example engineering
- The combination of evolutionary prompt generation, selection, and problem rewriting effectively handles the sensitivity of LLMs to prompt phrasing
- Effectiveness is particularly strong in arithmetic and symbolic tasks compared to commonsense reasoning
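The pipeline named in the takeaways (evolutionary prompt generation, selection, and problem rewriting) can be illustrated with a toy loop. Everything below is a simplified assumption for illustration, not the paper's implementation: the mutation operator is a trivial synonym swap, and `score_fn` stands in for a real fitness signal such as dev-set accuracy.

```python
import random

def mutate(prompt: str) -> str:
    # Toy mutation operator: swap in an alternative reasoning cue
    # (illustrative only; the real method generates prompt variants).
    cues = ["step by step", "carefully", "by breaking it into parts"]
    return prompt.replace("step by step", random.choice(cues))

def evolve_prompt(seed_prompt, score_fn, generations=3, population=4):
    """Greedy evolutionary loop: each generation, mutate the current best
    prompt into a small population and keep the highest-scoring candidate."""
    best = seed_prompt
    for _ in range(generations):
        candidates = [best] + [mutate(best) for _ in range(population - 1)]
        best = max(candidates, key=score_fn)
    return best
```

In a real setting, `score_fn` would evaluate each candidate prompt on held-out problems; the selected prompt (optionally combined with a rewritten problem statement) is then used for the final zero-shot query.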