Evaluation Setup
Few-shot prompting on 10 reasoning datasets across arithmetic, commonsense, and symbolic tasks.
Benchmarks (representative subset of the 10):
- GSM8K (Arithmetic Reasoning)
- CSQA (Commonsense Reasoning)
- Letter Concatenation (Symbolic Reasoning)
- AQuA (Arithmetic Reasoning)
- SVAMP (Arithmetic Reasoning)
Metrics:
- Exact Match Accuracy
- Statistical methodology: Not explicitly reported in the paper
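Exact match compares the model's final extracted answer string against the reference. A minimal sketch, assuming whitespace-stripping as the only normalization (the paper's exact normalization rules are not specified here):

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly equal their reference
    after stripping leading/trailing whitespace."""
    assert len(predictions) == len(references)
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

print(exact_match_accuracy(["80.8", "42", "no"], ["80.8", "42", "yes"]))  # -> 0.666...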
Key Results
Iter-CoT outperforms baselines on arithmetic reasoning tasks using GPT-3.5-turbo.

| Benchmark              | Metric   | Baseline | This Paper | Δ     |
|------------------------|----------|----------|------------|-------|
| GSM8K                  | Accuracy | 77.6     | 80.8       | +3.2  |
| GSM8K                  | Accuracy | 77.5     | 80.8       | +3.3  |
| AQuA                   | Accuracy | 60.6     | 68.5       | +7.9  |
| Average (10 datasets)  | Accuracy | 77.7     | 81.5       | +3.8  |

Ablation studies confirm the necessity of the bootstrapping (correction) and summarization phases.

| Benchmark | Metric   | Baseline | This Paper | Δ     |
|-----------|----------|----------|------------|-------|
| GSM8K     | Accuracy | 68.4     | 80.8       | +12.4 |
| GSM8K     | Accuracy | 78.3     | 80.8       | +2.5  |
Main Takeaways
- Selecting 'challenging yet answerable' questions (those the model initially fails but can correct) creates better demonstrations than random or purely complex selection.
- Iterative self-correction combined with summarization produces cleaner, more robust reasoning chains than single-pass generation.
- The method generalizes well across model sizes (Llama-2-70B to GPT-4) and task types (Arithmetic, Commonsense, Symbolic).
- Even without ground truth labels (using GPT-4 as a judge), Iter-CoT achieves performance competitive with the labeled version.
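The 'challenging yet answerable' selection in the first takeaway can be sketched as a filter: keep only questions the model initially answers wrong but manages to correct within a bounded number of revision rounds. A minimal sketch, where `initial_answer` and `revise_answer` are hypothetical stand-ins for the actual LLM calls in Iter-CoT:

```python
def select_demonstrations(questions, gold, initial_answer, revise_answer, max_rounds=3):
    """Keep questions the model first gets wrong but self-corrects within
    max_rounds ('challenging yet answerable'); drop questions that are
    trivially easy (right on the first try) or never corrected."""
    demos = []
    for q in questions:
        ans = initial_answer(q)
        if ans == gold[q]:
            continue  # too easy: solved without correction
        for _ in range(max_rounds):
            ans = revise_answer(q, ans)  # bootstrapping: model revises its own answer
            if ans == gold[q]:
                demos.append(q)  # challenging yet answerable
                break
        # questions never corrected within max_rounds are discarded
    return demos
```

In the actual method the surviving questions' corrected reasoning chains are then summarized into the final few-shot demonstrations; this sketch only captures the selection criterion.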