| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (pp) |
|---|---|---|---|---|
| Main comparison on code-davinci-002: Active-Prompt (with the disagreement uncertainty metric) outperforms standard Self-Consistency and Auto-CoT across datasets. | ||||
| GSM8K | Accuracy | 60.1 | 65.6 | +5.5 |
| SVAMP | Accuracy | 76.4 | 80.4 | +4.0 |
| AQuA | Accuracy | 45.3 | 50.0 | +4.7 |
| CSQA | Accuracy | 73.5 | 76.2 | +2.7 |
| StrategyQA | Accuracy | 74.8 | 82.1 | +7.3 |
| Comparison of different uncertainty metrics on GSM8K using text-davinci-002. | ||||
| GSM8K | Accuracy | 47.1 | 52.3 | +5.2 |
| Comparison on text-davinci-003 showing generalization across models. | ||||
| GSM8K | Accuracy | 79.1 | 81.0 | +1.9 |
| SVAMP | Accuracy | 83.6 | 85.2 | +1.6 |
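
The Δ column is the reported accuracy minus the baseline accuracy, in percentage points. As a sanity check, the arithmetic can be re-derived with a minimal sketch like the one below. The names `ROWS`, `recompute_delta`, and `mismatches` are illustrative assumptions, not part of the paper; the numbers are copied from the table above, with each row tagged by the model named in its caption row.

```python
# Rows copied from the table: (benchmark, model, baseline, reported, reported delta).
# The tuple layout and all names here are illustrative assumptions.
ROWS = [
    ("GSM8K", "code-davinci-002", 60.1, 65.6, 5.5),
    ("SVAMP", "code-davinci-002", 76.4, 80.4, 4.0),
    ("AQuA", "code-davinci-002", 45.3, 50.0, 4.7),
    ("CSQA", "code-davinci-002", 73.5, 76.2, 2.7),
    ("StrategyQA", "code-davinci-002", 74.8, 82.1, 7.3),
    ("GSM8K", "text-davinci-002", 47.1, 52.3, 5.2),
    ("GSM8K", "text-davinci-003", 79.1, 81.0, 1.9),
    ("SVAMP", "text-davinci-003", 83.6, 85.2, 1.6),
]


def recompute_delta(baseline, reported):
    """Return reported minus baseline, rounded to one decimal place."""
    # Rounding absorbs floating-point noise such as 65.6 - 60.1 = 5.4999...
    return round(reported - baseline, 1)


def mismatches(rows):
    """List (benchmark, model) pairs whose stated delta differs from the recomputed one."""
    return [
        (benchmark, model)
        for benchmark, model, baseline, reported, delta in rows
        if recompute_delta(baseline, reported) != delta
    ]


if __name__ == "__main__":
    for benchmark, model, baseline, reported, delta in ROWS:
        print(f"{benchmark:<11} {model:<17} {recompute_delta(baseline, reported):+.1f}")
    print("mismatches:", mismatches(ROWS))
```

An empty mismatch list means every Δ entry agrees with its Baseline and This Paper columns.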