ECHO consistently outperforms baselines across arithmetic, commonsense, and symbolic reasoning tasks using GPT-3.5-Turbo.

| Benchmark | Metric | Baseline (%) | ECHO, this paper (%) | Δ (points) |
|---|---|---|---|---|
| GSM8K | Accuracy | 81.6 | 83.3 | +1.7 |
| SVAMP | Accuracy | 79.8 | 82.5 | +2.7 |
| Coin Flip | Accuracy | 86.8 | 98.8 | +12.0 |
| StrategyQA | Accuracy | 63.3 | 69.1 | +5.8 |
| CommonsenseQA | Accuracy | 73.8 | 75.3 | +1.5 |
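
The Δ column is the paper's accuracy minus the baseline accuracy for each benchmark. The following is a minimal sketch that recomputes it from the table values as a sanity check. The names `results` and `deltas` are hypothetical, and the unweighted mean gain is an illustrative summary that the source does not report.

```python
# Accuracy values copied from the table: (baseline, this paper).
results = {
    "GSM8K": (81.6, 83.3),
    "SVAMP": (79.8, 82.5),
    "Coin Flip": (86.8, 98.8),
    "StrategyQA": (63.3, 69.1),
    "CommonsenseQA": (73.8, 75.3),
}


def deltas(rows):
    """Return the gain over the baseline per benchmark, rounded to one decimal."""
    return {name: round(ours - base, 1) for name, (base, ours) in rows.items()}


gains = deltas(results)

# Unweighted mean over the five benchmarks (illustrative, not from the source).
mean_gain = round(sum(gains.values()) / len(gains), 2)

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}")
print(f"mean gain: {mean_gain:+.2f}")
```

Rounding to one decimal keeps binary floating-point noise from hiding agreement with the printed Δ values.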