| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Performance on GSM8K using Llama-3-8B-Instruct: SEAG outperforms the baselines in both accuracy and efficiency. | ||||
| GSM8K | Accuracy | 0.825 | 0.860 | +0.035 |
| GSM8K | Number of inferences | 128.40 | 41.69 | -86.71 |
| GSM8K | Accuracy | 0.785 | 0.860 | +0.075 |
| Performance on ARC using Llama-3-8B-Instruct, demonstrating generalization to commonsense reasoning. | ||||
| ARC | Accuracy | 0.812 | 0.848 | +0.036 |
| Ablation results on GSM8K (Llama-2-13B), showing the impact of individual components. | ||||
| GSM8K | Accuracy | 0.403 | 0.435 | +0.032 |
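
The Δ column is the "This Paper" value minus the "Baseline" value. As a minimal sketch, the snippet below recomputes each Δ from the numbers in the table and also expresses the change relative to the baseline, which is often easier to read for the inference count. The names `ROWS`, `delta`, and `relative_change` are illustrative helpers introduced here; they do not come from the paper.

```python
# Rows copied from the table: (benchmark, metric, baseline, this_paper, reported_delta).
ROWS = [
    ("GSM8K", "Accuracy", 0.825, 0.860, 0.035),
    ("GSM8K", "Number of inferences", 128.40, 41.69, -86.71),
    ("GSM8K", "Accuracy", 0.785, 0.860, 0.075),
    ("ARC", "Accuracy", 0.812, 0.848, 0.036),
    ("GSM8K", "Accuracy", 0.403, 0.435, 0.032),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute change, rounded to the table's precision (3 decimals)."""
    return round(ours - baseline, 3)


def relative_change(baseline: float, ours: float) -> float:
    """Change as a fraction of the baseline (negative means a reduction)."""
    return (ours - baseline) / baseline


for benchmark, metric, baseline, ours, reported in ROWS:
    computed = delta(baseline, ours)
    # Compare with a small tolerance to sidestep floating-point noise.
    status = "ok" if abs(computed - reported) < 1e-6 else "mismatch"
    print(
        f"{benchmark:6s} {metric:22s} "
        f"delta={computed:+.3f} ({relative_change(baseline, ours):+.1%}) [{status}]"
    )
```

Read this way, the drop from 128.40 to 41.69 inferences on GSM8K is a reduction of roughly two thirds relative to the baseline, while the accuracy rows are gains of a few percentage points.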