| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ESSAM consistently outperforms standard ES and matches or exceeds RL baselines across various model sizes. | ||||
| GSM8K | Average Accuracy (All Models) | 75.97 | 78.27 | +2.30 |
| GSM8K | Average Accuracy (All Models) | 77.72 | 78.27 | +0.55 |
| GSM8K | Accuracy | PPO: exact figure not reported; ESSAM is described as 'outperforming PPO' | 92.57 | Positive (qualitative) |
| GSM8K | Accuracy | Not reported (implied to be lower than ESSAM) | 78.92 | Positive (qualitative) |
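
The Δ column in the two rows with numeric baselines appears to be the "This Paper" value minus the "Baseline" value. A minimal sketch that checks this arithmetic is given below; the helper name `delta` and the `rows` list are illustrative assumptions, not part of the paper, and the figures are copied from the table above.

```python
from decimal import Decimal


def delta(baseline: str, this_paper: str) -> Decimal:
    """Return this_paper - baseline, computed exactly in decimal arithmetic."""
    return Decimal(this_paper) - Decimal(baseline)


# (baseline, this paper, reported delta) for the two rows with numeric baselines.
# Values are taken from the table; the tuple layout is a hypothetical convenience.
rows = [
    ("75.97", "78.27", "2.30"),
    ("77.72", "78.27", "0.55"),
]

for baseline, ours, reported in rows:
    # Each reported delta should equal the exact difference.
    assert delta(baseline, ours) == Decimal(reported)
```

Using `Decimal` rather than `float` avoids binary rounding noise when comparing two-decimal accuracy figures.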