| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---:|---:|---:|
| CommonsenseQA (Dev Set) | Accuracy | 60.0 | 72.5 | +12.5 |
| CommonsenseQA (Dev Set) | Accuracy | 36.6 | 72.5 | +35.9 |
| CommonsenseQA (Dev Set) | Accuracy | 73.0 | 72.5 | -0.5 |
| GSM8K | Accuracy | 5.8 | 10.7 | +4.9 |
| CommonsenseQA | Accuracy | 68.8 | 72.5 | +3.7 |
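
The Δ column appears to be the absolute difference in accuracy points (This Paper minus Baseline), which is consistent with every row above. As a minimal sketch for rechecking that arithmetic, the snippet below recomputes Δ from the table values. The `rows` list and the `delta` helper are hypothetical names introduced for illustration, and `Decimal` is used only to avoid floating-point rounding noise.

```python
from decimal import Decimal

# (benchmark, baseline, this paper) copied from the table above.
rows = [
    ("CommonsenseQA (Dev Set)", "60.0", "72.5"),
    ("CommonsenseQA (Dev Set)", "36.6", "72.5"),
    ("CommonsenseQA (Dev Set)", "73.0", "72.5"),
    ("GSM8K", "5.8", "10.7"),
    ("CommonsenseQA", "68.8", "72.5"),
]


def delta(baseline: str, ours: str) -> Decimal:
    """Absolute difference in accuracy points (assumed definition of Δ)."""
    return Decimal(ours) - Decimal(baseline)


deltas = [delta(baseline, ours) for _, baseline, ours in rows]

for (name, _, _), d in zip(rows, deltas):
    # The "+" format flag prints an explicit sign, matching the Δ column.
    print(f"{name}: {d:+}")
```

A negative value, as in the third row, means the baseline scores higher than the reported result.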