| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (points) |
|---|---|---|---|---|
| Main results showing SFT data scaling leads to state-of-the-art performance on LLaMA-2 7B. | ||||
| GSM8K | Accuracy | 68.4 | 82.6 | +14.2 |
| MATH | Accuracy | 19.8 | 40.6 | +20.8 |
| GSM8K | Accuracy | 81.6 | 82.6 | +1.0 |
| GSM8K | Accuracy | 83.5 | 90.6 | +7.1 |
| MATH | Accuracy | 42.5 | 52.8 | +10.3 |
| Pass@N analysis reveals the instability issue in base models. | ||||
| GSM8K | Pass@256 | 48.2 | 97.7 | +49.5 |
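
The Pass@N row can be read with a small sketch. It assumes the common definition, in which a problem counts as solved if at least one of N sampled responses is correct; the table does not state the exact estimator used, so this reading is an assumption. The function name `pass_at_n` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def pass_at_n(correct: Sequence[Sequence[bool]], n: int) -> float:
    """Percentage of problems with at least one correct answer
    among the first n sampled responses (assumed Pass@N definition)."""
    if not correct:
        raise ValueError("need at least one problem")
    # A problem is solved if any of its first n samples is correct.
    solved = sum(any(samples[:n]) for samples in correct)
    return 100.0 * solved / len(correct)


# Hypothetical toy data: 4 problems, 3 sampled responses each.
toy = [
    [False, True, False],
    [False, False, False],
    [True, True, True],
    [False, False, True],
]

print(pass_at_n(toy, 1))  # 25.0
print(pass_at_n(toy, 3))  # 75.0
```

On the toy data, the score rises from 25.0 to 75.0 as N goes from 1 to 3. A large gap between single-sample accuracy and Pass@N is the kind of pattern the table's caption describes as instability: the model can produce a correct answer but does not do so consistently.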