Evaluation Setup
Pretrained, SFT, and RL-trained models are evaluated on downstream reasoning benchmarks.
Benchmarks:
- Math Competitions (Mathematical Reasoning)
- Scientific QA (Science Reasoning)
- Code (Software Engineering)
- General Reasoning (Broad reasoning tasks)
Metrics:
- Accuracy
- Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Expert-level benchmarks (Average) | Accuracy Gain | 0.0 | 19.0 | +19.0 |
| Reasoning Tasks (Pretraining Phase) | Accuracy Gain | 0.0 | 11.0 | +11.0 |
| Reasoning Tasks (SFT Phase) | Accuracy Gain | 0.0 | 15.0 | +15.0 |
| Downstream Accuracy | Accuracy Gain | 0.0 | 4.0 | +4.0 |
| Mathematical Reasoning | Accuracy Change | 0.0 | -5.0 | -5.0 |
Main Takeaways
- Front-loading reasoning data into pretraining is essential; SFT alone cannot 'catch up' to a model pretrained with reasoning foundations.
- Asymmetric principle: pretraining benefits most from diversity and scale, while SFT benefits most from quality and complexity.
- Naively scaling SFT data ('more is better') is harmful; quality filters are critical in the post-training stage.
- High-quality data in pretraining has a 'latent' effect: its value is fully realized only after alignment (SFT).
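The quality-filtering takeaway can be sketched as a toy curation pass over SFT examples. Note this is a minimal illustration, not the paper's method: the `complexity_score` heuristic (word count plus a crude step count) and the `keep_fraction` threshold are invented here for demonstration.

```python
# Toy sketch of quality-over-quantity filtering for SFT data.
# The scoring heuristic and threshold are illustrative assumptions,
# not the filtering criteria used in the paper.

def complexity_score(example: dict) -> float:
    """Crude proxy for reasoning complexity: favor longer, multi-step responses."""
    response = example["response"]
    steps = response.count("\n") + 1           # rough count of reasoning steps
    return 0.1 * len(response.split()) + steps

def filter_sft_data(examples: list[dict], keep_fraction: float = 0.5) -> list[dict]:
    """Keep only the top `keep_fraction` of examples ranked by complexity score."""
    ranked = sorted(examples, key=complexity_score, reverse=True)
    cutoff = max(1, int(len(ranked) * keep_fraction))
    return ranked[:cutoff]

data = [
    {"response": "42"},  # low quality: bare answer, no reasoning
    {"response": "Step 1: expand the square.\nStep 2: simplify.\nAnswer: 42"},
]
filtered = filter_sft_data(data, keep_fraction=0.5)
print(len(filtered))  # 1 -- only the multi-step example survives
```

In this framing, the opposite policy applies at pretraining time: the same filter would be counterproductive there, since that stage benefits from diversity and scale rather than aggressive pruning.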