Evaluation Setup
Distilling reasoning capabilities for arithmetic problems
Benchmarks:
- GSM8K (Arithmetic Reasoning)
Metrics:
- Accuracy
- Output Length (Tokens)
- Statistical methodology: Not explicitly reported in the paper
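As a rough illustration of the two metrics above, the sketch below computes exact-match accuracy and mean output length. It is not the paper's evaluation code. The function names are hypothetical, answers are assumed to be pre-extracted final-answer strings, and whitespace splitting stands in for whatever tokenizer the paper uses. The `relative_change` helper shows one way a percentage delta could be computed; the summary does not say whether the reported deltas are relative changes or absolute percentage points.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions whose final answer exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


def mean_output_length(outputs: Sequence[str]) -> float:
    """Mean output length in tokens.

    Assumption: whitespace splitting approximates tokenization; a real
    evaluation would use the model's own tokenizer.
    """
    if not outputs:
        raise ValueError("cannot compute mean length of an empty set")
    return sum(len(o.split()) for o in outputs) / len(outputs)


def relative_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return 100.0 * (new - old) / old
```

For example, `accuracy(["18", "5"], ["18", "7"])` scores one of two answers as correct, and `relative_change(75.0, 100.0)` reports a 25% reduction.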
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GSM8K | Accuracy Improvement | Not explicitly reported in the paper | Not explicitly reported in the paper | +11.29% |
| GSM8K | Length Reduction | Not explicitly reported in the paper | Not explicitly reported in the paper | -27.4% |
Main Takeaways
- Direct supervised fine-tuning (SFT) on verbose chain-of-thought (CoT) traces harms small models because of a capacity mismatch.
- Structural understanding (Stage 1) is a prerequisite for effective compression.
- Teacher-guided rewriting (Stage 3) effectively recovers performance on hard cases where the student initially fails.
- Hierarchical rewards prevent reward hacking, in which models generate short but incorrect answers.
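The last takeaway can be illustrated with a minimal sketch of a correctness-gated reward. The paper's actual reward formulation is not given in this summary, so everything here is an assumption: the function name, the `max_length_bonus` parameter, and the linear length bonus are hypothetical. The only idea it shows is that a length bonus is granted only after correctness is established, so a short wrong answer can never outscore a long correct one.

```python
def hierarchical_reward(
    is_correct: bool,
    num_tokens: int,
    reference_tokens: int,
    max_length_bonus: float = 0.5,
) -> float:
    """Hypothetical two-level reward: correctness first, brevity second.

    Level 1: an incorrect answer earns 0.0, regardless of length.
    Level 2: a correct answer earns 1.0 plus a bonus proportional to the
             fraction of tokens saved relative to a reference length.
    """
    if reference_tokens <= 0:
        raise ValueError("reference_tokens must be positive")
    if not is_correct:
        # Gate: no brevity credit without correctness.
        return 0.0
    # Fraction of tokens saved, clipped at zero so longer outputs get no bonus.
    saved = max(0.0, 1.0 - num_tokens / reference_tokens)
    return 1.0 + max_length_bonus * saved
```

Under these assumptions, a correct answer that uses half the reference length scores 1.25, while an incorrect answer of any length scores 0.0.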