Evaluation Setup
Mathematical reasoning tasks evaluated with Chain-of-Thought prompting
Benchmarks:
- MATH (Challenging math problems)
- GSM8K (Grade school math word problems)
Metrics:
- Accuracy
- Statistical methodology (significance tests, error bars): not explicitly reported in the paper
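The accuracy metric on MATH and GSM8K is typically final-answer exact match: a problem counts as solved only if the extracted final answer equals the reference answer. A minimal sketch (the function name and answer-extraction step are illustrative assumptions, not from the paper):

```python
def exact_match_accuracy(predictions, references):
    """Fraction of problems whose predicted final answer exactly
    matches the reference answer (assumes answers are pre-extracted
    and normalized, e.g. "42" rather than a full CoT trace)."""
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```

For example, `exact_match_accuracy(["42", "7", "3.5"], ["42", "8", "3.5"])` scores two of three answers correct.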
Key Results
Main results on MATH and GSM8K showing Step-DPO improvements over base models and vanilla DPO.

| Benchmark | Metric   | Baseline | This Paper | Δ     |
|-----------|----------|----------|------------|-------|
| MATH      | Accuracy | 67.9     | 70.8       | +2.9  |
| GSM8K     | Accuracy | 91.1     | 94.0       | +2.9  |
| MATH      | Accuracy | 47.2     | 58.6       | +11.4 |
| MATH      | Accuracy | 52.8     | 56.0       | +3.2  |
Ablation study on data source distribution (In-Distribution vs. Out-of-Distribution).

| Benchmark | Metric   | Baseline | This Paper | Δ    |
|-----------|----------|----------|------------|------|
| MATH      | Accuracy | 50.1     | 53.0       | +2.9 |
| MATH      | Accuracy | 50.8     | 53.0       | +2.2 |
Main Takeaways
- Step-DPO consistently outperforms vanilla DPO across multiple model sizes (7B to 72B) and families (Qwen, DeepSeek), often in settings where vanilla DPO degrades performance or yields no improvement.
- Data efficiency is high: significant gains are achieved with only 10K examples and fewer than 500 training steps.
- In-distribution data (self-generated corrections) is crucial; training on human- or GPT-4-corrected steps is less effective because the model struggles to learn from out-of-distribution (OOD) data within the DPO framework.
- Step-DPO helps the model maintain a larger reward margin between correct and incorrect steps compared to vanilla DPO, indicating better discrimination ability.
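The reward margin in the last takeaway is the implicit DPO reward gap between the preferred and dispreferred continuation; in Step-DPO the pair is a correct vs. incorrect reasoning step sharing the same prefix rather than two full responses. A minimal sketch of the per-pair loss and margin (standalone math on scalar log-probabilities; the function name and `beta` value are illustrative assumptions, not the paper's implementation):

```python
import math

def step_dpo_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss for one step-level preference pair:
    -log sigmoid(beta * margin), where the margin measures how much
    more the policy prefers the correct step over the incorrect one,
    relative to the frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Logistic loss: shrinks toward 0 as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a zero margin the loss is ln 2; as the policy widens the gap between correct and incorrect steps (the better discrimination noted above), the loss decreases, e.g. `step_dpo_loss(-2.0, -9.0, -5.0, -5.0)` is well below `step_dpo_loss(-5.0, -5.0, -5.0, -5.0)`.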