Evaluation Setup
Zero-shot Chain-of-Thought reasoning on math and science datasets
Benchmarks:
- GSM8K (Grade school math word problems)
- MATH (Challenging mathematics problems)
- ARC-Challenge (Science question answering)
Metrics:
- Accuracy (Exact Match)
- Majority Voting Accuracy (@32 samples)
- Statistical methodology: Not explicitly reported in the paper
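The Maj@32 metric can be sketched as: sample 32 chain-of-thought answers per problem, take the most frequent final answer as the prediction, and score it by exact match against the gold answer. A minimal sketch (function and variable names are illustrative, not from the paper):

```python
from collections import Counter

def majority_vote_accuracy(sampled_answers, gold_answers):
    """Maj@k accuracy: for each problem, the most frequent of the k
    sampled final answers is compared to the gold answer (exact match)."""
    correct = 0
    for samples, gold in zip(sampled_answers, gold_answers):
        # The most common final answer across the k samples wins the vote.
        majority, _ = Counter(samples).most_common(1)[0]
        correct += (majority == gold)
    return correct / len(gold_answers)

# Toy example: two problems, k = 3 sampled answers each.
preds = [["7", "7", "8"], ["12", "13", "13"]]
gold = ["7", "12"]
print(majority_vote_accuracy(preds, gold))  # → 0.5
```

Note that with greedy decoding (k = 1) this reduces to plain exact-match accuracy, which is the other metric reported above.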
Key Results
The main comparison on GSM8K shows Iterative RPO significantly outperforming baselines, including Zero-shot CoT, SFT, and standard DPO.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GSM8K | Accuracy (Greedy) | 55.6 | 81.6 | +26.0 |
| GSM8K | Accuracy (Greedy) | 63.5 | 81.6 | +18.1 |
| GSM8K | Accuracy (Greedy) | 61.8 | 73.1 | +11.3 |
| GSM8K | Accuracy (Maj@32) | 70.7 | 88.7 | +18.0 |
| ARC-Challenge | Accuracy | 77.8 | 86.7 | +8.9 |
| MATH | Accuracy (Greedy) | 12.5 | 20.8 | +8.3 |
Ablation studies demonstrate the critical role of the NLL loss term in the training objective.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GSM8K | Accuracy (Greedy) | 61.8 | 73.1 | +11.3 |
|
Main Takeaways
- Iterative training is effective: Performance improves consistently across iterations (e.g., GSM8K: 73.1% -> 78.0% -> 81.1% -> 81.6%), though gains saturate.
- Negative examples matter: Preference optimization, which learns from rejected (losing) responses, outperforms SFT-style (STaR-like) approaches that train only on positive examples.
- NLL loss is crucial: Adding a Negative Log-Likelihood term to the DPO objective keeps the likelihood of correct (winning) reasoning chains from degrading during preference training, which is essential for reasoning tasks.
- Data quantity vs. iteration count: Two training iterations outperform a single iteration on twice the data, suggesting that updating the model between rounds is key to generating better training signal.
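The combined objective behind the NLL takeaway can be sketched numerically for a single preference pair: the standard DPO term on the winner/loser log-ratio margin, plus an NLL term on the winning reasoning chain. This is a schematic, pure-Python sketch; `alpha`, `beta`, and the length normalization are illustrative hyperparameters, not the paper's exact formulation:

```python
import math

def dpo_nll_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 beta=0.1, alpha=1.0, len_w=1):
    """Schematic DPO + NLL loss for one preference pair.

    logp_w / logp_l: summed log-probs of the winning / losing response
    under the policy; ref_logp_*: the same under the frozen reference.
    """
    # Standard DPO term: -log sigmoid of the scaled log-ratio margin.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo_term = -math.log(1.0 / (1.0 + math.exp(-margin)))
    # NLL term on the winning chain keeps its likelihood from collapsing,
    # which plain DPO does not guarantee.
    nll_term = -logp_w / len_w
    return dpo_term + alpha * nll_term

# Loss is lower when the policy assigns the winner a higher log-prob
# than the loser (relative to the reference) and keeps logp_w high.
good = dpo_nll_loss(logp_w=-5.0, logp_l=-20.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
bad = dpo_nll_loss(logp_w=-20.0, logp_l=-5.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
print(good < bad)  # → True
```

The NLL term directly pulls `logp_w` up, which is the mechanism the ablation row (61.8 vs. 73.1 on GSM8K) attributes the gain to.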