| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Single-round DPO experiments on Qwen2.5-7B show that even coarse filtering with a PRM or outcome labels significantly improves performance over the base model. | | | | |
| MATH500 | Pass@1 | 66.8 | 72.8 | +6.0 |
| Iterative DPO-VP results compared to RL baselines on challenging math benchmarks. | | | | |
| Average (5 Hard Benchmarks) | Pass@1 | 48.8 | 48.2 | -0.6 |
| Average (5 Hard Benchmarks) | Pass@1 | 47.7 | 48.2 | +0.5 |
| Average (5 Hard Benchmarks) | Pass@1 | 47.0 | 48.2 | +1.2 |