| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (points) |
|---|---|---|---|---|
| Applying MM-PRM to the base policy model (MM-Policy) consistently improves accuracy across all benchmarks. | ||||
| MM-K12 | Accuracy | 33.92 | 42.80 | +8.88 |
| OlympiadBench | Accuracy | 15.41 | 24.00 | +8.59 |
| MathVista | Accuracy | 62.93 | 67.60 | +4.67 |
| MM-PRM generalizes to other model sizes (InternVL2.5 series) not used in PRM training. | ||||
| MM-K12 | Accuracy | 27.01 | 37.80 | +10.79 |
| OlympiadBench | Accuracy | 30.98 | 34.67 | +3.69 |
| Ablation on the labeling strategy: soft labels (This Paper column) outperform hard binary thresholds (Baseline column). | ||||
| MM-K12 | Accuracy | 37.0 | 42.8 | +5.8 |
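
The Δ column is the absolute difference between the two accuracy columns. As a minimal sketch, the arithmetic can be rechecked in Python. The rows are copied from the table. The group labels, the names `ROWS`, `delta`, and `relative_gain`, and the relative-gain figure itself are illustrative assumptions and are not reported in the source.

```python
# Rows copied from the table: (group, benchmark, baseline, this_paper, reported_delta).
# Group labels are shorthand introduced for this sketch, not names from the source.
ROWS = [
    ("MM-Policy", "MM-K12", 33.92, 42.80, 8.88),
    ("MM-Policy", "OlympiadBench", 15.41, 24.00, 8.59),
    ("MM-Policy", "MathVista", 62.93, 67.60, 4.67),
    ("InternVL2.5", "MM-K12", 27.01, 37.80, 10.79),
    ("InternVL2.5", "OlympiadBench", 30.98, 34.67, 3.69),
    ("Soft vs. hard", "MM-K12", 37.0, 42.8, 5.8),
]


def delta(baseline, improved):
    """Absolute gain in accuracy points, rounded to two decimals."""
    return round(improved - baseline, 2)


def relative_gain(baseline, improved):
    """Gain as a percentage of the baseline score (derived, not in the table)."""
    return 100.0 * (improved - baseline) / baseline


for group, bench, base, ours, reported in ROWS:
    d = delta(base, ours)
    # Each recomputed difference should match the reported Δ up to rounding.
    assert abs(d - reported) < 0.005, (group, bench, d, reported)
    print(f"{group:14s} {bench:14s} +{d:.2f} points "
          f"({relative_gain(base, ours):.1f}% relative)")
```

Reporting both numbers can be useful because the same absolute gain means more on a low baseline: the two largest gains in the table are similar in points but start from quite different baseline accuracies.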