Comparison of leading models between RewardBench v1 and RewardBench 2 shows a significant drop in performance, highlighting the increased difficulty of the new benchmark.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| RewardBench 2 | Average score drop (v1 to v2) | Not reported | Not reported | -20.0 (approx.) |
| RewardBench 2 (Math subset) | Accuracy | 25.0 | 70.0 | +45.0 |
| RewardBench 2 (Precise IF subset) | Accuracy | 25.0 | 40.0 | +15.0 |