Evaluation Setup
Task: pairwise preference prediction on standard reward model benchmarks.
Benchmarks:
- RewardBench (General Chat, Coding, Math, Safety)
- RM-Bench (reasoning-intensive: Math, Code)
- RMB (General Preference)
Metrics:
- Accuracy (Macro Average across subsets)
- Statistical methodology: Not explicitly reported in the paper
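To make the metric concrete, here is a minimal sketch of pairwise preference accuracy with a macro average across benchmark subsets. The subset names and reward scores below are hypothetical; a real evaluation would score actual (prompt, chosen, rejected) triples with the reward model.

```python
def pairwise_accuracy(pairs):
    """Fraction of pairs where the chosen response outscores the rejected one."""
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)

def macro_average(subset_pairs):
    """Unweighted mean of per-subset accuracies: each subset counts equally,
    regardless of how many pairs it contains."""
    accs = [pairwise_accuracy(p) for p in subset_pairs.values()]
    return sum(accs) / len(accs)

# Hypothetical reward scores: (score_chosen, score_rejected) per pair.
subsets = {
    "chat": [(0.9, 0.2), (0.8, 0.6), (0.3, 0.7)],  # 2/3 correct
    "math": [(0.5, 0.1), (0.6, 0.9)],              # 1/2 correct
}
print(round(macro_average(subsets), 3))  # -> 0.583
```

Note the macro average (0.583 here) differs from the pooled micro average (3/5 = 0.6): macro averaging prevents large subsets from dominating the headline number.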
Key Results
RM-R1 models outperform both open-weight and proprietary baselines on average across the three benchmarks (the two Average rows compare against the strongest baseline from each group):

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average (RewardBench, RM-Bench, RMB) | Accuracy | 86.1 | 88.6 | +2.5 |
| Average (RewardBench, RM-Bench, RMB) | Accuracy | 83.7 | 88.6 | +4.9 |
| RM-Bench (Math subset) | Accuracy | 73.0 | 91.8 | +18.8 |
| RM-Bench (Code subset) | Accuracy | 63.0 | 74.1 | +11.1 |
Ablation studies show that the full RM-R1 recipe (Distillation + RL + Rubrics + QC) performs best:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| RewardBench | Accuracy | 88.6 | 90.7 | +2.1 |
| RM-Bench | Accuracy | 59.2 | 72.0 | +12.8 |
Main Takeaways
- Scaling inference compute (the reasoning token budget) improves reward model performance roughly linearly, mirroring the behavior of reasoning models.
- Larger models yield greater performance gains from the reasoning-based training pipeline, supporting a scaling law for ReasRMs.
- Reasoning-based training (distillation + RL) consistently outperforms answer-only SFT, even when controlling for data size.
- The Chain-of-Rubrics mechanism is crucial for bridging the gap between general chat evaluation and rigorous reasoning tasks.