Evaluation Setup
Task: Pairwise preference prediction across diverse domains (Chat, Reasoning, Safety, Code)
Benchmarks:
- RewardBench (general reward modeling: Chat, Safety, Reasoning)
- RM-Bench (Reasoning-heavy preference evaluation)
- RMB (Comprehensive reward modeling benchmark)
Metrics:
- Accuracy (Preference Prediction)
- Statistical methodology: Not explicitly reported in the paper
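Accuracy here is simply the fraction of preference pairs on which the reward model scores the chosen (preferred) response above the rejected one. A minimal sketch (function name and example scores are illustrative, not from the paper):

```python
def preference_accuracy(scores_chosen, scores_rejected):
    """Fraction of pairs where the RM ranks the preferred response higher."""
    assert len(scores_chosen) == len(scores_rejected)
    correct = sum(c > r for c, r in zip(scores_chosen, scores_rejected))
    return correct / len(scores_chosen)

# Example: RM scores for 4 (chosen, rejected) pairs
chosen = [2.1, 0.7, 1.5, 3.0]
rejected = [1.0, 1.2, 0.4, 2.2]
print(preference_accuracy(chosen, rejected))  # 0.75
```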
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| RM-Bench | Accuracy | 83.9 | 85.9 | +2.0 |
| RM-Bench | Accuracy | 73.2 | 85.9 | +12.7 |
| RMB | Accuracy | 70.5 | 74.7 | +4.2 |
| RewardBench | Accuracy | 93.1 | 94.0 | +0.9 |

Notes:
- RM-Bench (reasoning-heavy benchmark): BR-RM achieves SOTA, surpassing both large scalar models and reasoning models.
- RMB (comprehensive benchmark): BR-RM shows strong generalization, ranking top-2 overall.
- RewardBench: while some scalar RMs saturate this benchmark, BR-RM remains highly competitive and balanced.
Main Takeaways
- Scalar RMs perform well on general chat (RewardBench) but degrade sharply on reasoning-heavy evaluations (RM-Bench, RMB), suggesting they rely on surface heuristics rather than genuine assessment.
- Reasoning capability scales with size: existing ReasonRMs (like RM-R1) only show gains at 32B+, whereas BR-RM achieves SOTA performance at 14B and strong performance at 8B.
- The Branch-and-Rethink strategy allows smaller models (8B) to outperform significantly larger models (GPT-4o, 70B) by allocating compute more efficiently to critical errors.
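One plausible way to picture the two-turn allocation described above is a judge that first flags the decision-critical aspects of a pair, then re-examines only those aspects before deciding. This is a hypothetical sketch (the `llm` callable, prompts, and control flow are assumptions for illustration, not the paper's actual implementation):

```python
def branch_and_rethink(llm, prompt, resp_a, resp_b):
    """Hypothetical two-turn judge: branch to critical aspects, then rethink.

    `llm` is a stand-in for any chat-model call (str -> str); not a real API.
    """
    # Turn 1 (branch): flag the few dimensions most likely to contain
    # critical errors for this specific pair.
    branches = llm(
        f"List the 2-3 most decision-critical aspects for judging these "
        f"responses.\nPrompt: {prompt}\nA: {resp_a}\nB: {resp_b}"
    )
    # Turn 2 (rethink): re-examine both responses only along those
    # aspects, then emit a final preference.
    verdict = llm(
        f"Re-examine A and B only along these aspects: {branches}\n"
        f"Answer 'A' or 'B' for the better response."
    )
    return verdict.strip()
```

The point of the sketch is the compute allocation: the second pass spends its budget on the flagged aspects instead of a uniform full critique, which is how a smaller model could plausibly match larger general-purpose judges.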