Evaluation Setup
Binary classification on (chosen, rejected) response pairs across four domains (Chat, Code, Math, Safety) and three style variations.
Benchmarks:
- RM-Bench (Reward Model Evaluation) [New]
Metrics:
- Easy Accuracy (substance correct + style favorable)
- Normal Accuracy (substance correct + style neutral)
- Hard Accuracy (substance correct + style unfavorable)
- Average Accuracy
- Statistical methodology: Not explicitly reported in the paper
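The three accuracy levels above can be sketched as pairwise comparisons of reward scores under different style pairings. This is a minimal illustration, not the benchmark's exact definition: the style labels (`plain`, `markdown`) and the specific pairing scheme are assumptions for the example.

```python
def accuracies(scores):
    """Compute Easy/Normal/Hard/Average accuracy (in percent).

    scores: list of dicts, one per prompt, each mapping
    ('chosen'|'rejected', style) -> reward score, with assumed
    style labels 'plain' (unfavorable) and 'markdown' (favorable).
    """
    buckets = {"easy": [], "normal": [], "hard": []}
    for s in scores:
        # Easy: the substantively correct answer also has the favorable style.
        buckets["easy"].append(s[("chosen", "markdown")] > s[("rejected", "plain")])
        # Normal: both responses share the same (neutral) style.
        buckets["normal"].append(s[("chosen", "plain")] > s[("rejected", "plain")])
        # Hard: the incorrect answer gets the favorable style.
        buckets["hard"].append(s[("chosen", "plain")] > s[("rejected", "markdown")])
    accs = {k: 100.0 * sum(v) / len(v) for k, v in buckets.items()}
    accs["average"] = sum(accs[k] for k in ("easy", "normal", "hard")) / 3
    return accs
```

A reward model biased toward style will pass the easy bucket and fail the hard one, which is exactly the gap the benchmark measures.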
Key Results
Evaluation of state-of-the-art reward models on RM-Bench shows that even the largest models struggle, in some settings scoring below random guessing:

| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|---|
| RM-Bench | Average Accuracy | 50.0 | 69.5 | +19.5 |
| RM-Bench | Average Accuracy | 50.0 | 46.6 | -3.4 |

Comparison between DPO models and traditional sequence-classification reward models suggests DPO handles the task better:

| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|---|
| RM-Bench | Average Accuracy | Not reported in the paper | Not reported in the paper | Not reported in the paper |
Main Takeaways
- Substantial room for improvement: Top models like Nemotron-340B only reach ~69.5% accuracy, far below ideal performance.
- Style bias is severe: models often fail (scoring below random guessing) when the factually incorrect response has a "better" style (e.g., longer or formatted with Markdown).
- DPO superiority: DPO-trained models generally outperform standard sequence-classification reward models on RM-Bench.
- Correlation with Policy: RM-Bench scores correlate strongly with the actual performance of policy models trained using these reward models (verified via PPO fine-tuning experiments).
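The correlation claim above can be checked with a rank correlation between benchmark scores and downstream policy quality. A minimal sketch, using a hand-rolled Spearman coefficient (no ties assumed); the data values in the usage example are made up for illustration, not the paper's numbers.

```python
def spearman(xs, ys):
    """Spearman rank correlation between two equal-length score lists,
    assuming no tied values."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    # Standard closed form: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A coefficient near 1.0 across a set of reward models would indicate that ranking models by RM-Bench accuracy also ranks them by the quality of policies trained against them.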