Evaluation Setup
Task: ranking accuracy on RewardBench and on reWordBench (transformed versions of the same inputs)
Benchmarks:
- RewardBench: standard RM evaluation (Chat, Chat Hard, Safety, Reasoning)
- reWordBench: robustness evaluation (28 transformations) [New]
Metrics:
- Ranking Accuracy Drop (%)
- Win Rate against SFT/Standard-RM (for alignment)
- Statistical methodology: Not explicitly reported in the paper
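The primary metric above can be made concrete with a toy sketch (not the paper's code; function names are illustrative): ranking accuracy is the fraction of pairs where the reward model scores the chosen response above the rejected one, and the accuracy drop is the difference between accuracy on original and transformed inputs.

```python
# Toy sketch of the evaluation metrics (illustrative names, not from the paper).

def ranking_accuracy(chosen_scores, rejected_scores):
    """Percentage of pairs where the chosen response outscores the rejected one."""
    correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return 100.0 * correct / len(chosen_scores)

def accuracy_drop(original_acc, transformed_acc):
    """Accuracy drop (%) when moving from original to transformed inputs."""
    return original_acc - transformed_acc
```

For example, a model at 70.0% on RewardBench that falls to 61.3% on a reWordBench transformation has an accuracy drop of 8.7 points.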
Key Results
*Robustness evaluation on reWordBench: accuracy drop (%) under Paraphrase transformations (lower is better).*

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| reWordBench (Chat Hard) | Accuracy Drop (Paraphrase) | 16.6 | 8.7 | -7.9 |
| reWordBench (Reasoning) | Accuracy Drop (Paraphrase) | 20.7 | 15.8 | -4.9 |

*Generalization of robustness to non-paraphrase transformations (e.g., typos, formatting).*

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| reWordBench (Chat Hard) | Accuracy Drop (Other Transf.) | 6.6 | 6.4 | -0.2 |
| reWordBench (Safety) | Accuracy Drop (Other Transf.) | 11.8 | 3.9 | -7.9 |

*Downstream alignment utility (Best-of-N), evaluated by a Llama-3-70B-Instruct judge.*

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| RewardBench Prompts (Best-of-64) | Win Rate vs Standard RM | 50.0 | 59.0 | +9.0 |
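The Best-of-N setup used for the win-rate comparison can be sketched minimally (an assumed interface, not the paper's implementation): sample N candidate responses, score each with the reward model, and keep the highest-scoring one.

```python
# Minimal Best-of-N sketch (hypothetical interface): `generate(prompt)` returns
# one sampled response and `reward(prompt, response)` returns a scalar score.

def best_of_n(prompt, generate, reward, n=64):
    """Sample n candidates and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward(prompt, response))
```

A more robust reward model changes which candidate wins this argmax, which is how robustness gains translate into the +9.0 win rate above.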
Main Takeaways
- SOTA Reward Models are highly brittle, with accuracy often dropping to random or worse under simple transformations like formatting changes or typos
- Regularizing score consistency on paraphrases is a highly effective strategy: it improves robustness not only to paraphrases but also generalizes to other transformation types (e.g., safety-targeted attacks)
- Robust RMs are better RMs: The improvements in robustness translate directly to better downstream alignment performance, producing higher-quality generations in Best-of-N and RAFT settings
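The regularization idea in the second takeaway can be sketched conceptually (hypothetical form; the paper's exact loss may differ): alongside the standard Bradley-Terry pairwise loss, penalize the squared difference between the scores a reward model assigns to an input and to its paraphrase.

```python
import math

def logsigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def consistency_regularized_loss(r_chosen, r_rejected,
                                 r_chosen_para, r_rejected_para, lam=1.0):
    """Per-example sketch: pairwise Bradley-Terry term plus a
    paraphrase score-consistency penalty weighted by lam."""
    bt = -logsigmoid(r_chosen - r_rejected)
    consistency = (r_chosen - r_chosen_para) ** 2 + (r_rejected - r_rejected_para) ** 2
    return bt + lam * consistency
```

When the model scores paraphrases identically the penalty vanishes and only the standard pairwise loss remains; inconsistent scores are penalized quadratically.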