Evaluation Setup
Standard RLHF pipeline: SFT -> Reward Modeling -> PPO. Evaluated on held-out preference data and via 'Gold RM' scoring.
Benchmarks:
- AlpacaFarm (Instruction following / Open-ended QA)
- Anthropic HH (Helpful & Harmless) (Preference classification)
- WebGPT Comparisons (Preference classification)
Metrics:
- Reward Modeling Accuracy (%)
- Win Rate vs. SFT baseline (judged by GPT-4)
- Gold RM Score vs. Proxy RM Score (to measure reward hacking)
- Statistical methodology: Not explicitly reported in the paper
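The three metrics above are straightforward to compute; a minimal sketch follows. The function names (`rm_accuracy`, `win_rate`, `gold_proxy_gap`) and the tie-counting convention are illustrative assumptions, not taken from the paper.

```python
def rm_accuracy(chosen_scores, rejected_scores):
    """Pairwise reward-model accuracy: percent of preference pairs
    where the RM scores the human-chosen response above the rejected one."""
    correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return 100.0 * correct / len(chosen_scores)

def win_rate(judgments):
    """Win rate vs. the SFT baseline, from per-sample judge verdicts:
    +1 = policy wins, 0 = tie, -1 = baseline wins.
    Ties count as half a win (a common convention; assumed here)."""
    wins = sum(1.0 if j == 1 else 0.5 if j == 0 else 0.0 for j in judgments)
    return 100.0 * wins / len(judgments)

def gold_proxy_gap(gold_scores, proxy_scores):
    """Reward-hacking signal: a proxy RM score that keeps rising while
    the gold RM score stalls or drops indicates over-optimization."""
    return sum(proxy_scores) / len(proxy_scores) - sum(gold_scores) / len(gold_scores)

# Example: 3 of 4 pairs ranked correctly -> 75.0% accuracy
print(rm_accuracy([2.0, 1.5, 0.3, 1.1], [1.0, 1.9, 0.1, 0.4]))  # 75.0
```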
Key Results
Reward modeling accuracy on standard benchmarks shows UMM-RM consistently outperforming dense baselines across different model sizes.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Anthropic HH-Helpful | Accuracy (%) | 65.6 | 67.2 | +1.6 |
| WebGPT Comparisons | Accuracy (%) | 58.6 | 60.8 | +2.2 |

Win-rate evaluation on AlpacaFarm shows that PPO training with UMM-RM produces better policies than dense or ensemble RMs.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| AlpacaFarm | Win Rate vs SFT (%) | 51.5 | 60.5 | +9.0 |
| AlpacaFarm | Win Rate vs SFT (%) | 57.5 | 60.5 | +3.0 |
Main Takeaways
- Increasing the number of activated experts (from 2 to 6) consistently improves robustness and win rates.
- The shared expert coefficient is critical; a balanced weight (0.5) works best, while too high (0.9) degrades performance to near-dense levels.
- Unmerged MoE models alone do not reliably suppress reward hacking; the merging step is crucial for smoothing the reward surface.
- UMM-RM achieves comparable or better alignment performance than expensive ensembles while maintaining the inference cost of a single dense model.
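One plausible reading of the merging step described above, sketched under assumptions: experts are collapsed into a single dense layer by weight averaging, with the shared-expert coefficient controlling the mix. The merge rule and the name `merge_experts` are hypothetical; the summary only states that 0.5 works best and 0.9 degrades toward dense behavior.

```python
def merge_experts(shared, experts, alpha=0.5):
    """Collapse a MoE FFN layer into one dense FFN by weight averaging.

    `alpha` plays the role of the shared-expert coefficient: alpha=0.5
    balances the shared expert against the routed experts, while alpha
    near 1 recovers an almost-dense model (consistent with the takeaway
    that 0.9 degrades performance to near-dense levels).
    Weights are plain lists of floats here, purely for illustration.
    """
    k = len(experts)
    # Element-wise average of the routed experts' weights.
    routed_avg = [sum(col) / k for col in zip(*experts)]
    # Convex combination of shared and routed-average weights.
    return [alpha * s + (1 - alpha) * r for s, r in zip(shared, routed_avg)]

# alpha=0.5: equal mix of the shared expert and the routed-expert average.
print(merge_experts([1.0, 2.0], [[3.0, 4.0], [5.0, 6.0]]))  # [2.5, 3.5]
```

Note that at `alpha=1.0` the routed experts vanish entirely, which matches the observation that too large a shared-expert weight collapses the model back to dense-like behavior.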