
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li
Fudan University, Tsinghua University, Hong Kong University of Science and Technology
International Conference on Learning Representations (2025)
Benchmark RL Factuality

📝 Paper Summary

Reward Model Evaluation RLHF Alignment Safety and Robustness
RM-Bench evaluates reward models on their ability to detect subtle errors and resist style biases (like length) using paired responses generated by the same powerful model.
Core Problem
Existing reward model benchmarks often compare responses from models of vastly different capabilities (strong vs. weak), making the distinction too easy and failing to test sensitivity to subtle errors or resistance to style hacking.
Why it matters:
  • Reward models are the critical signal for aligning LLMs via RLHF; if they fail, the policy model learns incorrect behaviors
  • Current benchmarks have low correlation with actual policy model performance because they don't capture the subtle distinctions needed during training
  • Reward models are prone to 'style over substance' bias, preferring longer or better-formatted answers even if they contain factual errors
Concrete Example: A reward model might prefer a long, markdown-formatted response that is factually incorrect (e.g., claiming a wrong historical date) over a concise, plain-text response that is correct, simply because of the style bias.
Key Novelty
Style-Controlled Sensitivity Benchmarking
  • Generates both chosen and rejected responses using the *same* powerful model (GPT-4o) to ensure high quality and subtle differences, rather than pairing strong vs. weak models
  • Introduces controlled style variations (Concise, Detailed, Markdown) for every prompt to explicitly test if reward models can separate substance from style
  • Stress-tests precise discrimination by injecting subtle factual errors into otherwise high-quality responses, so the reward model must detect the flaw rather than rely on surface quality
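The style-controlled setup above can be sketched as a small evaluation loop. This is a hypothetical illustration, not RM-Bench's actual code: `score`, the prompt dictionaries, and the style names are stand-ins, and the "hard" cells are taken to be those where the correct answer is styled more plainly than the incorrect one.

```python
from itertools import product

# Hypothetical style labels, ordered from plainest to fanciest presentation.
STYLES = ["concise", "detailed_plain", "detailed_markdown"]

def accuracy_matrix(score, prompts):
    """For every (chosen-style, rejected-style) pair, the fraction of prompts
    where the reward model ranks the correct response above the subtly
    incorrect one. `score(prompt, response)` is a stand-in scoring function."""
    matrix = {}
    for cs, rs in product(STYLES, STYLES):
        wins = sum(
            score(p, p["chosen"][cs]) > score(p, p["rejected"][rs])
            for p in prompts
        )
        matrix[(cs, rs)] = wins / len(prompts)
    return matrix

def hard_accuracy(matrix):
    """Average over the cells where style and substance conflict: the correct
    response is styled more plainly than the incorrect one."""
    cells = [matrix[(cs, rs)]
             for cs, rs in product(STYLES, STYLES)
             if STYLES.index(cs) < STYLES.index(rs)]
    return sum(cells) / len(cells)
```

A reward model that tracks substance keeps hard accuracy near its overall accuracy; one that rewards length or markdown formatting collapses in exactly these cells, which is the failure mode the benchmark isolates.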
Evaluation Highlights
  • State-of-the-art reward models achieve only 46.6% accuracy under style bias interference (worse than random guessing)
  • Even the massive Nemotron-340B-Reward model struggles, achieving only 69.5% overall accuracy on RM-Bench
  • DPO (Direct Preference Optimization) models generally outperform sequence-classification reward models on this benchmark
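Scoring responses with a DPO model relies on its implicit reward, r(x, y) = β · log(π(y|x) / π_ref(y|x)), which is how such models can be ranked on a reward-model benchmark at all. A minimal sketch, assuming the summed token log-probabilities under the DPO model and its reference are already available (the function names here are illustrative, not from the paper):

```python
def implicit_dpo_reward(logp_policy, logp_ref, beta=0.1):
    """Implicit reward of a DPO-trained model:
    r(x, y) = beta * log( pi(y|x) / pi_ref(y|x) )
    `logp_policy` and `logp_ref` are the summed log-probabilities of the
    response under the DPO model and the frozen reference model."""
    return beta * (logp_policy - logp_ref)

def prefers_chosen(lp_chosen, lp_ref_chosen,
                   lp_rejected, lp_ref_rejected, beta=0.1):
    """True if the implicit reward ranks the chosen response above the
    rejected one -- the pairwise accuracy criterion used for benchmarking."""
    return (implicit_dpo_reward(lp_chosen, lp_ref_chosen, beta)
            > implicit_dpo_reward(lp_rejected, lp_ref_rejected, beta))
```

Note that only the log-probability *gap* relative to the reference matters, so the comparison is insensitive to uniform shifts in response likelihood.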
Breakthrough Assessment
8/10
Significantly exposes the fragility of current reward models regarding style bias. The methodology of using the same generator for chosen/rejected pairs addresses a major flaw in prior benchmarks.