
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li
Fudan University, Tsinghua University, Hong Kong University of Science and Technology
International Conference on Learning Representations (2025)
Benchmark RL Factuality

📝 Paper Summary

Reward Model Evaluation RLHF Alignment Safety and Robustness
RM-Bench evaluates reward models on their ability to detect subtle errors and resist style biases (like length) using paired responses generated by the same powerful model.
Core Problem
Existing reward model benchmarks often compare responses from models of vastly different capabilities (strong vs. weak), making the distinction too easy and failing to test sensitivity to subtle errors or resistance to style hacking.
Why it matters:
  • Reward models are the critical signal for aligning LLMs via RLHF; if they fail, the policy model learns incorrect behaviors
  • Current benchmarks have low correlation with actual policy model performance because they don't capture the subtle distinctions needed during training
  • Reward models are prone to 'style over substance' bias, preferring longer or better-formatted answers even if they contain factual errors
Concrete Example: A reward model might prefer a long, markdown-formatted response that is factually incorrect (e.g., claiming a wrong historical date) over a concise, plain-text response that is correct, simply because of the style bias.
Key Novelty
Style-Controlled Sensitivity Benchmarking
  • Generates both chosen and rejected responses using the *same* powerful model (GPT-4o) to ensure high quality and subtle differences, rather than pairing strong vs. weak models
  • Introduces controlled style variations (Concise, Detailed, Markdown) for every prompt to explicitly test if reward models can separate substance from style
  • Stress-tests precise discrimination by injecting subtle factual errors into otherwise high-quality responses, so the reward model must detect the flaw rather than rely on surface quality
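The style-controlled setup above can be sketched as a small evaluation loop. This is a hypothetical illustration, not RM-Bench's actual code: `score`, the prompt dictionaries, and the style names are stand-ins, and the "hard" cells are taken to be those where the correct answer is styled more plainly than the incorrect one.

```python
from itertools import product

# Hypothetical style labels, ordered from plainest to fanciest presentation.
STYLES = ["concise", "detailed_plain", "detailed_markdown"]

def accuracy_matrix(score, prompts):
    """For every (chosen-style, rejected-style) pair, the fraction of prompts
    where the reward model ranks the correct response above the subtly
    incorrect one. `score(prompt, response)` is a stand-in scoring function."""
    matrix = {}
    for cs, rs in product(STYLES, STYLES):
        wins = sum(
            score(p, p["chosen"][cs]) > score(p, p["rejected"][rs])
            for p in prompts
        )
        matrix[(cs, rs)] = wins / len(prompts)
    return matrix

def hard_accuracy(matrix):
    """Average over the cells where style and substance conflict: the correct
    response is styled more plainly than the incorrect one."""
    cells = [matrix[(cs, rs)]
             for cs, rs in product(STYLES, STYLES)
             if STYLES.index(cs) < STYLES.index(rs)]
    return sum(cells) / len(cells)
```

A reward model that tracks substance keeps hard accuracy near its overall accuracy; one that rewards length or markdown formatting collapses in exactly these cells, which is the failure mode the benchmark isolates.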
Evaluation Highlights
  • State-of-the-art reward models achieve only 46.6% accuracy under style bias interference (worse than random guessing)
  • Even the massive Nemotron-340B-Reward model struggles, achieving only 69.5% overall accuracy on RM-Bench
  • DPO (Direct Preference Optimization) models generally outperform sequence-classification reward models on this benchmark
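Scoring responses with a DPO model relies on its implicit reward, r(x, y) = β · log(π(y|x) / π_ref(y|x)), which is how such models can be ranked on a reward-model benchmark at all. A minimal sketch, assuming the summed token log-probabilities under the DPO model and its reference are already available (the function names here are illustrative, not from the paper):

```python
def implicit_dpo_reward(logp_policy, logp_ref, beta=0.1):
    """Implicit reward of a DPO-trained model:
    r(x, y) = beta * log( pi(y|x) / pi_ref(y|x) )
    `logp_policy` and `logp_ref` are the summed log-probabilities of the
    response under the DPO model and the frozen reference model."""
    return beta * (logp_policy - logp_ref)

def prefers_chosen(lp_chosen, lp_ref_chosen,
                   lp_rejected, lp_ref_rejected, beta=0.1):
    """True if the implicit reward ranks the chosen response above the
    rejected one -- the pairwise accuracy criterion used for benchmarking."""
    return (implicit_dpo_reward(lp_chosen, lp_ref_chosen, beta)
            > implicit_dpo_reward(lp_rejected, lp_ref_rejected, beta))
```

Note that only the log-probability *gap* relative to the reference matters, so the comparison is insensitive to uniform shifts in response likelihood.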
Breakthrough Assessment
8/10
Significantly exposes the fragility of current reward models regarding style bias. The methodology of using the same generator for chosen/rejected pairs addresses a major flaw in prior benchmarks.