RewardBench 2: Advancing Reward Model Evaluation

📝 Paper Summary

Reward Model Evaluation, RLHF (Reinforcement Learning from Human Feedback), Benchmarking

RewardBench2 introduces a challenging, multi-skill benchmark for reward models using unseen human prompts and 4-way comparison tasks, revealing significant performance drops in current models compared to previous benchmarks.

Core Problem

Existing reward model benchmarks often rely on reused prompts from downstream evaluations (leading to contamination) and fail to distinguish between strong models due to saturated performance on simple chosen/rejected pairs.

Why it matters:

Reward model evaluation has lagged behind progress in reward models themselves, so high benchmark scores often do not translate into better downstream policy performance
Users lack reliable signals to select the best reward model for specific post-training needs like RLHF or inference-time scaling
Overfitting to easy or contaminated benchmarks creates a false sense of alignment progress

Concrete Example: Current leading reward models score ~20 points lower on RewardBench2 than on the original RewardBench, exposing failures in domains such as Math and Precise Instruction Following, where accuracy falls below 70% and 40%, respectively.

Key Novelty

Harder 4-way ranking on unseen prompts

Shifts from binary classification (1 chosen vs. 1 rejected) to a 4-way task (1 chosen vs. 3 rejected), lowering the random baseline to 25% and increasing difficulty
Sources ~70% of prompts from unseen human queries (WildChat) rather than recycling prompts from existing benchmarks to prevent contamination
Introduces a 'Ties' domain to test if models can avoid arbitrary preferences between equally valid answers (e.g., 'red' vs. 'green' for 'Name a color of the rainbow')
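
To make the 4-way setup concrete, here is a minimal sketch. It assumes a sample counts as correct only when the chosen completion receives a strictly higher score than each of the three rejected ones. The function names and the scores are hypothetical and are not taken from the RewardBench2 codebase.

```python
from fractions import Fraction


def is_correct(chosen_score: float, rejected_scores: list[float]) -> bool:
    """True when the chosen completion strictly outscores every rejected one."""
    return all(chosen_score > r for r in rejected_scores)


def random_baseline(num_candidates: int) -> Fraction:
    """Chance that a scorer with no signal ranks the single chosen completion first."""
    return Fraction(1, num_candidates)


# Hypothetical scores for one prompt: 1 chosen vs. 3 rejected.
print(is_correct(2.1, [0.4, 1.9, -0.3]))       # True
print(is_correct(2.1, [0.4, 2.5, -0.3]))       # False: one rejected completion wins
print(random_baseline(2), random_baseline(4))  # 1/2 1/4
```

Moving from 2 to 4 candidates halves the chance level, which is why a score near 40% on a 4-way subset is only modestly above random.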

Architecture

Conceptual diagram of the RewardBench2 evaluation pipeline and its application in downstream tasks.

Evaluation Highlights

Leading reward models score ~20 points lower on average on RewardBench2 compared to RewardBench v1, indicating increased difficulty
In the Precise Instruction Following subset, leading models achieve below 40% accuracy (where random baseline is 25%)
Training reward models for 2 epochs (vs. standard 1 epoch) improves performance for 8 of the top 18 models evaluated

Breakthrough Assessment

8/10

Significantly raises the bar for RM evaluation by fixing contamination and saturation issues. The shift to 1-vs-3 ranking and inclusion of 'Ties' are methodologically strong updates that better reflect downstream needs.

⚙️ Technical Details

Problem Definition

Setting: Pairwise or multi-way preference modeling

Inputs: Prompt x and a set of completions {y_i}

Outputs: Scalar reward score r(x, y) or probability p(chosen > rejected)

Pipeline Flow

Prompt Sourcing (WildChat + synthetic)
Domain Classification & Filtering
Completion Generation (1 chosen, 3 rejected)
Verification (Manual/LLM-as-judge)
Scoring (Accuracy/Ranking)

System Modules

Prompt Sourcing (Data Construction)

Gather diverse, unseen prompts

Model or implementation: WildChat (Human prompts)

Completion Generation (Data Construction)

Generate 4 responses per prompt (1 chosen, 3 rejected)

Model or implementation: Various LLMs (pool of models)

Evaluation Logic

Score reward models based on ranking capability

Model or implementation: Target Reward Model

Novel Architectural Elements

Structure of evaluation samples: 1 chosen vs. 3 rejected responses (lowering random baseline to 25% from typical 50%)
Inclusion of 'Ties' subset where correct calibration between equally valid answers is measured
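
The summary does not give the exact Ties formula, so the following is only an illustrative sketch of what "calibration between equally valid answers" could mean. It assumes two checks: every valid answer outscores every invalid one, and the spread among valid answers is smaller than the gap to the best invalid answer. The function `ties_check`, both checks, and all scores are assumptions, not the paper's metric.

```python
def ties_check(valid_scores: list[float], invalid_scores: list[float]) -> dict:
    """Illustrative (assumed) checks for a Ties-style sample.

    accurate:   every valid answer outscores every invalid answer.
    calibrated: the spread among valid answers is smaller than the gap
                separating the worst valid answer from the best invalid one.
    """
    worst_valid = min(valid_scores)
    best_invalid = max(invalid_scores)
    spread = max(valid_scores) - worst_valid
    margin = worst_valid - best_invalid
    return {
        "accurate": margin > 0,
        "calibrated": margin > 0 and spread < margin,
    }


# Hypothetical scores for "Name a color of the rainbow".
valid = [3.0, 2.8]      # e.g. 'red', 'green'
invalid = [0.5, -1.0]   # e.g. answers that are not rainbow colors
print(ties_check(valid, invalid))  # {'accurate': True, 'calibrated': True}
```

A model that strongly prefers 'red' over 'green' would widen the spread and fail the second check even though it still ranks both above the wrong answers.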

Modeling

Base Model: Evaluates >100 models; controlled experiments use Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct, Qwen-2.5-7B-Instruct, Tulu-3-8B-SFT

Training Method: Bradley-Terry Reward Modeling (for controlled experiments)

Objective Functions:

Purpose: Maximize likelihood of chosen response scoring higher than rejected.

Formally: L(r) = -log(sigmoid(r(x, y_chosen) - r(x, y_rejected)))
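
As a worked illustration of the objective above, the sketch below evaluates the loss for a single chosen/rejected pair in plain Python. It rewrites -log(sigmoid(m)) as log(1 + exp(-m)) to avoid overflow for large margins. The function name and the example rewards are hypothetical; the actual training code in the paper's experiments is not shown in this summary.

```python
import math


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch so exp() never sees a large positive input.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward pairs (chosen, rejected).
print(round(bradley_terry_loss(1.0, 1.0), 4))  # 0.6931: no preference, loss is log(2)
print(round(bradley_terry_loss(3.0, 1.0), 4))  # 0.1269: chosen ranked higher, small loss
print(round(bradley_terry_loss(1.0, 3.0), 4))  # 2.1269: rejected ranked higher, large loss
```

The loss depends only on the difference between the two rewards, so adding a constant to every score leaves it unchanged.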

Training Data:

Tulu preference mix
Skywork preference mix

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper
epochs: Comparison of 1 vs. 2 epochs

Compute: Not reported in the paper

Comparison to Prior Work

vs. RewardBench v1: Uses 4-way ranking (1 vs 3) instead of binary, unseen prompts instead of reused ones, and new domains (Math, Focus, Ties)
vs. RM-Bench: Uses unseen prompts to prevent contamination and tests 'Ties' calibration [not cited in paper]
vs. Preference Proxy Evaluations: Avoids contamination by using new prompts rather than prompts from the downstream benchmarks themselves

Limitations

Accuracy-based evaluation may not fully capture 'vibes' or nuance preferred by some users
Reliance on LLM-based filtering/generation for some subsets introduces potential bias
Manual verification is resource-intensive, limiting dataset size (1,876 prompts)

Reproducibility

Code: https://github.com/allenai/reward-bench

Code and benchmark data are publicly available. Controlled training used the Open Instruct library. Specific hyperparameters for the trained models (learning rate, batch size, etc.) are not detailed in the text; they appear to be either standard defaults or varied in sweeps.

📊 Experiments & Results

Evaluation Setup

Zero-shot classification/ranking of 4 candidate responses per prompt

Benchmarks:

RewardBench2 (Reward Model Evaluation) [New]

Metrics:

Accuracy (1 chosen > 3 rejected)
Ties Score (Weighted accuracy + margin calibration)
Statistical methodology: Not explicitly reported in the paper
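
The accuracy metric can be aggregated per domain as in the sketch below. It assumes each sample has already been judged correct or incorrect, and it assumes the overall score is an unweighted mean over domains; the summary does not state the aggregation rule, and the Ties domain uses its own score rather than plain accuracy. All names and outcomes here are hypothetical.

```python
from collections import defaultdict


def accuracy_by_domain(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each domain to the fraction of its samples judged correct."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for domain, correct in results:
        totals[domain] += 1
        hits[domain] += int(correct)
    return {d: hits[d] / totals[d] for d in totals}


def overall_score(per_domain: dict[str, float]) -> float:
    """Unweighted mean over domains (an assumption, see the lead-in)."""
    return sum(per_domain.values()) / len(per_domain)


# Hypothetical per-sample outcomes, not real benchmark results.
results = [
    ("Math", True), ("Math", False),
    ("Precise IF", False), ("Precise IF", False),
    ("Precise IF", True), ("Precise IF", False),
]
per_domain = accuracy_by_domain(results)
print(per_domain)                 # {'Math': 0.5, 'Precise IF': 0.25}
print(overall_score(per_domain))  # 0.375
```

Averaging over domains rather than over samples keeps a small subset from being drowned out by larger ones.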

Key Results

Comparison of leading models between RewardBench v1 and RewardBench2 shows a significant drop in performance, highlighting the increased difficulty. Subset rows compare the 25% random baseline of the 4-way task with the upper bound on leading-model accuracy reported in the paper.
Benchmark	Metric	Random baseline	Leading models	Δ vs. random
RewardBench2	Average score drop vs. RewardBench v1	Not reported in the paper	Not reported in the paper	approx. -20.0
RewardBench2 (Math Subset)	Accuracy (%)	25.0	< 70.0	< +45.0
RewardBench2 (Precise IF Subset)	Accuracy (%)	25.0	< 40.0	< +15.0

Experiment Figures

Scatter plot comparing model performance on RewardBench v1 vs. RewardBench2.

Main Takeaways

Training reward models for more than one epoch (specifically 2) can be beneficial, contrary to the common practice of training for a single epoch to avoid overfitting.
Base model selection is critical; Llama 3.1 Instruct-based RMs generally perform well, but Qwen-based RMs dominate the Math domain.
Data mixing helps: Combining Skywork and Tulu preference data outperforms training on either alone across all base models.
Performance on RewardBench2 correlates well with downstream success in Best-of-N sampling and PPO training, validating its utility as a proxy metric.
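
For readers unfamiliar with Best-of-N sampling, the downstream use mentioned in the last takeaway, here is a minimal sketch: the reward model scores N candidate completions and the highest-scoring one is returned. The `toy_reward` scorer is a hypothetical stand-in for a real reward model, and the candidates are made up.

```python
from typing import Callable


def best_of_n(
    prompt: str,
    candidates: list[str],
    reward_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest reward for the given prompt."""
    return max(candidates, key=lambda y: reward_fn(prompt, y))


def toy_reward(prompt: str, completion: str) -> float:
    """Stand-in scorer (hypothetical): rewards completions containing '4'."""
    return 1.0 if "4" in completion else 0.0


picked = best_of_n("What is 2 + 2?", ["5", "The answer is 4.", "22"], toy_reward)
print(picked)  # The answer is 4.
```

Because the policy's final output is whichever candidate the reward model ranks first, errors in ranking the best response against several alternatives carry over directly, which matches the 1-vs-3 structure of the benchmark.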

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Bradley-Terry preference model
Language Model post-training pipelines

Key Terms

Reward Model (RM): A model trained to output a scalar score predicting the quality or human preference of a text response

Bradley-Terry model: A statistical model used to predict the probability that one item is preferred over another based on their latent reward scores

RLHF: Reinforcement Learning from Human Feedback—a method to align language models using a reward model trained on human preferences

BoN: Best-of-N sampling—an inference-time technique where a model generates N responses and a reward model selects the highest-scoring one

PPO: Proximal Policy Optimization—an RL algorithm used to train a policy model (LLM) to maximize the reward signal

WildChat: A dataset of real-world user-ChatGPT interactions used here as a source of unseen prompts

SFT: Supervised Fine-Tuning—the initial phase of training a model on high-quality instruction-response pairs

OOD: Out-of-Distribution—data that differs significantly from the data seen during training