
Generative Reward Models

Dakota Mahan, Duy Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Franken, Chelsea Finn, Alon Albalak
SynthLabs, Stanford University
arXiv.org (2024)
RL Reasoning Benchmark

📝 Paper Summary

Reward Modeling RLHF / RLAIF
GenRM trains a reward model to generate a reasoning trace before rendering a verdict, using an iterative self-teaching loop, and achieves better out-of-distribution generalization than standard discriminative classifiers.
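To make the formulation concrete, here is a minimal sketch of generative reward modeling at inference time: the judge emits a chain-of-thought rationale and then a final verdict token, which is parsed into a preference. The prompt template, the `Verdict: A/B` format, and the `generate` stub (a stand-in for a real LLM call) are illustrative assumptions, not the paper's exact implementation.

```python
import re

# Hypothetical judge prompt; the verdict-token format is an assumption for this sketch.
JUDGE_TEMPLATE = (
    "Compare the two responses to the prompt.\n"
    "Prompt: {prompt}\nResponse A: {a}\nResponse B: {b}\n"
    "Reason step by step, then end with 'Verdict: A' or 'Verdict: B'."
)

def generate(judge_prompt: str) -> str:
    # Stand-in for a real LLM call; returns a canned rationale for illustration.
    return ("Response A ignores the length constraint, while Response B "
            "follows the instruction.\nVerdict: B")

def genrm_preference(prompt: str, a: str, b: str) -> str:
    """Return 'A' or 'B' by parsing the verdict token emitted after the rationale."""
    output = generate(JUDGE_TEMPLATE.format(prompt=prompt, a=a, b=b))
    match = re.search(r"Verdict:\s*([AB])", output)
    if match is None:
        raise ValueError("no verdict token in judge output")
    return match.group(1)
```

Unlike a discriminative reward model, nothing here produces a scalar score; the preference is read off the generated text itself.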
Core Problem
Standard reward models (discriminative classifiers) perform well on in-distribution data but generalize poorly to new distributions, while LLM-as-a-judge approaches are more robust but are not aligned with specific human preferences.
Why it matters:
  • RLHF relies on accurate reward models to guide policy optimization; if the reward model fails on out-of-distribution data, the resulting LLM may be misaligned
  • Collecting human preference data is resource-intensive, making high-quality synthetic preferences (RLAIF) crucial for scaling
  • Current hybrid methods struggle to combine the in-distribution accuracy of trained reward models with the reasoning capabilities of large language models
Concrete Example: In a case study (Figure 3), a standard LLM judge incorrectly prefers a detailed response about '2 animals' that ignores the length constraint. The proposed STaR-DPO model generates a rationale explicitly noting the failure to follow instructions ('lacks depth... does not follow instruction') and correctly prefers the compliant response.
Key Novelty
Generative Reward Models (GenRM) with Self-Taught Reasoning
  • Reformulates reward modeling as a generative task where the model produces a Chain-of-Thought rationale followed by a preference token, rather than outputting a scalar score
  • Uses a STaR (Self-Taught Reasoner) loop to bootstrap training data: the model generates its own rationales, filters for those leading to correct ground-truth labels, and trains on them
  • Applies DPO (Direct Preference Optimization) to the reasoning traces themselves, optimizing the model to prefer rationales that result in correct judgments over those that do not
Evaluation Highlights
  • STaR-DPO achieves 91.0% accuracy on RewardBench Safety, significantly outperforming the best baseline PairRM (81.8%)
  • On RewardBench Reasoning tasks, STaR-DPO scores 87.2%, surpassing the standard generative reward model (GenRM) which scores only 70.8%
  • Maintains in-distribution performance parity with Bradley-Terry models (73.9% vs ~74%) while outperforming them on out-of-distribution tasks (81.9% vs <60% for BT)
Breakthrough Assessment
8/10
Significantly improves reward model robustness and OOD generalization by applying reasoning-based self-training (STaR) and preference optimization (DPO) to the evaluation process itself.