On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization

📝 Paper Summary

LLM Alignment, Reward Modeling, Direct Preference Optimization (DPO)

The implicit reward model defined by DPO generalizes significantly worse to out-of-distribution prompts and responses than explicitly trained reward models, motivating the use of explicit rewards in iterative alignment.

Core Problem

DPO aligns models without an explicit reward model (EXRM), but it is unclear if the implicit reward model (DPORM) learned during this process generalizes well to unseen data.

Why it matters:

Poor reward generalization leads to over-optimization and reward hacking when the policy model encounters OOD data during training
Iterative alignment methods (like Iterative DPO) rely on the reward model to label new model generations; if the reward model fails on OOD data, the entire alignment process degrades

Concrete Example: When a reward model trained on UltraFeedback is used to score responses from a different dataset such as Reddit Summarization (Prompt Shift), DPORM's accuracy drops significantly more than EXRM's, so it may mislabel which summary is preferred.

Key Novelty

Systematic Generalization Audit of DPO vs. Explicit Reward Models

Conducts extensive experiments comparing explicit reward models (EXRM) against DPO's implicit reward (DPORM) across 5 train-test shifts and 3 model scales
Isolates specific types of distribution shifts: Prompt Shift (different domains) and Response Shift (different generator models)
Demonstrates that while DPORM fits training data well, it lacks the robustness of EXRM, justifying hybrid approaches like Iterative DPO with explicit rewards

Architecture

Overview of the RLHF and DPO pipelines, contrasting Explicit Reward Model training (RLHF) with Implicit Reward formulation (DPO).

Evaluation Highlights

Across 5 out-of-distribution settings, DPORM suffers a mean accuracy drop of 3% and a maximum drop of 7% compared to EXRM
EXRM achieves a higher win rate (accuracy > 50%) than DPORM in over 90% of out-of-distribution experiments
In iterative DPO alignment, labeling with an EXRM instead of DPORM yields a win rate on AlpacaEval that is 10.3 percentage points higher (57.8% vs 47.5%)

Breakthrough Assessment

4/10

This is a rigorous empirical analysis rather than a new method. It provides crucial insights into the limitations of DPO, challenging the assumption that implicit rewards are sufficient for robust alignment.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of preference pairs (chosen vs. rejected) under distribution shifts

Inputs: Prompt x and two responses (y_w, y_l)

Outputs: Reward scores r(x, y_w) and r(x, y_l) used to predict preference probability P(y_w > y_l)
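
The mapping from two reward scores to a preference probability can be sketched as follows. This is a minimal illustration assuming the Bradley-Terry model listed under Prerequisites; the function names are hypothetical and not taken from the paper.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability P(y_w > y_l) = sigmoid(r(x, y_w) - r(x, y_l))."""
    margin = reward_chosen - reward_rejected
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    z = math.exp(margin)
    return z / (1.0 + z)


def predicts_chosen(reward_chosen: float, reward_rejected: float) -> bool:
    """A pair counts as correctly classified when the chosen response scores higher."""
    return reward_chosen > reward_rejected
```

Only the difference between the two rewards matters, so any term that depends on the prompt alone cancels out of the prediction.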

Pipeline Flow

Training Phase: Train EXRM (classifier) and DPORM (implicit via DPO policy) on Source Data
Evaluation Phase: detailed in Experiments section
Alignment Phase (Iterative DPO): Generate responses -> Label with RM -> Update Policy
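
The Alignment Phase loop (generate, label, update) can be sketched as below. This is an illustrative outline only: `generate_fn`, `reward_fn`, and `dpo_update_fn` are hypothetical callables supplied by the caller, and pairing the highest-scored candidate against the lowest-scored one is an assumption, not a detail confirmed by the summary.

```python
from typing import Any, Callable, List, Sequence, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)


def label_with_reward_model(
    prompt: str,
    candidates: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> Pair:
    """Rank candidates by reward; best becomes 'chosen', worst becomes 'rejected'."""
    ranked = sorted(candidates, key=lambda response: reward_fn(prompt, response))
    return (prompt, ranked[-1], ranked[0])


def iterative_dpo(
    policy: Any,
    prompts: Sequence[str],
    generate_fn: Callable[[Any, str, int], List[str]],
    reward_fn: Callable[[str, str], float],
    dpo_update_fn: Callable[[Any, List[Pair]], Any],
    num_iterations: int = 3,      # hypothetical default
    samples_per_prompt: int = 2,  # hypothetical default
) -> Any:
    """Generate responses -> label with the reward model -> update the policy."""
    for _ in range(num_iterations):
        pairs: List[Pair] = []
        for prompt in prompts:
            candidates = generate_fn(policy, prompt, samples_per_prompt)
            pairs.append(label_with_reward_model(prompt, candidates, reward_fn))
        policy = dpo_update_fn(policy, pairs)
    return policy
```

In this sketch, `reward_fn` is the only place where the choice between EXRM and DPORM enters, which is why reward-model generalization directly affects the quality of the labels the policy is trained on.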

System Modules

Explicit Reward Model (EXRM) (Reward Modeling)

Predict scalar reward for prompt-response pairs

Model or implementation: Gemma-2B, Gemma-7B, Mistral-7B (initialized with SFT weights, added linear head)

Implicit Reward Model (DPORM) (Reward Modeling)

Derive reward score from policy probabilities

Model or implementation: Gemma-2B, Gemma-7B, Mistral-7B
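
As a rough sketch of how a reward score is derived from policy probabilities, the snippet below computes beta times the log-ratio between the DPO-trained policy and the reference policy for one response. It assumes per-token log-probabilities have already been extracted from both models; the function name and the list-of-floats input format are hypothetical.

```python
from typing import Sequence


def implicit_reward(
    policy_token_log_probs: Sequence[float],
    ref_token_log_probs: Sequence[float],
    beta: float = 0.03,  # value listed under Key Hyperparameters
) -> float:
    """DPORM score: beta * (log pi(y | x) - log pi_ref(y | x)).

    Each argument holds the log-probability of every response token
    under the respective model, conditioned on the prompt.
    """
    policy_log_prob = sum(policy_token_log_probs)  # log pi(y | x)
    ref_log_prob = sum(ref_token_log_probs)        # log pi_ref(y | x)
    return beta * (policy_log_prob - ref_log_prob)
```

Scoring a preference pair requires four forward passes in total (policy and reference on each response), whereas an EXRM needs one pass per response through a single model.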

Modeling

Base Model: Gemma-2B, Gemma-7B, Mistral-7B (all instruction-tuned versions)

Training Method: Direct Preference Optimization (DPO) and Explicit Reward Modeling

Objective Functions:

Purpose: Train EXRM to distinguish chosen/rejected.

Formally: Minimize -log(sigmoid(r(x, y_w) - r(x, y_l)))

Purpose: Train DPO policy (which induces DPORM).

Formally: Minimize -log(sigmoid(beta * log(pi(y_w | x) / pi_ref(y_w | x)) - beta * log(pi(y_l | x) / pi_ref(y_l | x))))
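
The two objectives above share the same outer form and differ only in how the margin inside the sigmoid is produced. A minimal per-example sketch in plain Python, with hypothetical function names and sequence-level log-probabilities assumed as inputs:

```python
import math


def neg_log_sigmoid(margin: float) -> float:
    """-log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def exrm_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Explicit reward model loss for one preference pair."""
    return neg_log_sigmoid(reward_chosen - reward_rejected)


def dpo_loss(
    policy_logp_chosen: float,
    ref_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_rejected: float,
    beta: float = 0.03,  # value listed under Key Hyperparameters
) -> float:
    """DPO loss for one pair; the margin is a difference of scaled log-ratios."""
    chosen_log_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_log_ratio = policy_logp_rejected - ref_logp_rejected
    return neg_log_sigmoid(beta * (chosen_log_ratio - rejected_log_ratio))
```

When the policy equals the reference model, both log-ratios are zero and the DPO loss equals log 2, the same value the EXRM loss takes when both responses receive equal reward.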

Key Hyperparameters:

learning_rate_rm: 5e-6
learning_rate_dpo: 1e-6
beta: 0.03
epochs_rm: 1
epochs_dpo: 2

Compute: Not reported in the paper

Comparison to Prior Work

vs. RLHF: This paper evaluates the reward component specifically, showing EXRM (used in RLHF) is more robust than DPO's implicit reward
vs. Standard DPO: Highlights a weakness in DPO (OOD generalization) that justifies the added complexity of Iterative DPO with explicit rewards

Limitations

Focuses only on 2B-7B parameter models; larger models might behave differently
Limited to English language tasks (chat, summarization, instruction following)
Does not explore code generation or reasoning tasks
Impact of base model pre-training data is uncontrolled/unknown

Reproducibility

Code: https://github.com/huggingface/trl

📊 Experiments & Results

Evaluation Setup

Pairwise accuracy in identifying the preferred response across In-Distribution (ID) and Out-of-Distribution (OOD) datasets

Benchmarks:

HH-RLHF (Dialogue preference)
UltraFeedback (General instruction following)
Summarisation (Text summarization, Reddit TL;DR)
AlpacaEval (Instruction following generation)

Metrics:

Accuracy (identifying chosen vs. rejected)
Win Rate (against baseline model in generation)
Statistical methodology: Mean and standard deviation over three random seeds
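
The accuracy metric and the seed-level aggregation can be sketched as below. This is an illustration under stated assumptions: ties are counted as errors, and the sample standard deviation is used, neither of which is specified in the summary. The function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Iterable, Sequence, Tuple


def pairwise_accuracy(reward_pairs: Iterable[Tuple[float, float]]) -> float:
    """Fraction of (reward_chosen, reward_rejected) pairs ranked correctly.

    Assumption: a tie (equal rewards) counts as an incorrect prediction.
    """
    pairs = list(reward_pairs)
    if not pairs:
        raise ValueError("reward_pairs must contain at least one pair")
    correct = sum(1 for r_chosen, r_rejected in pairs if r_chosen > r_rejected)
    return correct / len(pairs)


def summarize_over_seeds(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds (needs at least 2 values)."""
    return mean(accuracies), stdev(accuracies)
```

The same `pairwise_accuracy` function applies to both reward models: for EXRM the inputs are the scalar head outputs, and for DPORM they are the scaled policy/reference log-ratios.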

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Generalization experiments across mixed distribution shifts (Setting I). EXRM consistently outperforms DPORM on OOD data.
Average across 5 OOD datasets	Accuracy Drop	Not reported in the paper	Not reported in the paper	Not reported in the paper
Controlled shift experiments isolating prompt and response shifts.
Summarization (Train: Reddit, Test: CNN/DailyMail)	Accuracy	56.0	59.0	+3.0
Ultra-LM3-RLHF (OOD Responses)	Accuracy	60.0	65.0	+5.0
Iterative DPO alignment performance using different reward models for labeling.
AlpacaEval 2.0	Win Rate vs GPT-4 Turbo	32.6	57.8	+25.2
AlpacaEval 2.0	Win Rate vs GPT-4 Turbo	47.5	57.8	+10.3

Experiment Figures

Comparison of ID vs OOD accuracy for EXRM and DPORM across multiple datasets and shift types.

Main Takeaways

DPORM and EXRM have similar accuracy on In-Distribution (ID) data (roughly equal win rate), but DPORM degrades significantly on Out-of-Distribution (OOD) data.
The generalization gap exists for both Prompt Shift (new domains) and Response Shift (new generator models), with EXRM proving more robust in both cases.
The limited generalization of DPORM has downstream consequences: using it for Iterative DPO leads to significantly worse instruction-following policies compared to using an EXRM.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Bradley-Terry Model
Direct Preference Optimization (DPO)
Reward Modeling

Key Terms

EXRM: Explicit Reward Model—a separate classifier trained on preference data to predict a scalar reward for a prompt-response pair

DPORM: DPO Reward Model—the implicit reward function defined by the log-ratio of the DPO-trained policy and the reference policy

DPO: Direct Preference Optimization—an alignment method that optimizes the policy directly on preference data without training a separate reward model

RLHF: Reinforcement Learning from Human Feedback—a three-stage process involving supervised fine-tuning, reward modeling, and reinforcement learning (usually PPO)

OOD: Out-of-Distribution—data samples that differ significantly (in prompt domain or response style) from the training set

Iterative DPO: An alignment strategy where the model generates new responses, a reward model labels them, and DPO is applied to this new dataset iteratively

AlpacaEval: A benchmark for evaluating instruction-following models by comparing their responses against a reference model (often GPT-4) using an LLM judge