RLHF: Reinforcement Learning from Human Feedback—a two-stage process of training a reward model on preferences, then optimizing a policy using RL (e.g., PPO)
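The first stage of RLHF fits the reward model with a Bradley–Terry preference loss. A minimal sketch of that per-pair loss (scalar toy version; real implementations batch this over a neural reward model):

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    # Bradley-Terry model: P(chosen preferred) = sigmoid(r_chosen - r_rejected).
    # The reward model is trained to minimize the negative log-likelihood
    # of the observed human preference.
    diff = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))
```

When the reward model assigns both responses the same score, the loss is log 2 (a coin flip); it shrinks as the chosen response's reward pulls ahead.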
DPO: Direct Preference Optimization—a method that optimizes the policy directly on preference data by reparameterizing the implicit reward as a function of the policy, bypassing explicit reward modeling
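DPO's implicit reward is r(x, y) = β · log(π(y|x) / π_ref(y|x)), and the loss is a logistic loss on the implicit-reward margin between chosen and rejected responses. A minimal scalar sketch (real implementations operate on batched sequence log-probabilities):

```python
import math

def dpo_loss(logp_pi_w, logp_pi_l, logp_ref_w, logp_ref_l, beta=0.1):
    # Implicit reward margin: beta * [log(pi/pi_ref) on chosen
    #                               - log(pi/pi_ref) on rejected]
    margin = beta * ((logp_pi_w - logp_ref_w) - (logp_pi_l - logp_ref_l))
    # Negative log-sigmoid of the margin, as in the DPO objective
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note that no separate reward model appears: the policy's own log-probability ratios against the reference play that role.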
Realizability: The assumption that the true function (reward or optimal policy) exists within the chosen family of models (e.g., neural networks of a certain size)
Model Mis-specification: The scenario where the true function (reward or policy) cannot be perfectly represented by the model class, leading to approximation errors
Online DPO: A variant of DPO where preference data is generated on-the-fly by the current policy rather than being fixed offline
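The online variant's training loop can be sketched as follows (all callables here—`policy_sample`, `judge`, `update`—are hypothetical placeholders for the current policy's sampler, a preference oracle, and a DPO gradient step):

```python
def online_dpo_step(policy_sample, judge, update, prompt):
    # Draw a fresh response pair from the CURRENT policy,
    # rather than reading from a fixed offline preference set
    y1, y2 = policy_sample(prompt), policy_sample(prompt)
    # Label the pair with a preference oracle (human or reward-model judge)
    chosen, rejected = (y1, y2) if judge(prompt, y1, y2) else (y2, y1)
    # Take one DPO gradient step on the freshly labeled pair
    update(prompt, chosen, rejected)
    return chosen, rejected
```

The key difference from offline DPO is only where the pairs come from: each step's data reflects the policy being trained.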
Isomorphic: In this context, meaning the reward model class and policy model class have equivalent representational capacity (one can be mapped to the other)
PILAF Sampler: A specific sampling strategy for Online DPO that mixes standard sampling with importance sampling based on reward differences to better approximate the objective
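The mixing idea can be illustrated with a toy sampler; the function name, mixing ratio, and exponential reward weighting below are illustrative assumptions, not the paper's exact scheme:

```python
import math
import random

def pilaf_style_sample(candidates, rewards, mix=0.5, beta=1.0):
    # Toy sketch of a mixed sampler: with probability `mix`, take a
    # standard (uniform) draw from the policy's candidate responses
    if random.random() < mix:
        return random.choice(candidates)
    # Otherwise, an importance-style draw: upweight candidates by
    # exp(beta * reward), so large reward differences are sampled
    # more often (assumed weighting for illustration)
    weights = [math.exp(beta * r) for r in rewards]
    return random.choices(candidates, weights=weights, k=1)[0]
```

The motivation is that pairs with informative reward gaps contribute more signal to the preference objective than pairs the policy already treats as near-ties.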