PPO: Proximal Policy Optimization—a reward-based RL method in which an explicit reward model is first learned from preference data, and the policy is then optimized against that reward model using clipped actor-critic updates.
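The core of PPO is its clipped surrogate objective, which limits how far each update can move the policy from the one that collected the data. A minimal per-sample sketch (function name and default `clip_eps=0.2` are illustrative, not from the source):

```python
import math

def ppo_clipped_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    # Probability ratio between the current policy and the behavior policy.
    ratio = math.exp(logp_new - logp_old)
    # Clipped surrogate: take the pessimistic (minimum) of the two terms,
    # so the update gets no benefit from moving the ratio outside the clip range.
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    # Negated because optimizers minimize; PPO maximizes the surrogate.
    return -min(unclipped, clipped)
```

In practice this loss is averaged over a batch and combined with a value-function loss and an entropy bonus.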
DPO: Direct Preference Optimization—a reward-free method that optimizes the policy directly on preference data by deriving a closed-form solution for the optimal policy.
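DPO's closed-form derivation reduces alignment to a logistic loss on the policy's log-probability margins over chosen vs. rejected responses, measured relative to a frozen reference model. A minimal sketch of the per-pair loss (names and `beta=0.1` are illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit reward margin between the chosen and rejected responses,
    # each measured relative to the frozen reference policy.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: a logistic loss on the preference.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

No reward model and no RL rollout loop are needed; gradients flow directly through the policy's log-probabilities on the preference pairs.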
OOD: Out-of-Distribution—data samples (prompts or responses) that differ significantly from the training distribution.
SFT: Supervised Fine-Tuning—the initial phase of training an LLM on high-quality demonstration data before alignment.
EMA: Exponential Moving Average—a technique used here to update the reference model slowly, stabilizing training.
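The EMA reference update interpolates each reference parameter toward the current policy parameter with a decay rate close to 1, so the reference model trails the policy slowly. A minimal sketch over flat parameter lists (the `decay=0.999` value is an illustrative assumption):

```python
def ema_update(ref_params, policy_params, decay=0.999):
    # Move each reference-model parameter a small step toward the
    # corresponding policy parameter; decay near 1 means slow tracking.
    return [decay * r + (1.0 - decay) * p
            for r, p in zip(ref_params, policy_params)]
```

In a real training loop this runs in-place over the reference model's tensors after each optimizer step.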
advantage normalization: Rescaling the advantage estimates in PPO to have zero mean and unit variance, stabilizing the policy gradient updates.
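The zero-mean, unit-variance rescaling described above can be sketched over a batch of scalar advantage estimates (the small `eps` guards against division by zero and is a common convention, not specified in the source):

```python
def normalize_advantages(advantages, eps=1e-8):
    # Rescale a batch of advantage estimates to zero mean and unit variance,
    # which keeps the scale of policy-gradient updates consistent across batches.
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    std = var ** 0.5
    return [(a - mean) / (std + eps) for a in advantages]
```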
CodeContest: A challenging competitive programming dataset used for benchmarking code generation capabilities.
KL divergence: Kullback–Leibler divergence—an asymmetric measure of how one probability distribution diverges from another, used to penalize the aligned model for deviating too far from the base reference model.
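In RLHF-style training the KL regularizer is often estimated per token from the two models' log-probabilities on the sampled sequence. A minimal Monte Carlo sketch, assuming the tokens were sampled from the policy (function name is illustrative):

```python
def kl_penalty(policy_logps, ref_logps):
    # Monte Carlo estimate of KL(policy || reference) over one sampled
    # sequence: the mean of log pi(token) - log pi_ref(token) per token.
    diffs = [lp - lr for lp, lr in zip(policy_logps, ref_logps)]
    return sum(diffs) / len(diffs)
```

This estimate is zero when the two models agree on every sampled token, and it grows as the policy assigns its samples much higher probability than the reference does.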