DPO: Direct Preference Optimization—a stable method for training language models directly on preference pairs, without fitting an explicit reward model or running a reinforcement learning loop
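The DPO objective can be made concrete with a small sketch. It computes the standard DPO loss for a single preference pair from summed token log-probabilities under the policy and a frozen reference model; the function name, argument names, and the example log-probability values are illustrative, not from any particular implementation.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Arguments are summed token log-probabilities of the chosen and
    rejected responses under the current policy and under a frozen
    reference model (usually the SFT checkpoint).
    """
    # Implicit rewards: beta times the log-ratio against the reference
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the reward margin
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probs where the policy prefers the chosen
# response more strongly than the reference does, so the loss is small:
loss = dpo_loss(-12.0, -20.0, -14.0, -18.0)
```

Note the reference model appears only through the two log-ratios; this is what removes the separate reward-model training stage that RLHF requires.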
RLHF: Reinforcement Learning from Human Feedback—aligning models by training them to maximize a reward signal derived from human preferences
off-policy: Training a model using data generated by a different model (the behavior policy) rather than the model currently being trained
on-policy: Training a model using data generated by the model itself during the training process
distributional gap: The difference between the statistical distribution of data used for training and the distribution of data the model would naturally generate
AlpacaEval 2: A benchmark for evaluating instruction-following capabilities of LLMs, using an LLM-based judge to compare model outputs against a baseline
MT-bench: A benchmark consisting of multi-turn conversation questions to evaluate LLMs on reasoning, coding, and roleplay
SFT: Supervised Fine-Tuning—the initial training phase where a model learns to follow instructions from labeled examples before RLHF
hybrid RL setting: A training setup that mixes static off-policy preference data with new on-policy samples generated by the current model
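The hybrid setting above can be sketched as a batch-assembly routine that mixes the two data sources. This is a minimal illustration, not a specific published recipe: `hybrid_batch`, `generate_fn`, and the dictionary fields are hypothetical names, and the 50/50 mixing ratio is an arbitrary example.

```python
import random

def hybrid_batch(offline_pairs, generate_fn, batch_size=8, on_policy_frac=0.5):
    """Assemble one training batch that mixes static off-policy
    preference pairs with fresh on-policy samples.

    `offline_pairs` is a list of pre-collected preference records;
    `generate_fn` is a hypothetical callable that, given a prompt,
    samples a new response pair from the current model.
    """
    n_on = int(batch_size * on_policy_frac)
    # Off-policy portion: reuse stored pairs from the behavior policy
    batch = random.sample(offline_pairs, batch_size - n_on)
    # On-policy portion: regenerate responses with the current model,
    # which narrows the distributional gap during training
    prompts = [p["prompt"] for p in random.sample(offline_pairs, n_on)]
    batch += [generate_fn(prompt) for prompt in prompts]
    random.shuffle(batch)
    return batch
```

Raising `on_policy_frac` trades cheap reuse of the static dataset for costlier generation that better matches the model's own output distribution.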