
Provably Mitigating Corruption, Overoptimization, and Verbosity Simultaneously in Offline and Online RLHF/DPO Alignment

Ziyi Chen, Junyi Li, Peiran Yu, Heng Huang
University of Maryland, College Park; Amazon; University of Texas at Arlington
arXiv (2025)
Tags: RL · P13N · Reasoning

📝 Paper Summary

DPO-COV introduces a unified objective function that simultaneously handles noisy data, reward hacking, and output verbosity in both offline and online alignment settings with provable generalization guarantees.
Core Problem
Current alignment methods like RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) suffer from three distinct failure modes: learning from corrupted labels, generating high-reward but low-quality text (overoptimization), and preferring excessively long responses (verbosity).
Why it matters:
  • Real-world preference data is often noisy or even maliciously labeled; left unfiltered, it can mislead models into generating harmful content
  • Models often 'game' the reward function, producing gibberish or repetitive text that scores high mathematically but is useless to humans
  • Standard alignment tends to bias models toward verbose answers, wasting compute and degrading user experience
  • Existing solutions typically address only one issue at a time or require computationally expensive reward ensembles
Concrete Example: When fine-tuning for content moderation, a malicious annotator might label hate speech as 'preferred,' and a vanilla DPO model would learn to generate hate speech (Corruption). Simultaneously, the model might learn that longer responses score higher, producing a 500-word essay where a 'Yes/No' would suffice (Verbosity), one that reads as confident yet is factually wrong (Overoptimization).
Key Novelty
RLHF-COV / DPO-COV (Corruption, Overoptimization, Verbosity)
  • Integrates a sparse noise model directly into the loss to absorb label corruption, preventing the policy from learning from outliers
  • Applies a pessimistic regularizer (offline) to penalize out-of-distribution samples and an optimistic regularizer (online) to encourage exploration, mitigating overoptimization
  • Incorporates an explicit length penalty into the value function formulation to counteract the model's natural bias toward verbosity
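To make the three mechanisms concrete, here is a minimal sketch of how they can combine into a single per-example objective. This is my own illustrative code, not the authors' implementation: the names `corruption`, `alpha`, and `lam`, and the exact way the terms are combined, are assumptions; the offline pessimism / online optimism regularizers are described in comments only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_cov_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 len_w, len_l, corruption=0.0,
                 beta=0.1, alpha=0.01, lam=0.05):
    """Illustrative per-example DPO-COV-style loss (hypothetical form).

    logp_w / logp_l:          policy log-probs of the preferred / rejected response
    ref_logp_w / ref_logp_l:  reference-model log-probs of the same responses
    len_w / len_l:            response lengths (tokens)
    corruption:               per-example slack variable absorbing label noise
    beta:                     DPO temperature on the implicit reward margin
    alpha:                    length-penalty weight (verbosity mitigation)
    lam:                      L1 weight keeping corruption slack sparse
    """
    # Implicit rewards with an explicit length penalty (verbosity fix)
    r_w = beta * (logp_w - ref_logp_w) - alpha * len_w
    r_l = beta * (logp_l - ref_logp_l) - alpha * len_l
    # Corruption slack shifts the margin before the logistic loss, so an
    # outlier label can be "explained away" instead of fitted (corruption fix);
    # the L1 penalty keeps the slack zero for clean examples.
    margin = (r_w - r_l) + corruption
    # A pessimistic (offline) or optimistic (online) regularizer on
    # out-of-distribution samples would be added here (overoptimization fix).
    return -math.log(sigmoid(margin)) + lam * abs(corruption)
```

With equal lengths and zero corruption this reduces to the standard DPO logistic loss; a longer preferred response raises the loss, and a learned positive `corruption` lets the model discount a suspect label at the cost of the sparsity penalty.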
Evaluation Highlights
  • Achieves a 7.61% length-controlled win rate against GPT-4 on the Argilla-DPO-Mix-7K dataset, outperforming vanilla DPO (6.29%) and single-issue baselines
  • Proves generalization error rates of O(log(N)/√N) for offline training on corrupted data, matching theoretical rates for clean data
  • Demonstrates mathematical equivalence between the proposed RLHF-COV (reward modeling) and DPO-COV (direct policy) formulations
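For concreteness, the offline guarantee cited above can be paraphrased as follows; this is my restatement under assumed regularity and coverage conditions, not a quotation of the paper's theorem:

```latex
% Hedged paraphrase: given N (possibly corrupted) preference pairs,
% the learned policy \hat{\pi} satisfies, with high probability,
\mathrm{SubOpt}(\hat{\pi}) \;=\; O\!\left(\frac{\log N}{\sqrt{N}}\right),
% i.e. the same N^{-1/2} rate (up to the logarithmic factor) as
% known generalization bounds for training on clean preference data.
```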
Breakthrough Assessment
8/10
Provides a theoretically grounded unification of three major alignment problems. The proof of generalization under corruption is significant, though empirical evaluation is limited to one offline dataset in the main text.