
Provably Mitigating Corruption, Overoptimization, and Verbosity Simultaneously in Offline and Online RLHF/DPO Alignment

Ziyi Chen, Junyi Li, Peiran Yu, Heng Huang
University of Maryland, College Park; Amazon; University of Texas at Arlington
arXiv (2025)
Tags: RL · P13N · Reasoning

📝 Paper Summary

DPO-COV introduces a unified objective function that simultaneously handles noisy data, reward hacking, and output verbosity in both offline and online alignment settings with provable generalization guarantees.
Core Problem
Current alignment methods like RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) suffer from three distinct failure modes: learning from corrupted labels, generating high-reward but low-quality text (overoptimization), and preferring excessively long responses (verbosity).
Why it matters:
  • Real-world preference data is often noisy or even maliciously labeled; left unfiltered, it can mislead models into generating harmful content
  • Models often 'game' the reward function, producing gibberish or repetitive text that scores high mathematically but is useless to humans
  • Standard alignment tends to bias models toward verbose answers, wasting compute and degrading user experience
  • Existing solutions typically address only one issue at a time or require computationally expensive reward ensembles
Concrete Example: When fine-tuning for content moderation, a malicious annotator might label hate speech as 'preferred,' and a vanilla DPO model would learn to generate hate speech (Corruption). Simultaneously, the model might learn that longer responses score higher, producing a 500-word essay where a 'Yes/No' would suffice (Verbosity), one that reads as confident yet is factually wrong (Overoptimization).
Key Novelty
RLHF-COV / DPO-COV (Corruption, Overoptimization, Verbosity)
  • Integrates a sparse noise model directly into the loss to absorb label corruption, preventing the policy from learning from outliers
  • Applies a pessimistic regularizer (offline) to penalize out-of-distribution samples and an optimistic regularizer (online) to encourage exploration, mitigating overoptimization
  • Incorporates an explicit length penalty into the value function formulation to counteract the model's natural bias toward verbosity
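To make the three mechanisms concrete, here is a minimal sketch of how they can combine into a single per-example objective. This is my own illustrative code, not the authors' implementation: the names `corruption`, `alpha`, and `lam`, and the exact way the terms are combined, are assumptions; the offline pessimism / online optimism regularizers are described in comments only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_cov_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 len_w, len_l, corruption=0.0,
                 beta=0.1, alpha=0.01, lam=0.05):
    """Illustrative per-example DPO-COV-style loss (hypothetical form).

    logp_w / logp_l:          policy log-probs of the preferred / rejected response
    ref_logp_w / ref_logp_l:  reference-model log-probs of the same responses
    len_w / len_l:            response lengths (tokens)
    corruption:               per-example slack variable absorbing label noise
    beta:                     DPO temperature on the implicit reward margin
    alpha:                    length-penalty weight (verbosity mitigation)
    lam:                      L1 weight keeping corruption slack sparse
    """
    # Implicit rewards with an explicit length penalty (verbosity fix)
    r_w = beta * (logp_w - ref_logp_w) - alpha * len_w
    r_l = beta * (logp_l - ref_logp_l) - alpha * len_l
    # Corruption slack shifts the margin before the logistic loss, so an
    # outlier label can be "explained away" instead of fitted (corruption fix);
    # the L1 penalty keeps the slack zero for clean examples.
    margin = (r_w - r_l) + corruption
    # A pessimistic (offline) or optimistic (online) regularizer on
    # out-of-distribution samples would be added here (overoptimization fix).
    return -math.log(sigmoid(margin)) + lam * abs(corruption)
```

With equal lengths and zero corruption this reduces to the standard DPO logistic loss; a longer preferred response raises the loss, and a learned positive `corruption` lets the model discount a suspect label at the cost of the sparsity penalty.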
Evaluation Highlights
  • Achieves a 7.61% length-controlled win rate against GPT-4 on the Argilla-DPO-Mix-7K dataset, outperforming vanilla DPO (6.29%) and single-issue baselines
  • Proves generalization error rates of O(log(N)/√N) for offline training on corrupted data, matching theoretical rates for clean data
  • Demonstrates mathematical equivalence between the proposed RLHF-COV (reward modeling) and DPO-COV (direct policy) formulations
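For concreteness, the offline guarantee cited above can be paraphrased as follows; this is my restatement under assumed regularity and coverage conditions, not a quotation of the paper's theorem:

```latex
% Hedged paraphrase: given N (possibly corrupted) preference pairs,
% the learned policy \hat{\pi} satisfies, with high probability,
\mathrm{SubOpt}(\hat{\pi}) \;=\; O\!\left(\frac{\log N}{\sqrt{N}}\right),
% i.e. the same N^{-1/2} rate (up to the logarithmic factor) as
% known generalization bounds for training on clean preference data.
```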
Breakthrough Assessment
8/10
Provides a theoretically grounded unification of three major alignment problems. The proof of generalization under corruption is significant, though empirical evaluation is limited to one offline dataset in the main text.