
Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization

Hyung Gyu Rho
arXiv (2025)
Tags: RL, P13N

📝 Paper Summary

Topics: Direct Preference Optimization (DPO), Reward Modeling
MADPO improves model alignment by using a pre-trained reward model to dynamically re-weight the optimization loss, amplifying learning from hard preference pairs while stabilizing training on easy ones.
Core Problem
Standard DPO uses a fixed temperature parameter that forces a compromise: it either overfits to easy, high-margin data or under-learns from hard, low-margin data.
Why it matters:
  • Fixed parameters cannot reconcile the tension between conservative updates needed for easy pairs and aggressive updates needed for hard pairs
  • Existing adaptive methods like IPO are overly conservative, while beta-DPO introduces instability and data inefficiency by filtering useful samples
Concrete Example: If annotators unanimously prefer Response A over B (easy pair), DPO pushes the probability ratio arbitrarily high, leading to overfitting. Conversely, for subtle preferences (hard pair), a conservative DPO update fails to capture the signal.
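The fixed-temperature behavior described above is easiest to see in the standard DPO objective itself. The sketch below (function and argument names are illustrative, not from the paper) shows where the single β enters: every preference pair, easy or hard, is scaled by the same constant.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss with one fixed temperature `beta`.

    Each argument is a tensor of summed token log-probabilities of the
    chosen/rejected responses under the policy and the frozen reference.
    """
    # Implicit reward margin: policy-vs-reference log ratios, differenced
    logits = (policy_chosen_logps - ref_chosen_logps) \
           - (policy_rejected_logps - ref_rejected_logps)
    # A single beta for every pair: easy pairs keep getting pushed
    # (overfitting), hard pairs receive the same too-weak gradient scale
    return -F.logsigmoid(beta * logits).mean()
```

Because the loss is monotone in the margin, an already well-separated (easy) pair still contributes gradient pushing its probability ratio higher, which is exactly the overfitting failure mode the example describes.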
Key Novelty
Margin-Adaptive Direct Preference Optimization (MADPO)
  • Uses a two-step process: first trains a standard reward model to estimate preference margins, then uses these margins to modulate the policy training loss
  • Employs a piecewise weight function that acts as an amplifier for low-margin (hard) pairs and a dampener for high-margin (easy) pairs
  • Introduces a stability mechanism that caps weights for negative margins to prevent gradient explosion on mislabeled or adversarial data
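The re-weighting and capping ideas in the bullets above can be sketched as follows. This is a hypothetical illustration of the mechanism, not the paper's exact weight function: the exponential form and the `cap` and `beta` values are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def madpo_weight(margin, cap=2.0):
    """Illustrative margin-adaptive weight (hypothetical form).

    `margin` is the preference margin estimated by the pre-trained
    reward model for each (chosen, rejected) pair.
    """
    # Weight grows as the estimated margin shrinks (hard pairs are
    # amplified) and decays as it grows (easy pairs are dampened).
    w = torch.exp(-margin)
    # Negative margins (likely mislabeled/adversarial pairs) would blow
    # the weight up exponentially; capping keeps their gradients bounded.
    return torch.clamp(w, max=cap)

def madpo_loss(policy_margin_logits, reward_margin, beta=0.1):
    """Weighted DPO-style loss: per-pair loss scaled by the margin weight."""
    per_pair = -F.logsigmoid(beta * policy_margin_logits)
    # detach(): the weight modulates learning but is not itself optimized
    return (madpo_weight(reward_margin).detach() * per_pair).mean()
```

The key design point is that the weight depends only on the reward model's margin estimate, computed once before policy training, so the modulation adds no extra models to the optimization loop.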
Evaluation Highlights
  • Achieves a +33.3% performance gain on the High Quality sentiment-generation dataset over the next-best method (beta-DPO)
  • Achieves a +10.5% performance gain on the Low Quality sentiment-generation dataset over beta-DPO
  • Demonstrates robustness to reward model estimation errors through theoretical analysis and empirical validation
Breakthrough Assessment
7/10
Offers a principled, theoretically grounded improvement over standard DPO with significant empirical gains. While an incremental evolution of DPO, the granular control mechanism addresses a well-known stability-plasticity dilemma in alignment.