DPO: Direct Preference Optimization—an algorithm that aligns a language model with preference data by optimizing the policy directly, treating the policy's log-probability ratios against a reference model as an implicit reward and thereby avoiding a separately trained reward model.
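As an illustration of the definition above, here is a minimal sketch of the DPO loss for a single preference pair, written with plain floats rather than tensors; the function name and the example log-probabilities are illustrative, not from the source.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (sketch).

    logp_w / logp_l are the total log-probabilities of the preferred
    and dispreferred completions under the policy; ref_logp_* are the
    same quantities under the frozen reference (SFT) model.
    """
    # Implicit reward margin: beta times the difference of the two
    # policy-vs-reference log-ratios.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss falls as the policy raises the preferred completion's
# log-probability relative to the dispreferred one.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Raising `logp_w` (or lowering `logp_l`) increases the margin and strictly decreases the loss, which is the behavior the definition describes.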
DPOP: DPO-Positive—the proposed variation of DPO that adds a penalty term to the loss, discouraging the policy from reducing the probability of preferred completions below that of the reference model.
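A sketch of how the DPOP penalty modifies the DPO objective, under the assumption that the penalty term sits inside the sigmoid alongside the preference margin; the function name, the `lam` weight, and the exact placement are illustrative.

```python
import math

def dpop_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, lam=5.0):
    """DPOP loss for one preference pair (sketch).

    Extends the DPO margin with lam * max(0, ref_logp_w - logp_w),
    a penalty that activates only when the policy assigns the
    preferred completion *less* probability than the reference model.
    """
    penalty = lam * max(0.0, ref_logp_w - logp_w)
    margin = beta * ((logp_w - ref_logp_w)
                     - (logp_l - ref_logp_l)
                     - penalty)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy keeps the preferred completion at least as likely as the reference does, the penalty is zero and the loss reduces to plain DPO; once the preferred completion's probability drops below the reference, the penalty grows and the loss rises.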
SFT: Supervised Fine-Tuning—the initial phase of training a model on high-quality demonstration data before preference alignment.
Edit Distance: A measure of how dissimilar two strings are (e.g., the number of token changes needed to transform one into the other).
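The measure above is commonly computed as the Levenshtein distance via dynamic programming; a minimal sketch (the function name is illustrative) that works on strings or on token lists:

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of insertions,
    deletions, and substitutions needed to turn sequence a into b."""
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x -> y
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # → 3
```

Passing lists of tokens instead of strings gives the token-level edit distance mentioned in the definition.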
Logits: The raw, unnormalized scores output by the final layer of the neural network before the softmax function converts them to probabilities.
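The softmax conversion mentioned in the definition can be sketched in a few lines; subtracting the maximum logit first is the standard numerical-stability trick:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution.

    Subtracting the max logit before exponentiating avoids overflow
    without changing the result.
    """
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# The probabilities sum to 1 and preserve the ordering of the logits.
probs = softmax([2.0, 1.0, 0.1])
```

Note that softmax is shift-invariant: adding a constant to every logit leaves the probabilities unchanged, which is why only logit *differences* matter.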
RLHF: Reinforcement Learning from Human Feedback—a method to align models using a learned reward model and reinforcement learning algorithms like PPO.
Plackett-Luce model: A probabilistic model for ranking items; it generalizes the pairwise Bradley-Terry model and serves as the theoretical basis for the implicit reward in DPO.
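A sketch of the Plackett-Luce ranking probability, assuming the standard formulation in which each position is filled with probability proportional to the exponentiated score among the items not yet ranked; the function name is illustrative.

```python
import math

def plackett_luce_prob(scores_in_rank_order):
    """Probability of a full ranking under the Plackett-Luce model.

    scores_in_rank_order lists the items' scores from first place to
    last; position i is chosen with probability exp(score) divided by
    the sum of exp(score) over all items not yet ranked.
    """
    weights = [math.exp(s) for s in scores_in_rank_order]
    prob = 1.0
    for i in range(len(weights)):
        prob *= weights[i] / sum(weights[i:])
    return prob
```

With exactly two items this reduces to the Bradley-Terry pairwise probability, sigmoid of the score difference, which is the two-completion case used in DPO's derivation.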