DPO: Direct Preference Optimization—an alignment method that optimizes the policy directly from preference data without training an explicit reward model first
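To make the definition concrete, here is a minimal, stdlib-only sketch of the per-example DPO loss. The function name and the scalar log-probability inputs are illustrative assumptions, not part of the source; real implementations operate on batched token log-probs from the policy and a frozen reference model.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities (illustrative sketch).

    pi_*  : log-probs under the policy being trained
    ref_* : log-probs under the frozen reference (SFT) model
    beta  : temperature controlling deviation from the reference
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log sigmoid(beta * margin): shrinks as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

Note that no reward model appears anywhere: the preference signal enters only through the log-probability margin, which is what "optimizes the policy directly from preference data" means here.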
RLHF: Reinforcement Learning from Human Feedback—a technique to align models using rewards derived from human preferences
ASR: Attack Success Rate—the percentage of jailbreaking attempts that successfully elicit a harmful response from the model
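The metric itself is simple; a one-function sketch (the function name and boolean-outcome encoding are assumptions for illustration):

```python
def attack_success_rate(outcomes):
    """ASR as a percentage. `outcomes` is one boolean per jailbreaking
    attempt: True if the attempt elicited a harmful response."""
    return 100.0 * sum(outcomes) / len(outcomes)
```

From the defender's perspective, lower ASR after alignment indicates a more robust model.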
Self+RM: A data creation strategy where the target model generates both chosen and rejected candidates, ranked by an external Reward Model
Reward Hacking: When a model learns to optimize the reward signal (or loss function) by exploiting flaws or shortcuts (like length or style) rather than achieving the intended goal (safety)
Linear Separability: A measure of how easily a simple linear classifier can distinguish between 'chosen' and 'rejected' examples in the data; high separability implies obvious, likely superficial differences
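One common way to measure this is a linear probe: fit a logistic-regression classifier on feature vectors of chosen vs. rejected examples and report its accuracy. The sketch below is a hypothetical stdlib-only version (SGD on log-loss); in practice one would probe sentence embeddings with an off-the-shelf classifier.

```python
import math
import random

def linear_probe_accuracy(chosen, rejected, steps=2000, lr=0.1):
    """Train a logistic-regression probe to separate chosen from rejected
    feature vectors; accuracy near 1.0 means high linear separability,
    i.e. the two classes differ in obvious, likely superficial ways."""
    data = [(x, 1.0) for x in chosen] + [(x, 0.0) for x in rejected]
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        x, y = random.choice(data)
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        p = 1.0 / (1.0 + math.exp(-z))   # predicted P(chosen)
        g = p - y                        # gradient of log-loss w.r.t. z
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g
    # Fraction of examples the trained linear probe classifies correctly.
    correct = sum(
        ((sum(wi * xi for wi, xi in zip(w, x)) + b) > 0) == (y == 1.0)
        for x, y in data
    )
    return correct / len(data)
```

If such a probe reaches near-perfect accuracy on raw preference data, the "chosen" signal may be carried by surface features (e.g. length or style), which is exactly the precondition for the reward hacking described above.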
SFT: Supervised Fine-Tuning—the initial training phase on high-quality instruction data before preference optimization
Jailbreaking: Adversarial attacks designed to bypass an LLM's safety filters and elicit harmful or restricted content