← Back to Paper List

More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment

Yifan Wang, Runjin Chen, Bolian Li, David Cho, Yihe Deng, Ruqi Zhang, Tianlong Chen, Zhangyang Wang, Ananth Grama, Junyuan Hong
Purdue University, The University of Texas at Austin, The University of North Carolina at Chapel Hill, University of California, Los Angeles
arXiv (2025)
RL Factuality Benchmark

📝 Paper Summary

LLM Safety Alignment Synthetic Data Generation
Aligning models using their own self-generated outputs yields significantly better safety profiles than using synthetic data from stronger models (like GPT-4o), which encourages reward hacking via superficial stylistic cues.
Core Problem
Common alignment strategies that pair responses from strong models (e.g., GPT-4o) with weaker model outputs create a large distribution shift, causing the target model to learn superficial cues rather than robust safety constraints.
Why it matters:
  • Synthetic preference data is the standard for scaling alignment, but improper data construction leads to high vulnerability against jailbreaking attacks
  • Current assumptions that 'better teacher models yield better students' fail in safety alignment, wasting computational resources on stronger models that actually degrade safety performance
  • Models achieving low training loss on multi-model data may deceptively appear aligned while remaining highly susceptible to adversarial prompts due to reward hacking
Concrete Example: When a target model is trained on pairs where GPT-4o provides the 'chosen' response and the model itself provides the 'rejected' one, the model learns to associate the 'chosen' label with GPT-4o's writing style or formatting (superficial features) rather than the safety content itself. Consequently, when attacked, the model mimics the style but fails to refuse the harmful request.
Key Novelty
Self-Referential Safety Alignment (Self+RM)
  • Decouples data generation from preference labeling to show that models learn safety best from their own output distribution
  • Demonstrates that 'stronger' synthetic data (from GPT-4o) creates high linear separability between chosen/rejected pairs, which paradoxically leads to worse safety by allowing the model to exploit easy shortcuts (reward hacking)
  • Establishes a 'sweet spot' of linear separability where the distinction between safe and unsafe responses is difficult enough to force the model to learn meaningful safety concepts rather than surface-level patterns
Evaluation Highlights
  • Self+RM (self-generated data) consistently achieves the lowest Attack Success Rate (ASR) on AdvBench compared to all multi-model strategies (GPT-4o+Self, Peer+RM) across Llama, Mistral, and Qwen families
  • GPT-4o+Self data leads to extremely rapid training convergence (near-zero loss) but fails to translate this into safety, indicating severe reward hacking
  • Self+RM matches the general capability performance (ARC, HellaSwag, MMLU) of multi-model approaches while providing superior safety, effectively debunking the trade-off assumption
Breakthrough Assessment
8/10
Counterintuitive and impactful finding that contradicts the common practice of using GPT-4 for distillation/alignment. Provides a clear, lower-cost alternative (self-generation) that improves safety.
×