
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea
University of Michigan, Harvard University, University of Sydney
arXiv (2024)

📝 Paper Summary

Mechanistic Interpretability AI Alignment Safety & Toxicity
DPO aligns models not by removing toxic capabilities, but by learning offset vectors that bypass toxic regions in the representation space, a mechanism easily reversed to restore toxicity.
Core Problem
While alignment algorithms like DPO reduce unwanted behaviors (e.g., toxicity), the internal mechanisms driving this reduction remain unknown, leaving models vulnerable to jailbreaks.
Why it matters:
  • Jailbreaks easily undo safety alignment, suggesting current methods may be superficial rather than fundamental fixes.
  • Without mechanistic understanding, we cannot predict failure modes or guarantee safety in deployed systems.
  • Prior work offers empirical hypotheses about why jailbreaks succeed, but lacks a causal explanation grounded in the model's internal weights and activations.
Concrete Example: A pre-trained model generates toxic text when prompted. After DPO alignment, it refuses. However, simply subtracting a specific vector from the model's weights reactivates the original toxic behavior, proving the capability was never removed.
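The example above can be sketched numerically. Everything here is hypothetical (random values, a synthetic "toxic" direction, toy dimensions); the point is only that an additive offset can cancel the toxic component of an activation while leaving the underlying direction, and hence the capability, untouched:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32  # toy hidden dimension

# Hypothetical quantities: a fixed "toxic" direction the model can write
# into the residual stream, and an activation that currently triggers it.
toxic_vec = rng.normal(size=d)
toxic_vec /= np.linalg.norm(toxic_vec)
h = 3.0 * toxic_vec + 0.1 * rng.normal(size=d)  # activation leaning toxic

# DPO-style bypass: learn an offset that cancels the toxic component of
# the activation; the toxic direction itself is left untouched.
offset = -(h @ toxic_vec) * toxic_vec
h_aligned = h + offset

toxic_score = float(h @ toxic_vec)            # large and positive
aligned_score = float(h_aligned @ toxic_vec)  # numerically zero

print(toxic_score > 1.0 and abs(aligned_score) < 1e-9)  # True

# "Un-alignment": subtracting the offset restores the original activation
# exactly -- the capability was never deleted, only routed around.
assert np.allclose(h_aligned - offset, h)
```

The paper's actual un-alignment edits model weights rather than a single activation, but the structure is the same: the toxic component is suppressed by an additive change that can be subtracted back out.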
Key Novelty
Bypassing Mechanism of DPO
  • Identifies specific 'toxic vectors' in the model's MLP layers that encode toxicity.
  • Demonstrates that DPO (Direct Preference Optimization) barely changes these toxic vectors but learns an 'offset' that steers activations around them.
  • Shows that alignment can be 'undone' by surgically re-activating these bypassed vectors, restoring toxicity without full fine-tuning.
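A minimal sketch of how such "toxic vectors" can be located: score each MLP output (value) vector by cosine similarity with a linear toxicity-probe direction. The probe, the dimensions, and the planted vectors below are all illustrative assumptions, not the paper's actual weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a unit "toxicity probe" direction learned on the
# residual stream, and the MLP output (value) vectors of one layer.
d_model, d_mlp = 64, 256
probe = rng.normal(size=d_model)
probe /= np.linalg.norm(probe)

W_out = rng.normal(size=(d_mlp, d_model))  # each row is one value vector

# Plant a few strongly probe-aligned ("toxic") value vectors.
toxic_ids = [3, 17, 42]
for i in toxic_ids:
    W_out[i] = 5.0 * probe + 0.1 * rng.normal(size=d_model)

# Rank value vectors by cosine similarity with the probe direction.
sims = W_out @ probe / np.linalg.norm(W_out, axis=1)
top = np.argsort(-sims)[:3]
print(sorted(top.tolist()))  # [3, 17, 42]: the planted toxic vectors
```

Because random vectors in high dimensions are nearly orthogonal to the probe, the planted vectors dominate the similarity ranking, which is the intuition behind reading off a small set of toxicity-writing neurons.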
Evaluation Highlights
  • Intervening with the identified toxic vectors on pre-trained GPT2 reduces the probability of toxic generations, validating their causal role.
  • DPO alignment effectively reduces toxicity, but shifting the model weights back by a simple scalar un-aligns the model, restoring high toxicity.
  • The 'un-alignment' method works by reversing the specific spectral components (singular vectors) that DPO altered, demonstrating that the alignment change is highly localized.
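The causal intervention in the first bullet can be sketched as projection ablation: remove the component of each activation along the identified toxic direction (or, with a negative coefficient, re-amplify it, mirroring un-alignment). This is an illustrative stand-in, not the paper's exact procedure:

```python
import numpy as np

def ablate_direction(acts, direction, alpha=1.0):
    """Remove alpha times the component of each activation along `direction`.

    alpha=1 zeroes out the direction (the causal test that toxicity drops);
    alpha<0 re-amplifies it, an inference-time analogue of "un-alignment".
    """
    d = direction / np.linalg.norm(direction)
    proj = acts @ d                        # (batch,) components along d
    return acts - alpha * np.outer(proj, d)

rng = np.random.default_rng(2)
toxic_dir = rng.normal(size=16)
# Toy batch of activations biased toward the toxic direction.
acts = rng.normal(size=(4, 16)) + 2.0 * toxic_dir / np.linalg.norm(toxic_dir)

cleaned = ablate_direction(acts, toxic_dir, alpha=1.0)
d_unit = toxic_dir / np.linalg.norm(toxic_dir)
print(np.allclose(cleaned @ d_unit, 0.0))  # True: toxic component removed
```

In a real model this edit would be applied via a forward hook on the relevant MLP activations rather than on a standalone array.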
Breakthrough Assessment
7/10
Provides a strong mechanistic explanation for why jailbreaks occur (capabilities are bypassed rather than removed), though experiments are limited to GPT2-medium and to toxicity as the target behavior.