
Are PPO-ed Language Models Hackable?

Suraj Anand, David Getzen
arXiv (2024)
RL · Factuality

📝 Paper Summary

AI Safety · Mechanistic Interpretability · Adversarial Attacks
PPO aligns models by learning a superficial wrapper that suppresses undesirable activations rather than removing them, allowing adversaries to restore negative behaviors by mechanically amplifying specific internal weights.
Core Problem
Reinforcement learning alignment methods like PPO (Proximal Policy Optimization) often fail to unlearn undesirable concepts, instead learning an offset that masks them, leaving models vulnerable to mechanistic jailbreaks.
Why it matters:
  • Aligned models may retain toxic or biased capabilities in their weights, creating a false sense of safety
  • Adversaries with white-box access could bypass safety filters by manipulating specific internal activations identified through interpretability techniques
  • Current reward modeling approaches may not sufficiently penalize the presence of latent negative knowledge, only its expression
Concrete Example: A GPT-2 model trained via PPO to generate positive movie reviews (average sentiment 0.80) still contains 'negative' value vectors. By manually scaling these vectors by 10x during inference, the model reverts to generating negative reviews (sentiment 0.43), effectively bypassing the PPO alignment.
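The amplification step can be sketched in a few lines of PyTorch. This is a toy illustration, not the paper's code: the tensor shapes match GPT-2's MLP output projection (where each row is a "value vector" written to the residual stream), but the weight matrix is random and the neuron indices are placeholders, not the ones the authors identified.

```python
import torch

# Toy stand-in for GPT-2's MLP output projection (c_proj):
# rows are value vectors written to the residual stream.
d_mlp, d_model = 3072, 768
c_proj = torch.randn(d_mlp, d_model)

neg_idx = [412, 1337]   # placeholder indices of 'negative' value vectors
scale = 10.0            # the paper's example scales these vectors by 10x

# The "hack": amplify the suppressed negative directions at inference time,
# leaving every other value vector untouched.
hacked = c_proj.clone()
hacked[neg_idx] *= scale
```

In the paper's setting the same scaling is applied inside the PPO-aligned model during generation, which is what drives the sentiment score from 0.80 back down to 0.43.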
Key Novelty
Mechanistic Jailbreak of PPO-Aligned Models
  • Uses mechanistic interpretability (linear probes and value-vector analysis) to locate specific weights responsible for negative sentiment in a pre-trained model
  • Demonstrates that PPO alignment preserves these negative weights (cosine similarity ≥ 0.9998) and merely learns to suppress their activation
  • Proposes a 'hack' that manually amplifies these suppressed negative vectors during inference to force the model to output negative sentiment despite alignment
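The probing step above can be sketched as a standard linear probe: fit a logistic-regression head on residual-stream activations to find a sentiment direction. The sketch below uses synthetic activations and labels (the `direction` vector and all sizes are assumptions for illustration); the paper probes real GPT-2 activations.

```python
import torch

torch.manual_seed(0)
d_model = 64

# Synthetic "activations" with a planted sentiment direction
# (assumed toy data; the paper uses real residual-stream activations).
direction = torch.randn(d_model)
X = torch.randn(256, d_model)
y = (X @ direction > 0).float()   # toy binary sentiment labels

# Linear probe = one logistic-regression layer.
probe = torch.nn.Linear(d_model, 1)
opt = torch.optim.SGD(probe.parameters(), lr=0.5)
loss_fn = torch.nn.BCEWithLogitsLoss()

for _ in range(500):
    opt.zero_grad()
    loss = loss_fn(probe(X).squeeze(-1), y)
    loss.backward()
    opt.step()

# Probe accuracy on the training activations; a high score indicates a
# linearly readable sentiment direction in the representation.
acc = ((probe(X).squeeze(-1) > 0).float() == y).float().mean()
```

Once a probe recovers a reliable direction, the weights most aligned with it are the candidates for value-vector analysis and amplification.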
Evaluation Highlights
  • PPO alignment successfully raised GPT-2's average sentiment score from 0.27 (baseline) to 0.80 on a held-out prompt set
  • The mechanistic 'hack' (scaling negative value vectors) reduced the PPO-aligned model's sentiment score from 0.80 down to 0.43
  • Post-PPO weights maintained a cosine similarity of ≥ 0.9998 with the original weights, confirming PPO makes minimal structural changes
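The weight-similarity check behind that last finding is a single cosine computation over flattened weight matrices. The sketch below uses random stand-ins for the pre- and post-PPO weights, with a small synthetic perturbation to mimic PPO's minimal updates; the reported ≥ 0.9998 figure comes from the paper, not from this toy.

```python
import torch

torch.manual_seed(0)

# Random stand-ins for a weight matrix before and after PPO fine-tuning
# (assumed toy data; the paper compares real GPT-2 checkpoints).
w_pre = torch.randn(3072, 768)
w_post = w_pre + 1e-3 * torch.randn_like(w_pre)   # tiny PPO-style update

# Cosine similarity between the flattened weight matrices; values near 1
# mean PPO barely moved the weights structurally.
cos = torch.nn.functional.cosine_similarity(
    w_pre.flatten(), w_post.flatten(), dim=0
)
```

A near-1 similarity on the very weights that encode negative sentiment is what supports the "offset, not unlearning" interpretation.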
Breakthrough Assessment
5/10
Provides a useful mechanistic confirmation that PPO learns offsets rather than unlearning, but the scope is limited to GPT-2 sentiment and the proposed defense (weight penalty) was unstable.