
Are PPO-ed Language Models Hackable?

Suraj Anand, David Getzen
arXiv (2024)
RL · Factuality

📝 Paper Summary

AI Safety · Mechanistic Interpretability · Adversarial Attacks
PPO aligns models by learning a superficial wrapper that suppresses undesirable activations rather than removing them, allowing adversaries to restore negative behaviors by mechanically amplifying specific internal weights.
Core Problem
Reinforcement learning alignment methods like PPO (Proximal Policy Optimization) often fail to unlearn undesirable concepts, instead learning an offset that masks them, leaving models vulnerable to mechanistic jailbreaks.
Why it matters:
  • Aligned models may retain toxic or biased capabilities in their weights, creating a false sense of safety
  • Adversaries with white-box access could bypass safety filters by manipulating specific internal activations identified through interpretability techniques
  • Current reward modeling approaches may not sufficiently penalize the presence of latent negative knowledge, only its expression
Concrete Example: A GPT-2 model trained via PPO to generate positive movie reviews (average sentiment 0.80) still contains 'negative' value vectors. By manually scaling these vectors by 10x during inference, the model reverts to generating negative reviews (sentiment 0.43), effectively bypassing the PPO alignment.
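The amplification step can be sketched in a few lines of PyTorch. This is a toy illustration, not the paper's code: the tensor shapes match GPT-2's MLP output projection (where each row is a "value vector" written to the residual stream), but the weight matrix is random and the neuron indices are placeholders, not the ones the authors identified.

```python
import torch

# Toy stand-in for GPT-2's MLP output projection (c_proj):
# rows are value vectors written to the residual stream.
d_mlp, d_model = 3072, 768
c_proj = torch.randn(d_mlp, d_model)

neg_idx = [412, 1337]   # placeholder indices of 'negative' value vectors
scale = 10.0            # the paper's example scales these vectors by 10x

# The "hack": amplify the suppressed negative directions at inference time,
# leaving every other value vector untouched.
hacked = c_proj.clone()
hacked[neg_idx] *= scale
```

In the paper's setting the same scaling is applied inside the PPO-aligned model during generation, which is what drives the sentiment score from 0.80 back down to 0.43.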
Key Novelty
Mechanistic Jailbreak of PPO-Aligned Models
  • Uses mechanistic interpretability (linear probes and value-vector analysis) to locate specific weights responsible for negative sentiment in a pre-trained model
  • Demonstrates that PPO alignment preserves these negative weights (cosine similarity ≥ 0.9998) and merely learns to suppress their activation
  • Proposes a 'hack' that manually amplifies these suppressed negative vectors during inference to force the model to output negative sentiment despite alignment
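The probing step above can be sketched as a standard linear probe: fit a logistic-regression head on residual-stream activations to find a sentiment direction. The sketch below uses synthetic activations and labels (the `direction` vector and all sizes are assumptions for illustration); the paper probes real GPT-2 activations.

```python
import torch

torch.manual_seed(0)
d_model = 64

# Synthetic "activations" with a planted sentiment direction
# (assumed toy data; the paper uses real residual-stream activations).
direction = torch.randn(d_model)
X = torch.randn(256, d_model)
y = (X @ direction > 0).float()   # toy binary sentiment labels

# Linear probe = one logistic-regression layer.
probe = torch.nn.Linear(d_model, 1)
opt = torch.optim.SGD(probe.parameters(), lr=0.5)
loss_fn = torch.nn.BCEWithLogitsLoss()

for _ in range(500):
    opt.zero_grad()
    loss = loss_fn(probe(X).squeeze(-1), y)
    loss.backward()
    opt.step()

# Probe accuracy on the training activations; a high score indicates a
# linearly readable sentiment direction in the representation.
acc = ((probe(X).squeeze(-1) > 0).float() == y).float().mean()
```

Once a probe recovers a reliable direction, the weights most aligned with it are the candidates for value-vector analysis and amplification.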
Evaluation Highlights
  • PPO alignment successfully raised GPT-2's average sentiment score from 0.27 (baseline) to 0.80 on a held-out prompt set
  • The mechanistic 'hack' (scaling negative value vectors) reduced the PPO-aligned model's sentiment score from 0.80 down to 0.43
  • Post-PPO weights maintained a cosine similarity of ≥ 0.9998 with the original weights, confirming PPO makes minimal structural changes
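The weight-similarity check behind that last finding is a single cosine computation over flattened weight matrices. The sketch below uses random stand-ins for the pre- and post-PPO weights, with a small synthetic perturbation to mimic PPO's minimal updates; the reported ≥ 0.9998 figure comes from the paper, not from this toy.

```python
import torch

torch.manual_seed(0)

# Random stand-ins for a weight matrix before and after PPO fine-tuning
# (assumed toy data; the paper compares real GPT-2 checkpoints).
w_pre = torch.randn(3072, 768)
w_post = w_pre + 1e-3 * torch.randn_like(w_pre)   # tiny PPO-style update

# Cosine similarity between the flattened weight matrices; values near 1
# mean PPO barely moved the weights structurally.
cos = torch.nn.functional.cosine_similarity(
    w_pre.flatten(), w_post.flatten(), dim=0
)
```

A near-1 similarity on the very weights that encode negative sentiment is what supports the "offset, not unlearning" interpretation.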
Breakthrough Assessment
5/10
Provides a useful mechanistic confirmation that PPO learns offsets rather than unlearning, but the scope is limited to GPT-2 sentiment and the proposed defense (weight penalty) was unstable.