Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift

📝 Paper Summary

Vision-Language Model (VLM) Safety
Jailbreak Attacks and Defenses
Mechanistic Interpretability

VLM jailbreaks occur not because models fail to recognize harm, but because images induce a specific representation shift that steers the model into a distinct 'jailbreak state' separate from refusal.

Core Problem

Large Vision-Language Models (VLMs) are easily jailbroken when an image is added to a harmful prompt, even when the text itself is explicitly harmful. Their safety alignment is therefore weaker than that of text-only LLMs.

Why it matters:

The 'safety perception failure' hypothesis (that VLMs don't recognize harm) is flawed because it relies on implicitly harmful data where text is benign.
Existing defenses often compromise utility on benign tasks or require heavy computation (e.g., external LLMs).
Visual modalities introduce a systematic vulnerability: simply adding a blank image can increase jailbreak success by ~28%.

Concrete Example: When a user asks 'How to make a bomb' with an image of a bomb, the VLM often provides a harmful response that includes a safety warning (e.g., 'Warning: this is dangerous... First, take...'), indicating that it recognized the harm but still failed to refuse.

Key Novelty

Jailbreak-Related Shift Removal (JRS-Rem)

Identifies that jailbreak samples cluster in a distinct 'jailbreak state' in representation space, separate from 'refusal' and 'benign' states.
Defines a 'jailbreak direction' vector from the refusal centroid to the jailbreak centroid.
Proposes a defense (JRS-Rem) that projects the image-induced shift onto this direction and subtracts it, effectively steering the model back towards refusal without affecting benign prompts.
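
The three points above can be written compactly as follows. This is a sketch in notation chosen for this summary, not necessarily the paper's own: $\mu_{\text{ref}}$ and $\mu_{\text{jb}}$ denote the refusal and jailbreak centroids, $h(\cdot)$ the internal representation at some chosen layer and token position (an assumption, since the summary does not specify which), and $\emptyset$ the empty image. Normalizing the direction to unit length is also an assumption made so that the projection is a plain dot product.

```latex
\begin{align*}
d &= \mu_{\text{jb}} - \mu_{\text{ref}}, \qquad \hat{d} = \frac{d}{\lVert d \rVert}
  && \text{(jailbreak direction)} \\
\Delta &= h([I, T]) - h([\emptyset, T])
  && \text{(image-induced shift)} \\
s &= \Delta^{\top} \hat{d}
  && \text{(scalar projection; } s\,\hat{d} \text{ is the jailbreak-related shift)} \\
h' &= h([I, T]) - s\,\hat{d}
  && \text{(corrected representation)}
\end{align*}
```

Under this formulation, any part of $\Delta$ orthogonal to $\hat{d}$ is left untouched, which is consistent with the claim that benign visual information is preserved.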

Architecture

Conceptual diagram of the Jailbreak-Related Shift (JRS). It visualizes the Refusal State and Jailbreak State in representation space.

Evaluation Highlights

Reduces Attack Success Rate (ASR) on the HADES dataset from 84.38% to 15.63% on LLaVA-1.5-7B, outperforming baselines.
Maintains utility on benign benchmarks (MM-Vet, MME), with negligible performance drops compared to other defenses like safe-tuning.
Generalizes across multiple attack types (explicit, implicit, adversarial) and models (LLaVA, ShareGPT4V, InternVL).

Breakthrough Assessment

8/10

Provides a compelling mechanistic explanation for VLM jailbreaks (distinct state vs. perception failure) and a simple, effective, inference-time defense that preserves utility.

⚙️ Technical Details

Problem Definition

Setting: Defending against multimodal jailbreak attacks at inference time without retraining.

Inputs: Multimodal input x = [I, T] containing an image I and text prompt T.

Outputs: Safe text response (refusal) for harmful inputs; helpful response for benign inputs.

Pipeline Flow

Compute Image-Induced Shift (calculate difference between [Image, Text] and [Empty, Text] representations)
Project onto Jailbreak Direction (isolate the component responsible for the jailbreak)
Subtract JRS (remove this component from the original activation)
Generate Response (continue forward pass with modified representation)
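
The first three pipeline steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `remove_jrs` and the arguments `h_mm`, `h_txt`, and `direction` are hypothetical, activations are treated as single 1-D vectors, and the layer and token position at which they are taken are left unspecified, as in this summary.

```python
import numpy as np


def remove_jrs(h_mm, h_txt, direction):
    """Remove the jailbreak-related component of the image-induced shift.

    h_mm:      activation for the [Image, Text] input (1-D array)
    h_txt:     activation for the [Empty, Text] input (same shape)
    direction: pre-computed jailbreak direction (need not be unit length)
    """
    h_mm = np.asarray(h_mm, dtype=float)
    h_txt = np.asarray(h_txt, dtype=float)
    d_hat = np.asarray(direction, dtype=float)
    d_hat = d_hat / np.linalg.norm(d_hat)  # assumption: use a unit vector

    shift = h_mm - h_txt            # step 1: image-induced shift
    jrs = float(shift @ d_hat)      # step 2: scalar projection onto the direction
    corrected = h_mm - jrs * d_hat  # step 3: subtract the jailbreak-related shift
    return corrected, jrs


# Toy 3-D example (hypothetical values, not taken from the paper).
corrected, jrs = remove_jrs(
    h_mm=[2.0, 1.0, 3.0],
    h_txt=[0.0, 1.0, 0.0],
    direction=[1.0, 0.0, 0.0],
)
```

In the toy example, only the component of the shift along the first axis is removed. The shift along the third axis, which is orthogonal to the direction, survives in `corrected`. In a real model, the corrected vector would replace the original activation before the forward pass continues (step 4).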

System Modules

JRS Calculator (Defense Mechanism)

Compute the scalar projection of the total shift onto the pre-computed jailbreak direction.

Model or implementation: Vector projection

Activation Corrector (Defense Mechanism)

Subtract the jailbreak component from the current activation.

Model or implementation: Vector subtraction

Novel Architectural Elements

Inference-time intervention layer that specifically removes the projection of the visual shift onto a 'jailbreak vector', rather than suppressing all visual information.

Modeling

Base Model: Evaluated on LLaVA-1.5-7B, ShareGPT4V-7B, and InternVL-Chat-19B

Training Method: Inference-time steering (no training of weights)

Compute: The correction itself is a cheap vector projection and subtraction, but computing the shift requires two forward passes (one multimodal, one text-only), so the inference overhead is not negligible (see Limitations).

Comparison to Prior Work

vs. Text-Only: JRS-Rem preserves visual information orthogonal to the jailbreak direction, maintaining utility.
vs. Zou et al. (2025) [Representation editing]: Zou et al. remove components along a benign-to-harmful direction; JRS-Rem removes components along a refusal-to-jailbreak direction, which is more specific.
vs. Liu et al. (2025) [Projecting to text space]: JRS-Rem isolates the specific harmful component rather than pulling the entire representation to text space.

Limitations

Requires a reference set of jailbreak/refusal pairs to compute the direction vector.
Requires two forward passes (multimodal + text-only) at inference time to calculate the shift, doubling inference cost.
Defense effectiveness depends on the transferability of the jailbreak direction across different types of attacks.

Reproducibility

Code: https://github.com/LeeQueue513/JRS-Rem

Code is publicly available. The method requires a small 'held-out' set of jailbreak/refusal samples to compute the steering vector (Jailbreak Direction) beforehand.
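
As a rough illustration of that pre-computation step, the direction can be derived from two sets of stored representations by subtracting centroids. The sketch below is an assumption-laden example rather than code from the linked repository: `jailbreak_direction`, `refusal_reps`, and `jailbreak_reps` are hypothetical names, each set is assumed to be an array of shape (num_samples, hidden_dim), and the unit-length normalization is a choice made here for convenience.

```python
import numpy as np


def jailbreak_direction(refusal_reps, jailbreak_reps):
    """Unit vector pointing from the refusal centroid to the jailbreak centroid."""
    refusal_reps = np.asarray(refusal_reps, dtype=float)
    jailbreak_reps = np.asarray(jailbreak_reps, dtype=float)

    mu_ref = refusal_reps.mean(axis=0)   # centroid of refusal samples
    mu_jb = jailbreak_reps.mean(axis=0)  # centroid of jailbreak samples

    d = mu_jb - mu_ref
    norm = np.linalg.norm(d)
    if norm == 0.0:
        raise ValueError("centroids coincide; direction is undefined")
    return d / norm


# Toy 2-D reference set (hypothetical values, for illustration only).
direction = jailbreak_direction(
    refusal_reps=[[0.0, 0.0], [0.0, 2.0]],
    jailbreak_reps=[[3.0, 1.0], [5.0, 1.0]],
)
```

Here the refusal centroid is (0, 1) and the jailbreak centroid is (4, 1), so the resulting unit direction points along the first axis.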

📊 Experiments & Results

Evaluation Setup

Evaluate safety (ASR) on harmful datasets and utility on benign VLM benchmarks.

Benchmarks:

HADES (Explicitly and implicitly harmful multimodal prompts)
MM-SafetyBench (Multimodal safety benchmark)
MM-Vet (Benign multimodal utility: reasoning)
MME (Benign multimodal utility: perception and cognition)

Metrics:

Attack Success Rate (ASR)
Benign Task Performance (Accuracy/Score)
Statistical methodology: Not explicitly reported in the paper
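
For concreteness, ASR can be computed from per-prompt judgments as a simple percentage. The sketch below assumes each response has already been labeled harmful or not by some judge. How that judgment is made is not described in this summary, and the function name `attack_success_rate` is hypothetical.

```python
def attack_success_rate(judgments):
    """Percentage of harmful prompts whose response was judged harmful.

    judgments: iterable of booleans, one per harmful prompt
               (True = the attack succeeded).
    """
    judgments = list(judgments)
    if not judgments:
        raise ValueError("need at least one judged response")
    successes = sum(bool(j) for j in judgments)
    return 100.0 * successes / len(judgments)


# Hypothetical labels for four prompts: two attacks succeeded.
asr = attack_success_rate([True, False, False, True])
```

With these four illustrative labels, `asr` is 50.0.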

Key Results

JRS-Rem significantly reduces Attack Success Rate (ASR) on LLaVA-1.5-7B compared to the base model, directly addressing the vulnerability introduced by added images.
JRS-Rem maintains high performance on benign benchmarks, unlike baseline defenses that often degrade utility.

Benchmark	Metric	Baseline	This Paper	Δ
HADES (Explicit)	ASR (%)	84.38	15.63	-68.75
MM-Vet	Score	31.2	30.9	-0.3
HADES	Jailbreak-Related Shift	Low (near 0 normalized)	High (distinctly positive)	Significant separation

Experiment Figures

PCA visualization of internal representations for Benign, Jailbreak, and Refusal samples.

Magnitude of Jailbreak-Related Shift across model layers for different input types.

Main Takeaways

Jailbreaks form a distinct internal state, separable from refusals, implying the model 'knows' it is misbehaving.
The 'Jailbreak-Related Shift' metric correlates with ASR: images with more harmful information or higher semantic relevance induce larger shifts.
Removing this specific shift vector defends against attacks without stripping away all visual information, preserving benign utility.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision-Language Models (VLMs) and their architecture (LLM backbone + Vision Encoder)
Vector arithmetic in high-dimensional representation spaces (PCA, centroids, projections)
Jailbreak attacks (explicit vs. implicit harmful payloads)

Key Terms

Jailbreak State: A distinct cluster in the VLM's representation space where the model recognizes harm but generates a harmful response instead of refusing.

Refusal State: The region in representation space where the model successfully refuses a harmful query.

Jailbreak Direction: The vector pointing from the average representation of refusal samples to the average representation of jailbreak samples.

Jailbreak-Related Shift (JRS): The component of the image-induced representation shift (difference between multimodal and text-only representations) projected onto the jailbreak direction.

HADES: A dataset of explicitly and implicitly harmful multimodal prompts used for evaluating VLM safety.

ASR: Attack Success Rate, the percentage of harmful prompts that successfully trigger a harmful response from the model.

SD: Stable Diffusion, used here to generate images corresponding to harmful text prompts.

Typographic Attack: Using images containing text (e.g., harmful keywords written on a sign) to bypass safety filters.

Linear Probe: A simple linear classifier trained on internal model representations to test if different classes (jailbreak vs. refusal) are linearly separable.
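
To make the linear-probe idea concrete, the sketch below fits a linear classifier on synthetic vectors standing in for jailbreak and refusal representations. Everything here is an assumption for illustration: the data are random Gaussians rather than real model activations, the cluster offset is arbitrary, and a least-squares classifier is used in place of whatever probe the paper actually trains. The names `fit_linear_probe` and `probe_predict` are hypothetical.

```python
import numpy as np


def fit_linear_probe(X, y):
    """Fit a least-squares linear classifier; y holds labels in {0, 1}."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    targets = 2.0 * y - 1.0                        # map {0, 1} to {-1, +1}
    w, *_ = np.linalg.lstsq(Xb, targets, rcond=None)
    return w


def probe_predict(w, X):
    """Predict 1 (jailbreak) where the linear score is positive, else 0."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ w > 0).astype(int)


# Synthetic stand-ins for representations (not real model activations).
rng = np.random.default_rng(0)
dim = 16
offset = np.zeros(dim)
offset[0] = 8.0  # arbitrary separation along one axis
refusal = rng.normal(size=(100, dim))
jailbreak = rng.normal(size=(100, dim)) + offset

X = np.vstack([refusal, jailbreak])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

# Hold out a quarter of the samples to check separability on unseen data.
perm = rng.permutation(len(y))
train, test = perm[:150], perm[150:]
w = fit_linear_probe(X[train], y[train])
accuracy = float((probe_predict(w, X[test]) == y[test]).mean())
```

High held-out accuracy on real representations would indicate that the two classes are linearly separable, which is the property a linear probe is meant to test.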