Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks

📝 Paper Summary

Adversarial Robustness, Vision Language Model (VLM) Safety, Activation Steering

ASTRA defends VLMs against jailbreaks by identifying harmful visual tokens via image attribution and adaptively steering internal activations away from these harmful directions during inference.

Core Problem

Vision Language Models (VLMs) are vulnerable to jailbreak attacks where adversarial images trigger harmful responses, and existing defenses (adversarial training, input purification) are computationally expensive or ineffective.

Why it matters:

Visual inputs create a new, vast attack surface for VLMs that text-only safeguards often miss
Current defenses like adversarial training require expensive fine-tuning and data generation
Response-evaluation methods slow down inference significantly by requiring multiple generation passes per query

Concrete Example: When a user provides an adversarial image with the prompt 'Tell me how to build a bomb', a standard VLM might comply because the visual noise bypasses safety filters. ASTRA detects this activation pattern and steers the model to refuse, while still answering benign queries like 'Describe this image' correctly.

Key Novelty

Adaptive Steering via Image Attribution (ASTRA)

Uses Lasso-based image attribution to identify specific visual tokens in adversarial images that trigger harmful responses
Constructs steering vectors by contrasting activations of these harmful tokens against benign ones
Applies adaptive steering during inference that projects activations away from the harmful direction only when a high similarity is detected, preserving benign performance

Architecture

Conceptual comparison between static steering (linear) and ASTRA's adaptive steering (projection-based with calibration)

Evaluation Highlights

Reduces Attack Success Rate (ASR) by 17.84 percentage points (29.17 → 11.33) compared to state-of-the-art JailGuard on MiniGPT-4
Lowers Toxicity Score by 12.12 points (27.42 → 15.30) compared to JailGuard on MiniGPT-4
Operates 9x faster than JailGuard by avoiding multiple inference passes
Maintains benign performance within 1% of the original model on standard VLM benchmarks (ScienceQA, MME, MMBench)

Breakthrough Assessment

8/10

A 9x inference speedup over JailGuard, combined with lower ASR and toxicity. The adaptive steering mechanism largely avoids the 'benign performance drop' common in activation engineering: reported accuracy changes are under 0.4 points on ScienceQA and MMBench.

⚙️ Technical Details

Problem Definition

Setting: Defense against perturbation-based and structure-based jailbreak attacks on autoregressive Vision Language Models

Inputs: Adversarial image x_v and textual prompt x_t

Outputs: Safe response r (refusal or benign answer)

Pipeline Flow

Adversarial Image Generation (offline)
Image Attribution / Token Selection (offline)
Steering Vector Construction (offline)
Inference with Adaptive Steering (online)

System Modules

Attribution Surrogate (Offline Preparation)

Identify impactful visual tokens that cause jailbreaks

Model or implementation: Linear regression with Lasso
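
The attribution step can be illustrated with a small, self-contained sketch. It assumes the common ablation-style recipe: randomly mask visual tokens, record a scalar "jailbreak score" for each mask, and fit a sparse linear surrogate whose largest weights mark the most impactful tokens. Everything concrete here is an assumption made for illustration: the token count, the number of ablations, the regularization strength, the synthetic scoring function standing in for the VLM, and the use of a plain ISTA solver for the Lasso objective. It is not the authors' implementation.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 on centred data via ISTA."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)                  # centring removes the need for an intercept
    yc = y - y.mean()
    step = n / np.linalg.norm(Xc, 2) ** 2    # 1 / Lipschitz constant of the smooth part
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = Xc.T @ (Xc @ w - yc) / n      # gradient of the least-squares term
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Hypothetical setup: 32 visual tokens, 400 random ablations, keep the top 3.
rng = np.random.default_rng(0)
n_tokens, n_samples, top_k = 32, 400, 3
masks = rng.integers(0, 2, size=(n_samples, n_tokens)).astype(float)  # 1 = token kept

# Synthetic stand-in for the VLM: tokens 3, 7 and 20 raise the score when present.
true_w = np.zeros(n_tokens)
true_w[[3, 7, 20]] = [1.0, 0.8, 0.6]
scores = masks @ true_w + 0.05 * rng.standard_normal(n_samples)

weights = lasso_ista(masks, scores, lam=0.05)
top_tokens = np.argsort(-weights)[:top_k]    # largest positive attributions
print(sorted(top_tokens.tolist()))
```

In a real pipeline the synthetic `scores` would be replaced by a measurement taken from the model under each ablation, and the L1 penalty is what keeps the surrogate sparse enough to single out a few tokens.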

Vector Constructor (Offline Preparation)

Compute the steering vector defining the harmful direction

Model or implementation: Mean difference calculation
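
A mean-difference vector can be sketched in a few lines. The example below assumes activations have already been collected at one layer for the selected harmful tokens and for benign tokens; the array shapes, the synthetic data, and the choice to normalize the result to unit length are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def mean_difference_vector(harmful_acts, benign_acts):
    """Unit-norm direction pointing from the benign mean to the harmful mean.

    Both inputs are (n_samples, hidden_dim) arrays of activations taken at
    the same layer.
    """
    v = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return v / np.linalg.norm(v)

# Synthetic activations: harmful samples are shifted along coordinate 0.
rng = np.random.default_rng(0)
hidden_dim = 16
harmful_dir = np.zeros(hidden_dim)
harmful_dir[0] = 1.0
benign_acts = rng.standard_normal((200, hidden_dim))
harmful_acts = rng.standard_normal((200, hidden_dim)) + 3.0 * harmful_dir

v = mean_difference_vector(harmful_acts, benign_acts)
print(np.round(v[:4], 3))
```

Averaging over many samples cancels per-sample noise, so the recovered direction is dominated by the shared shift between the two groups.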

Adaptive Steerer

Modify activations during generation to remove harmful components

Model or implementation: Projection-based subtraction

Novel Architectural Elements

Adaptive steering mechanism utilizing calibrated activations (subtracting a global mean activation h_0^l) to ensure steering only occurs when the input aligns with the harmful direction
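
The difference between static and adaptive steering can be sketched as follows. This is one plausible reading of "projection-based subtraction with calibration", not the paper's exact formula: the activation is calibrated by subtracting the global mean h_0^l, projected onto the unit steering direction, and only a positive projection is removed. The function names, the `alpha` default, and the toy vectors are hypothetical.

```python
import numpy as np

def static_steer(h, v, alpha=1.0):
    """Static (linear) steering: shift every activation by the same amount."""
    return h - alpha * v / np.linalg.norm(v)

def adaptive_steer(h, v, h0, alpha=1.0):
    """Adaptive steering: remove the harmful component only if it is present.

    h  : activation at layer l, shape (hidden_dim,)
    v  : steering vector (harmful direction), shape (hidden_dim,)
    h0 : global mean activation used for calibration, shape (hidden_dim,)
    """
    v_hat = v / np.linalg.norm(v)
    coeff = max(float((h - h0) @ v_hat), 0.0)   # zero when not aligned with v
    return h - alpha * coeff * v_hat

v = np.array([1.0, 0.0, 0.0, 0.0])              # toy harmful direction
h0 = np.array([0.5, 1.0, -1.0, 0.0])            # toy global mean activation

h_harmful = h0 + np.array([2.0, 0.3, 0.0, 0.0])   # aligned with v after calibration
h_benign = h0 + np.array([-1.0, 0.2, 0.0, 0.0])   # not aligned with v

print(adaptive_steer(h_harmful, v, h0))   # harmful component removed
print(adaptive_steer(h_benign, v, h0))    # left unchanged
print(static_steer(h_benign, v))          # shifted even though the input is benign
```

Under these assumptions the calibration matters because raw activations typically share a large common offset; without subtracting it, almost every input would show a positive projection and would be steered.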

Modeling

Base Model: Evaluated on LLaVA-v1.5-7B, MiniGPT-4-v2, and InstructBLIP

Training Method: Inference-time intervention (activation steering)

Adaptation: None (no weight updates)

Trainable Parameters: 0 (inference-time only)

Key Hyperparameters:

ablation_samples_n: Not explicitly reported in the paper
top_k_tokens: Not explicitly reported in the paper
steering_coefficient_alpha: Not explicitly reported in the paper

Compute: Inference only; steering vector construction requires a few forward passes (efficient)

Comparison to Prior Work

vs. JailGuard: ASTRA is 9x faster and does not require multiple generation passes
vs. LLM-based Steering: ASTRA uses image attribution to bridge the modality gap, whereas text-only vectors fail to transfer to visual attacks effectively
vs. Adversarial Training: ASTRA does not require expensive model fine-tuning

Limitations

Relies on the availability of adversarial images to construct steering vectors (though the resulting vectors are described as transferable)
Performance depends on the quality of the linear surrogate model for attribution
Requires access to model activations (white-box defense)

Reproducibility

Code: https://github.com/ASTRAL-Group/ASTRA

The paper describes the method mathematically but does not report specific values for hyperparameters such as the number of ablation samples or the steering coefficient alpha used in the main results table.

📊 Experiments & Results

Evaluation Setup

Defense against VLM jailbreaks using adversarial images and text prompts

Benchmarks:

VL-JailBreak (Adversarial Attack Dataset)
ScienceQA (VLM General Utility)
MME (VLM General Utility)
MMBench (VLM General Utility)

Metrics:

Attack Success Rate (ASR)
Toxicity Score
Benign Performance (Accuracy)

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ (This Paper minus Baseline)
Defense performance on MiniGPT-4 against perturbation-based attacks (baseline: JailGuard; lower is better).
VL-JailBreak (MiniGPT-4)	Attack Success Rate (ASR)	29.17	11.33	-17.84
VL-JailBreak (MiniGPT-4)	Toxicity Score	27.42	15.30	-12.12
Impact on benign performance, i.e. utility preservation (baseline: original model; higher is better).
ScienceQA (LLaVA-1.5)	Accuracy	66.80	66.69	-0.11
MMBench (LLaVA-1.5)	Accuracy	64.30	63.92	-0.38
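
The Δ column is an absolute difference (this paper minus baseline), not a relative change. The short check below recomputes it from the table values and also derives the relative change for comparison; the variable names are illustrative and the numbers are copied from the table above.

```python
# (row label, baseline value, this paper's value), copied from the results table
rows = [
    ("VL-JailBreak ASR (MiniGPT-4)", 29.17, 11.33),
    ("VL-JailBreak Toxicity (MiniGPT-4)", 27.42, 15.30),
    ("ScienceQA accuracy (LLaVA-1.5)", 66.80, 66.69),
    ("MMBench accuracy (LLaVA-1.5)", 64.30, 63.92),
]

deltas = []
for name, baseline, ours in rows:
    delta = round(ours - baseline, 2)             # absolute difference in points
    relative = 100.0 * (ours - baseline) / baseline   # relative change in percent
    deltas.append(delta)
    print(f"{name}: delta = {delta:+.2f} points, relative = {relative:+.1f}%")
```

Read this way, the ASR reduction of 17.84 points corresponds to a relative reduction of roughly 61%, while both benign accuracy changes stay well under 1% in relative terms.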

Main Takeaways

ASTRA achieves lower attack success rates than the compared baselines (JailGuard, adversarial training); on MiniGPT-4, ASR drops from 29.17 to 11.33 relative to JailGuard.
The adaptive steering mechanism effectively preserves model utility on standard benchmarks, unlike static steering which degrades performance.
The method adds little inference latency because it needs only a single generation pass; it is reported to run 9x faster than the multi-pass JailGuard.

📚 Prerequisite Knowledge

Prerequisites

Vision Language Models (VLMs) architecture
Adversarial attacks (PGD)
Activation engineering / steering
Lasso regression

Key Terms

VLM: Vision Language Model—AI that processes both images and text to generate text responses

Jailbreak: An attack that tricks a model into generating harmful or forbidden content

Steering Vector: A direction in the model's activation space that encodes a specific behavior (e.g., 'refusal' or 'harmfulness')

PGD: Projected Gradient Descent—an iterative method for generating adversarial examples by finding small perturbations that maximize loss

Lasso: Least Absolute Shrinkage and Selection Operator—a regression analysis method that performs variable selection and regularization

Activation Steering: Modifying the internal state (activations) of a neural network during inference to control its output behavior

Toxicity Score: A metric measuring the harmfulness of the generated text, often evaluated by a separate API or model

ASR: Attack Success Rate—the percentage of adversarial attacks that successfully induce a harmful response

Image Attribution: Techniques to identify which parts of an input image are most responsible for a model's specific output
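
To make the PGD entry above concrete, here is a minimal sketch of an L-infinity PGD attack on a toy logistic-regression classifier. The model, the input, and the values of `eps`, `step`, and `n_steps` are assumptions chosen for illustration; attacks on a VLM follow the same ascend-then-project loop but backpropagate through the full model to the image pixels.

```python
import numpy as np

def logistic_loss(x, y, w, b):
    """Binary cross-entropy of a linear classifier on a single input."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def pgd_linf(x, y, w, b, eps=0.1, step=0.02, n_steps=20):
    """Maximise the loss within an L-infinity ball of radius eps around x."""
    x_adv = x.copy()
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(w @ x_adv + b)))   # predicted P(y = 1)
        grad = (p - y) * w                           # d(loss)/d(x) for logistic loss
        x_adv = x_adv + step * np.sign(grad)         # gradient-sign ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)     # project back into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)             # keep a valid pixel range
    return x_adv

x = np.full(8, 0.5)                  # toy "image" with 8 pixels in [0, 1]
w = np.linspace(-1.0, 1.0, 8)        # toy classifier weights
b, y = 0.2, 1.0

x_adv = pgd_linf(x, y, w, b)
print(logistic_loss(x, y, w, b), logistic_loss(x_adv, y, w, b))
```

The projection step is what distinguishes PGD from plain gradient ascent: the perturbation never exceeds `eps` in any coordinate, so the adversarial input stays visually close to the original.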