
Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks

Hanru Wang, Ganghui Wang, Huan Zhang
University of Illinois Urbana-Champaign
Computer Vision and Pattern Recognition (2024)
MM Benchmark

📝 Paper Summary

Adversarial Robustness, Vision Language Model (VLM) Safety, Activation Steering
ASTRA defends VLMs against jailbreaks by identifying harmful visual tokens via image attribution and adaptively steering internal activations away from these harmful directions during inference.
Core Problem
Vision Language Models (VLMs) are vulnerable to jailbreak attacks where adversarial images trigger harmful responses, and existing defenses (adversarial training, input purification) are computationally expensive or ineffective.
Why it matters:
  • Visual inputs create a new, vast attack surface for VLMs that text-only safeguards often miss
  • Current defenses like adversarial training require expensive fine-tuning and data generation
  • Response-evaluation methods slow down inference significantly by requiring multiple generation passes per query
Concrete Example: When a user provides an adversarial image with the prompt 'Tell me how to build a bomb', a standard VLM might comply because the visual noise bypasses safety filters. ASTRA detects the harmful activation pattern that the image induces and steers the model to refuse, while still correctly answering benign queries like 'Describe this image'.
Key Novelty
Adaptive Steering via Image Attribution (ASTRA)
  • Uses Lasso-based image attribution to identify specific visual tokens in adversarial images that trigger harmful responses
  • Constructs steering vectors by contrasting activations of these harmful tokens against benign ones
  • Applies adaptive steering during inference: activations are projected away from the harmful direction only when they are highly similar to it, which preserves benign performance
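The attribution step in the first bullet can be illustrated with a small sketch. This is not the authors' implementation: it assumes attribution works by randomly ablating visual tokens, scoring each ablated input, and fitting a sparse linear surrogate to those scores. The real VLM is replaced by a hypothetical `toy_harm_score` function in which tokens 2 and 5 are the harmful ones, and the Lasso solver is a plain coordinate-descent routine written here only to keep the example self-contained.

```python
import random

NUM_TOKENS = 8      # hypothetical number of visual tokens
NUM_MASKS = 200     # number of random ablations to sample
rng = random.Random(0)

def toy_harm_score(mask):
    """Hypothetical stand-in for a model-derived harmfulness score.

    Tokens 2 and 5 drive the score; a real setup would query the VLM
    with only the tokens where mask[j] == 1 kept.
    """
    return 2.0 * mask[2] + 1.5 * mask[5] + rng.gauss(0.0, 0.05)

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, alpha=0.05, n_sweeps=50):
    """Minimise (1/2n)||y - Xw||^2 + alpha*||w||_1 on centred data."""
    n, d = len(X), len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_mean[j] for j in range(d)] for row in X]
    residual = [yi - y_mean for yi in y]  # residual when all weights are 0
    w = [0.0] * d
    for _ in range(n_sweeps):
        for j in range(d):
            col = [Xc[i][j] for i in range(n)]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue  # constant column carries no information
            # Correlation of column j with the residual that excludes feature j.
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha) / z
            delta = new_w - w[j]
            residual = [r - delta * c for r, c in zip(residual, col)]
            w[j] = new_w
    return w

# Sample random keep/ablate masks and score each ablated input.
masks = [[rng.randint(0, 1) for _ in range(NUM_TOKENS)] for _ in range(NUM_MASKS)]
scores = [toy_harm_score(m) for m in masks]

# Sparse surrogate: large positive weights mark tokens attributed to harm.
weights = lasso_coordinate_descent(masks, scores)
top_tokens = sorted(range(NUM_TOKENS), key=lambda j: weights[j], reverse=True)[:2]
print("attributed tokens:", sorted(top_tokens))
```

The L1 penalty is what makes the surrogate useful here: it drives the weights of irrelevant tokens toward zero, so only a few tokens are flagged as candidates for building a steering vector.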
Architecture
Figure 2: Conceptual comparison between static steering (linear) and ASTRA's adaptive steering (projection-based with calibration)
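The contrast in Figure 2 can be made concrete with a minimal sketch. It assumes the steering vector is the normalized difference of mean activations, that static steering subtracts a fixed multiple of it, and that adaptive steering subtracts only the positive projection of the calibrated activation onto it. The function names, the calibration activation `h0` (taken here as the mean benign activation), and the 2-D toy vectors are illustrative assumptions rather than the paper's exact formulation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def build_steering_vector(harmful_acts, benign_acts):
    """Unit vector pointing from mean benign to mean harmful activation."""
    diff = [h - b for h, b in zip(mean_vector(harmful_acts), mean_vector(benign_acts))]
    return unit(diff)

def steer_static(h, v_hat, alpha=1.0):
    """Static (linear) steering: always shift by the same amount."""
    return [x - alpha * d for x, d in zip(h, v_hat)]

def steer_adaptive(h, v_hat, h0, alpha=1.0):
    """Adaptive steering: remove the harmful component only if it is present.

    The activation is first calibrated by subtracting h0, then its
    projection onto v_hat is clipped at zero so benign inputs are untouched.
    """
    calibrated = [x - c for x, c in zip(h, h0)]
    strength = max(0.0, dot(calibrated, v_hat))
    return [x - alpha * strength * d for x, d in zip(h, v_hat)]

# Toy 2-D activations (hypothetical values for illustration only).
harmful_acts = [[2.0, 0.0], [2.0, 0.2]]
benign_acts = [[0.0, 0.0], [0.0, 0.2]]
v_hat = build_steering_vector(harmful_acts, benign_acts)
h0 = mean_vector(benign_acts)

benign_h = [-0.5, 1.0]   # no component along the harmful direction
harmful_h = [3.0, 1.0]   # strong component along the harmful direction

static_benign = steer_static(benign_h, v_hat)
adaptive_benign = steer_adaptive(benign_h, v_hat, h0)
adaptive_harmful = steer_adaptive(harmful_h, v_hat, h0)
print(static_benign, adaptive_benign, adaptive_harmful)
```

In this toy setting static steering shifts the benign activation even though it carries no harmful component, while the adaptive variant leaves it unchanged and removes the harmful component only from the second input.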
Evaluation Highlights
  • Reduces Attack Success Rate (ASR) by 17.84% compared to state-of-the-art JailGuard on MiniGPT-4
  • Lowers Toxicity Score by 12.12% compared to JailGuard on MiniGPT-4
  • Operates 9x faster than JailGuard by avoiding multiple inference passes
  • Maintains benign performance within 1% of the original model on standard VLM benchmarks (ScienceQA, MME, MMBench)
Breakthrough Assessment
8/10
Significant efficiency gain (9x faster than JailGuard) while improving robustness. The adaptive steering mechanism addresses the 'benign performance drop' problem common in activation engineering, keeping benign performance within 1% of the original model.