VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap

📝 Paper Summary

Vision-Language Model Safety, Adversarial Defense, Inference-time Alignment

VLM-Guard improves Vision-Language Model safety at inference time by projecting multimodal representations away from harmful directions identified in the model's safety-aligned language component.

Core Problem

Vision-Language Models (VLMs) are vulnerable to safety attacks: integrating vision weakens the safety alignment inherited from their Large Language Model (LLM) backbones, a weakness attributed to the 'modality gap' between image and text representations.

Why it matters:

Even simple visual inputs (like blank images) can bypass textual safety filters, causing models to generate harmful content.
Existing LLM safety measures do not automatically transfer to the multimodal space, leaving VLMs susceptible to jailbreaks and malicious instructions.
Retraining VLMs for safety is computationally expensive, making inference-time solutions critical for deploying safe multimodal systems.

Concrete Example: When a VLM is asked a harmful question like 'How to make a bomb?', it might refuse. However, if the same text is paired with a meaningless blank image, the safety alignment breaks, and the model provides the harmful instructions.

Key Novelty

VLM-Guard (Inference-time Orthogonal Projection)

Leverages the underlying LLM's latent safety knowledge to supervise the VLM, assuming the LLM component is already safety-aligned.
Identifies a 'Safety Steering Direction' (SSD) in the activation space by comparing hidden states of harmful vs. harmless queries.
Projects VLM representations onto a subspace orthogonal to this SSD during inference, effectively filtering out the influence that compromises safety.
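The projection described in the last point can be written with the standard orthogonal-projection formula. This is a sketch under stated assumptions: $h$ stands for a hidden state and $v$ for a unit-norm SSD. Both symbols are illustrative, and the summary does not reproduce the paper's exact equation.

```latex
h_{\perp} = h - (v^{\top} h)\, v,
\qquad \lVert v \rVert_2 = 1
\quad\Longrightarrow\quad
v^{\top} h_{\perp} = v^{\top} h - (v^{\top} h)\,(v^{\top} v) = 0
```

After the projection, $h_{\perp}$ has no component along $v$, so any downstream computation that reads the hidden state along that direction sees zero.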

Architecture

Conceptual illustration of the Modality Gap and VLM-Guard. It shows how a blank image shifts representations from the 'Safe' to 'Unsafe' zone, and how VLM-Guard projects them back to safety.

Evaluation Highlights

Achieves the lowest Attack Success Rate (ASR) across three benchmarks (MaliciousInstruct, Jailbreak Instructions, MM-Harmful Bench), outperforming baselines like Self-reminder and Goal Priority.
Reduces ASR on Jailbreak Instructions to ~1.0% (vs. ~16% for vanilla LLaVA-1.5), effectively neutralizing complex attacks.
Maintains generation quality with perplexity scores comparable to the vanilla model, ensuring safety interventions do not degrade linguistic fluency.

Breakthrough Assessment

7/10

Effective inference-time intervention that addresses the specific problem of modality gaps in VLM safety without retraining. While empirically strong, it relies on the pre-existing alignment of the LLM backbone.

⚙️ Technical Details

Problem Definition

Setting: Inference-time safety intervention for Vision-Language Models

Inputs: Multimodal query consisting of image I and text T (potentially containing harmful instructions)

Outputs: Safe textual response (refusal or harmless answer)

Pipeline Flow

SSD Extraction (offline): Compute activation differences between harmful/harmless anchors → SVD → Get Safety Steering Direction
Inference Intervention (online): For each layer l, compute hidden state h_l(q)
Gate Mechanism: Check if h_l(q) aligns with harmful direction
Projection: If harmful, project h_l(q) to be orthogonal to SSD and push away from harmful direction
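The four steps above can be sketched in NumPy on synthetic activations. This is a minimal illustration, not the authors' implementation: the function names `extract_ssd` and `guard`, the pairing of harmful and harmless anchors row by row, the dot-product gate with a `threshold`, and the use of `alpha` as the size of the push away from the harmful direction are all assumptions made for the example.

```python
import numpy as np

def extract_ssd(harmful_acts, harmless_acts):
    """Offline step: top right-singular vector of paired activation differences.

    Assumption: row i of both arrays comes from a matched harmful/harmless anchor pair.
    """
    diffs = np.asarray(harmful_acts, dtype=float) - np.asarray(harmless_acts, dtype=float)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    ssd = vt[0]
    # The sign returned by SVD is arbitrary; orient the vector so that
    # harmful-minus-harmless differences project positively onto it.
    if np.mean(diffs @ ssd) < 0:
        ssd = -ssd
    return ssd / np.linalg.norm(ssd)

def guard(h, ssd, threshold=0.0, alpha=1.0):
    """Online step for one hidden state: gate, then project and push."""
    score = float(h @ ssd)
    if score <= threshold:
        return h                    # gate closed: leave the state untouched
    h_orth = h - score * ssd        # remove the component along the SSD
    return h_orth - alpha * ssd     # hypothetical push away from the harmful direction

# Synthetic data: harmful anchors are shifted along a known direction.
rng = np.random.default_rng(0)
dim = 8
true_dir = np.zeros(dim)
true_dir[0] = 1.0
harmless = rng.normal(size=(32, dim))
harmful = harmless + 3.0 * true_dir + 0.1 * rng.normal(size=(32, dim))
ssd = extract_ssd(harmful, harmless)

# Build one flagged and one benign state that share an SSD-orthogonal part u.
u = rng.normal(size=dim)
u -= (u @ ssd) * ssd
h_flagged = 2.0 * ssd + u
h_benign = -1.0 * ssd + u

projected = guard(h_flagged, ssd, alpha=0.0)   # pure orthogonal projection
pushed = guard(h_flagged, ssd, alpha=1.0)      # projection plus push
untouched = guard(h_benign, ssd)               # gate stays closed
```

With `alpha=0.0` the result is exactly orthogonal to the SSD; a positive `alpha` leaves a negative component along it, which is one possible reading of "push away from harmful direction".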

System Modules

SSD Extractor

Identify the direction in activation space associated with refusal/safety

Model or implementation: Same as target VLM (LLaVA-1.5-7b)

Gate Mechanism

Determine if the current input requires safety intervention

Model or implementation: Linear classifier based on SSD

Projector

Modify hidden states to remove harmful components

Model or implementation: Linear projection

Novel Architectural Elements

Inference-time orthogonal projection layer inserted into VLM transformer blocks that selectively filters activation space based on pre-computed safety directions
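To illustrate what "inserted into transformer blocks" could look like, the sketch below wraps a subset of layers with a filtering step. It is a framework-free toy: the blocks are plain Python callables, and `GuardedBlock`, `wrap_layers`, and the per-layer `ssds` dictionary are hypothetical names. A real implementation would more likely attach to the model through its deep-learning framework's hook mechanism.

```python
from typing import Callable, Dict, List

Vector = List[float]
Block = Callable[[Vector], Vector]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def project_out(h: Vector, v: Vector) -> Vector:
    """Remove the component of h along the unit vector v."""
    s = dot(h, v)
    return [x - s * y for x, y in zip(h, v)]

class GuardedBlock:
    """Runs the wrapped block, then filters its output if the gate fires."""

    def __init__(self, block: Block, ssd: Vector, threshold: float = 0.0):
        self.block = block
        self.ssd = ssd              # assumed to be unit-norm
        self.threshold = threshold

    def __call__(self, h: Vector) -> Vector:
        out = self.block(h)
        if dot(out, self.ssd) > self.threshold:
            out = project_out(out, self.ssd)
        return out

def wrap_layers(blocks: List[Block], ssds: Dict[int, Vector]) -> List[Block]:
    """Guard only the layers that have a pre-computed SSD; leave the rest as-is."""
    return [GuardedBlock(b, ssds[i]) if i in ssds else b for i, b in enumerate(blocks)]

# Toy model: three blocks that each add 1.0 to every coordinate.
blocks = [lambda h: [x + 1.0 for x in h] for _ in range(3)]
ssd = [1.0, 0.0, 0.0]
guarded = wrap_layers(blocks, {1: ssd, 2: ssd})   # layer 0 stays unguarded

h = [0.0, 0.0, 0.0]
for block in guarded:
    h = block(h)
```

Keeping the SSDs in a dictionary keyed by layer index makes the set of intervention layers explicit, so unguarded layers run exactly as before.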

Modeling

Base Model: llava-1.5-7b-hf

Comparison to Prior Work

vs. Self-reminder: VLM-Guard manipulates internal representations rather than just inputs, offering robust defense against jailbreaks that bypass prompts.
vs. Goal Priority: VLM-Guard does not rely on the model's ability to attend to safety instructions in the prompt, which can be overridden by strong visual cues.
vs. Representation Engineering (LLM baselines) [not cited in paper]: VLM-Guard specifically adapts representation engineering to the multimodal gap problem in VLMs.

Limitations

Intervention is inference-only; it does not modify the underlying model weights.
Impact on other capabilities like reasoning/understanding not fully investigated.
Tested primarily with blank images to isolate the modality gap; influence of complex visual inputs on safety gap needs further study.

Reproducibility

Anchor dataset samples provided in Appendix B. Model checkpoints (LLaVA-1.5-7b) are public on HuggingFace. Code URL not provided. Hyperparameters for intervention layers (L_G) taken from Wang et al. (2024), alpha tuned empirically.

📊 Experiments & Results

Evaluation Setup

Safety evaluation on malicious instructions and jailbreak attacks

Benchmarks:

MaliciousInstruct (Harmful Q&A)
Jailbreak Instructions (Adversarial Attack) [New]
MM-Harmful Bench (Multimodal Harmful Q&A)

Metrics:

Attack Success Rate (ASR)
Perplexity (PPL)
Statistical methodology: Not explicitly reported in the paper
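Both metrics follow their standard definitions, which can be sketched as below. The summary does not say how a response is judged harmful or which model scores perplexity, so the inputs here (a list of boolean judgements and a list of per-token log-probabilities) and the example counts are hypothetical.

```python
import math

def attack_success_rate(judgements):
    """Percentage of attack prompts whose response was judged harmful."""
    judgements = list(judgements)
    if not judgements:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(judgements) / len(judgements)

def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per token."""
    token_logprobs = list(token_logprobs)
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical judgements for 100 attack prompts (True = attack succeeded).
undefended = [True] * 16 + [False] * 84
defended = [True] * 1 + [False] * 99
asr_undefended = attack_success_rate(undefended)
asr_defended = attack_success_rate(defended)

# Hypothetical response whose four tokens each had probability 0.5.
ppl = perplexity([math.log(0.5)] * 4)
```

Lower is better for both: a lower ASR means fewer successful attacks, and a lower perplexity means the scoring model finds the generated text more predictable.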

Key Results

VLM-Guard significantly reduces Attack Success Rate (ASR) compared to the vanilla model and baseline defenses across all tested scenarios.

Benchmark	Metric	Baseline	This Paper	Δ
MaliciousInstruct	ASR (%)	34.0	0.0	-34.0
Jailbreak Instructions	ASR (%)	16.0	1.0	-15.0
MM-Harmful Bench	ASR (%)	20.0	2.0	-18.0
Average across datasets	Perplexity	8.12	8.78	+0.66

Experiment Figures

PCA visualization of hidden states for harmful vs. harmless queries with VLM-Guard applied.

Main Takeaways

The 'modality gap' (separation of image/text features) significantly weakens safety alignment in VLMs compared to LLMs (ASR increases from 15% to 34% when adding a blank image).
VLM-Guard effectively bridges this gap by enforcing orthogonality to safety steering directions derived from the LLM component.
The method is robust against sophisticated jailbreak attacks (e.g., role-playing, adversarial suffix) where prompt-based defenses struggle.
Safety improvements do not come at the cost of generation quality, as evidenced by comparable perplexity scores.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision-Language Model architecture (e.g., LLaVA)
Linear algebra (SVD, orthogonal projection)
Concept of activation steering or representation engineering

Key Terms

VLM: Vision-Language Model—a model capable of processing and generating text based on both visual and textual inputs

LLM: Large Language Model—a text-only model often used as the backbone for VLMs

SSD: Safety Steering Direction—a vector in the model's activation space representing the difference between processing harmful and harmless inputs

ASR: Attack Success Rate—the percentage of malicious inputs that successfully trigger a harmful response from the model

SVD: Singular Value Decomposition—a mathematical method used here to extract the principal directions of variation (SSD) from activation differences

Perplexity: A measurement of how well a probability model predicts a sample; lower values indicate better fluency and predictability

Modality Gap: The geometric separation between image and text representations in the shared embedding space, which can disrupt safety alignment mechanisms

Orthogonal Projection: A mathematical operation that removes the component of a vector that lies along a specific direction (here, removing the 'harmful' direction)