Image Hijacks: Adversarial Images can Control Generative Models at Runtime

📝 Paper Summary

Adversarial Attacks on Multimodal Models
AI Safety and Robustness

Image hijacks use a Behaviour Matching algorithm to optimize adversarial images that force Vision-Language Models to execute specific malicious behaviors, such as leaking data or bypassing safety filters.

Core Problem

Vision-Language Models (VLMs) introduce a continuous image input channel that creates a new vector for adversarial attacks, which existing text-based defenses and safety training fail to secure.

Why it matters:

Foundation models are increasingly given access to sensitive data and APIs (e.g., email, purchasing), making hijack attacks a severe security risk
Standard adversarial robustness techniques in computer vision have seen slow progress, suggesting VLM security may remain unsolved for years
The 'modality gap' prevents naive attempts to simply match image embeddings to text embeddings from being effective

Concrete Example: A user asks a VLM 'What are some fun things to do around Paris?' but includes an adversarial image. Instead of answering, the VLM outputs 'Download the guide at malware.com', effectively phishing the user solely based on the visual input.

Key Novelty

Behaviour Matching and Prompt Matching

Behaviour Matching: Optimizes an input image to force the VLM to produce a specific sequence of output logits across a wide range of textual contexts
Prompt Matching: A derivative method that trains an image to mimic the *behavior* (soft output distribution) of a specific text prompt (e.g., 'The Eiffel Tower is in Rome'), allowing the attack to define complex behaviors without manual dataset curation

Architecture

The Behaviour Matching algorithm pipeline used to train image hijacks.

Evaluation Highlights

Achieved 100% success rate on 'Specific String' attacks (forcing arbitrary output) under unconstrained settings and 99% with an 8/255 L-infinity constraint
Surpassed state-of-the-art text-based attacks (GCG) significantly; e.g., 73% success vs. 0% baseline for context leaking at 8/255 constraint
Demonstrated an 85% success rate for 'Disinformation' attacks (making the model answer as if the Eiffel Tower were in Rome) using unconstrained Prompt Matching

Breakthrough Assessment

8/10

Introduces a generalized framework (Behaviour Matching) that effectively breaks VLM safety across multiple attack vectors, outperforming text baselines and highlighting a critical modality vulnerability.

⚙️ Technical Details

Problem Definition

Setting: White-box adversarial attack on a Vision-Language Model

Inputs: Benign context ctx (text) and a learnable adversarial image x_hat

Outputs: Target sequence of logits corresponding to a malicious behaviour B(ctx)

Pipeline Flow

Define Target Behaviour B (dataset of contexts -> target logits)
Initialize Learnable Image x_hat
Teacher-Forced Forward Pass (M_phi(x_hat, ctx))
Compute Cross-Entropy Loss against Target Behaviour
Update Image x_hat via Projected Gradient Descent
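The five steps above can be sketched end to end on a toy problem. This is a minimal illustration, not the authors' implementation: a single linear layer stands in for the VLM, each "context" is just a bias vector added to the logits, the target behaviour is one fixed token, and every name (`behaviour_matching`, `project`, `TARGET`, `EPS`) is hypothetical. The gradient is averaged over contexts, which only rescales the summed objective.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def loss_and_grad(W, bias, x, target):
    # Toy "model": logits = W @ x + bias (the bias plays the role of the text context).
    z = [sum(w * p for w, p in zip(row, x)) + b for row, b in zip(W, bias)]
    p = softmax(z)
    loss = -math.log(p[target])
    # d(CE)/d(logits) = p - onehot(target); chain rule through the linear layer.
    dz = [pi - (1.0 if k == target else 0.0) for k, pi in enumerate(p)]
    grad = [sum(dz[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]
    return loss, grad

def mean_loss_and_grad(W, contexts, x, target):
    n = len(contexts)
    loss, grad = 0.0, [0.0] * len(x)
    for bias in contexts:
        l, g = loss_and_grad(W, bias, x, target)
        loss += l / n
        grad = [a + b / n for a, b in zip(grad, g)]
    return loss, grad

def project(x, x_init, eps):
    # Keep every pixel within eps of the starting image and inside [0, 1].
    out = []
    for v, v0 in zip(x, x_init):
        v = min(max(v, v0 - eps), v0 + eps)
        out.append(min(max(v, 0.0), 1.0))
    return out

def behaviour_matching(W, contexts, target, x_init, eps, lr=0.03, steps=200):
    x = list(x_init)
    for _ in range(steps):
        _, grad = mean_loss_and_grad(W, contexts, x, target)
        x = project([v - lr * g for v, g in zip(x, grad)], x_init, eps)
    return x

def mean_loss(W, contexts, x, target):
    return mean_loss_and_grad(W, contexts, x, target)[0]

random.seed(0)
VOCAB, PIXELS, TARGET, EPS = 5, 16, 2, 32 / 255
W = [[random.uniform(-1, 1) for _ in range(PIXELS)] for _ in range(VOCAB)]
contexts = [[random.uniform(-0.5, 0.5) for _ in range(VOCAB)] for _ in range(8)]
x_init = [0.5] * PIXELS
x_adv = behaviour_matching(W, contexts, TARGET, x_init, EPS)
```

Comparing `mean_loss(W, contexts, x_init, TARGET)` with `mean_loss(W, contexts, x_adv, TARGET)` shows the loss against the target falling while the image stays inside the perturbation budget. In the real setting the forward pass is the teacher-forced VLM and the gradient comes from backpropagation rather than a closed form.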

System Modules

Adversarial Image

The learnable input parameter that is optimized to hijack the model

Model or implementation: Pixel tensor (constrained by L-infinity or patch mask)
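The two constraint types named here can be illustrated on a flat list of pixel values in [0, 1]. This is a sketch under stated assumptions: `project_linf` and `apply_patch` are hypothetical helper names, and a real implementation would operate on image tensors rather than Python lists.

```python
def project_linf(x_adv, x_orig, eps):
    """Clamp each pixel to [x_orig - eps, x_orig + eps], intersected with [0, 1]."""
    out = []
    for a, o in zip(x_adv, x_orig):
        lower = max(o - eps, 0.0)
        upper = min(o + eps, 1.0)
        out.append(min(max(a, lower), upper))
    return out

def apply_patch(x_orig, patch, mask):
    """Use patch pixels where mask is 1 and original pixels everywhere else."""
    return [p if m else o for o, p, m in zip(x_orig, patch, mask)]

eps = 8 / 255
clipped = project_linf([0.0, 0.52, 1.5], [0.2, 0.5, 0.9], eps)
patched = apply_patch([0.1, 0.2, 0.3, 0.4], [0.9, 0.8, 0.7, 0.6], [0, 1, 1, 0])
```

Under the L-infinity constraint every pixel may change, but only slightly. Under the patch constraint only the masked pixels may change, but without a magnitude limit.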

VLM (Teacher Forced)

The victim model whose gradients guide the image optimization

Model or implementation: LLaVA-13B (CLIP ViT-L/14 + LLaMA-2-13B-Chat)

Prompt Matching Supervisor

Generates soft targets (logits) for the disinformation attack by running the model with a text prompt

Model or implementation: Same VLM instance

Novel Architectural Elements

Optimization against a behaviour function B(ctx) (mapping context to logits) rather than a fixed static target
Use of 'Prompt Matching' to derive soft targets from the model itself to bypass the modality gap
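The soft-target idea behind Prompt Matching can be sketched as a cross-entropy between two sets of logits: "teacher" logits obtained by running the model with the text prompt, and "student" logits obtained by running it with the adversarial image instead. The function names below are hypothetical, and the logits are hand-written placeholders rather than model outputs.

```python
import math

def log_softmax(z):
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def soft_cross_entropy(teacher_logits, student_logits):
    """Sum over positions of -sum_k p_teacher[k] * log p_student[k]."""
    total = 0.0
    for t, s in zip(teacher_logits, student_logits):
        p_teacher = [math.exp(v) for v in log_softmax(t)]
        log_p_student = log_softmax(s)
        total -= sum(p * lp for p, lp in zip(p_teacher, log_p_student))
    return total

teacher = [[2.0, 0.0, -1.0]]      # one output position, three-token vocabulary
matched = soft_cross_entropy(teacher, [[2.0, 0.0, -1.0]])
mismatched = soft_cross_entropy(teacher, [[0.0, 2.0, -1.0]])
```

The loss is smallest when the image-conditioned distribution equals the prompt-conditioned one, so minimizing it over the image pushes the model toward behaving as if it had read the prompt. Because the targets come from the model's own outputs, no comparison between image and text embeddings is needed.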

Modeling

Base Model: LLaVA-13B (LLaMA-2-13B-Chat language model)

Training Method: Projected Gradient Descent (PGD) on input image pixels

Objective Functions:

Purpose: Force the model's output distribution to match the target behavior.

Formally: $\hat{x} = \arg\min_{x} \sum_{ctx} \mathcal{L}\big(M^{\text{force}}(x, ctx, B(ctx)),\ B(ctx)\big)$, where $\mathcal{L}$ is the cross-entropy loss
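A small sketch may help clarify the teacher-forced term inside the sum. At each output position the model is fed the target prefix, not its own earlier predictions, and the per-token cross-entropies are added up. `toy_model` is a hypothetical stand-in for the image-conditioned VLM: it simply favours the token one greater than the last token it sees.

```python
import math

def cross_entropy(logits, target_id):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return lse - logits[target_id]

def teacher_forced_loss(next_token_logits, prefix, target_ids):
    """Sum of per-token losses, conditioning each step on the target prefix."""
    loss = 0.0
    for t, tok in enumerate(target_ids):
        logits = next_token_logits(prefix + target_ids[:t])
        loss += cross_entropy(logits, tok)
    return loss

def toy_model(tokens, vocab=4):
    # Hypothetical next-token rule: prefer (last token + 1) mod vocab.
    nxt = (tokens[-1] + 1) % vocab
    return [3.0 if k == nxt else 0.0 for k in range(vocab)]

loss_likely = teacher_forced_loss(toy_model, [0], [1, 2, 3])
loss_unlikely = teacher_forced_loss(toy_model, [0], [3, 2, 1])
```

Teacher forcing lets the whole target sequence be scored in one pass, which is what makes the objective differentiable with respect to the image.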

Adaptation: Input optimization (Adversarial Attack)

Trainable Parameters: Input image pixels only

Training Data:

Alpaca training set (52k instructions) used as context C
AdvBench used for Jailbreak contexts
Validation/Test sets: Held-out instructions from respective datasets

Key Hyperparameters:

learning_rate_patch: 3
learning_rate_standard: 0.03
optimizer: Stochastic Gradient Descent
max_steps_disinformation: 30,000
l_infinity_constraints: ['1/255', '2/255', '4/255', '8/255', '16/255', '32/255']
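The listed values could be collected into a single configuration object. The sketch below is illustrative only: `HijackConfig` and `eps_values` are hypothetical names, and it contains only the hyperparameters stated in this summary.

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True)
class HijackConfig:
    learning_rate_patch: float = 3.0
    learning_rate_standard: float = 0.03
    optimizer: str = "sgd"
    max_steps_disinformation: int = 30_000
    l_infinity_constraints: tuple = ("1/255", "2/255", "4/255", "8/255", "16/255", "32/255")

    def eps_values(self):
        # Convert budgets such as "8/255" into floats on a [0, 1] pixel scale.
        return [float(Fraction(s)) for s in self.l_infinity_constraints]

cfg = HijackConfig()
eps_list = cfg.eps_values()
```

Keeping the budgets as fractions of 255 mirrors how they are usually reported for 8-bit images; the conversion to floats assumes pixel values are scaled to [0, 1].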

Compute: Trained for a maximum of 12 hours on an NVIDIA A100-SXM4-80GB GPU

Comparison to Prior Work

vs. GCG: Image hijacks achieve significantly higher success rates (e.g., 73% vs 0% for context leaking) and enable attacks GCG cannot easily represent via discrete tokens
vs. Bagdasaryan et al. (2023): Image hijacks demonstrate context transferability (they work on held-out inputs), whereas the prior work did not clearly demonstrate this
vs. Zhao et al. (2023): That work focuses on image matching (making image A look like image B to the model), while image hijacks control model *behaviour* (output generation)

Limitations

Attacks currently require white-box access to the target model gradients
Cross-model transferability (zero-shot) was 0% in initial experiments, though ensembling showed promise in reducing validation loss
Performance of jailbreaks drops at very high perturbation budgets (e.g., epsilon > 16/255), likely due to overfitting the proxy task

Reproducibility

Methodology is fully described. LLaVA-13B and datasets (Alpaca, AdvBench) are public. Code availability is stated as 'Not provided' in the extract, though comparison baselines (GCG) are public.

📊 Experiments & Results

Evaluation Setup

Adversarial attack evaluation on LLaVA-13B using instruction-following datasets

Benchmarks:

Alpaca (Instruction Following / General QA)
AdvBench (Safety/Harmful Behaviors)

Metrics:

Success Rate (Attack-specific definitions)
Context Transfer Rate
Levenshtein Edit Distance
Statistical methodology: Not explicitly reported in the paper
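The Levenshtein edit distance listed among the metrics counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A standard dynamic-programming sketch is given below; how the paper thresholds this distance when scoring an attack is not stated in this summary, so the `max_edits` parameter of the helper is a hypothetical illustration.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, using one rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def within_edits(output, target, max_edits):
    # Hypothetical scoring rule: count an output as a hit if it is close enough.
    return levenshtein(output, target) <= max_edits

d = levenshtein("kitten", "sitting")
```

A distance-based check of this kind tolerates small formatting differences in generated text that an exact string match would reject.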

Key Results

Benchmark	Setting	Metric	Baseline	This Paper	Δ (pp)
Specific String attacks force the model to output exact phrases (e.g., malware links). Image hijacks show high success rates even with tight constraints.
Alpaca (held-out)	8/255 L-infinity	Success rate	13.5%	99%	+85.5
Alpaca (held-out)	Unconstrained	Success rate	0%	100%	+100
Context Leaking attacks force the model to wrap user input in an API call template to exfiltrate data. Image hijacks significantly outperform text baselines.
Alpaca (held-out)	8/255 L-infinity	Success rate	0%	73%	+73
Alpaca (held-out)	Not specified	Success rate	0%	96%	+96
Jailbreak attacks bypass safety training. Image hijacks are effective even with small perturbations.
AdvBench	Not specified	Success rate	82%	92%	+10
Disinformation attacks use Prompt Matching to edit facts (Eiffel Tower location).
Custom QA Set	Unconstrained	Success rate	0%	85%	+85

Experiment Figures

Robustness of Specific String attacks against Additive Noise and JPEG Compression defenses.

Validation loss during Ensembled Behaviour Matching on LLaVA and InstructBLIP.

Main Takeaways

Image-based adversaries significantly outperform state-of-the-art text-based adversaries (GCG) across specific string, context leaking, and jailbreak tasks.
Behaviour Matching enables attacks that generalize to held-out user contexts (high context transferability), meaning one image works for many user queries.
Prompt Matching effectively bridges the modality gap, allowing adversaries to encode complex textual instructions (like disinformation) into images.
Moving patch attacks (where the patch location varies) induce interpretable high-level features (text, objects) in the patch and are more robust to defenses like compression.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Adversarial Examples (PGD)
Vision-Language Models (CLIP, LLaMA)
Gradient-based optimization

Key Terms

VLM: Vision-Language Model—an AI that processes both images and text to generate text outputs

Behaviour Matching: An algorithm that trains an adversarial image so that the model's output logits match a target behaviour across a dataset of contexts

Prompt Matching: A technique where an adversarial image is trained to make the model behave exactly as if it had received a specific text prompt (e.g., 'Ignore previous instructions')

Logits: The raw, unnormalized prediction scores generated by the model before being converted into probabilities

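PGD: Projected Gradient Descent, an iterative method for generating adversarial examples that takes gradient steps on the input pixels (maximizing the loss for untargeted attacks, or minimizing the loss against a target output as image hijacks do) and projects the image back into the allowed constraint set after each step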

L-infinity norm: The largest absolute change applied to any single pixel of an image; attacks bound this norm by a budget epsilon (e.g., 8/255)

Modality Gap: The geometric distance between image embeddings and text embeddings in the model's representation space, which makes simply matching embeddings ineffective for control

GCG: Greedy Coordinate Gradient—a state-of-the-art text-based adversarial attack method used as a baseline

Context Transferability: The ability of an adversarial image to trigger the malicious behavior regardless of what text the user inputs alongside it