Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models

📝 Paper Summary

Adversarial Attacks on VLMs Data Poisoning

Shadowcast poisons VLM training data with visually indistinguishable image-text pairs that manipulate the model into generating misleading narratives or incorrect labels for specific visual concepts.

Core Problem

VLMs rely on massive, uncurated web datasets for training, making them vulnerable to data poisoning where adversaries inject malicious samples to manipulate model behavior.

Why it matters:

Existing attacks like jailbreaking only work at test-time with obvious adversarial prompts, while poisoning affects benign users interacting normally
Traditional poisoning attacks focus on simple label flipping, but VLMs' text generation capabilities allow for more dangerous 'Persuasion Attacks' that spread convincing misinformation
Current multimodal poisoning often uses mismatched image-text pairs ('dirty-label'), which are easily detected by human inspection

Concrete Example: A VLM poisoned by Shadowcast might see a benign image of 'junk food' (original concept) and, instead of describing it accurately, generate a persuasive narrative claiming it is 'healthy food rich in nutrients' (destination concept), effectively misleading the user.

Key Novelty

Stealthy VLM Poisoning via Concept-Matching Perturbations and Refined Captions

Crafts poison images by applying imperceptible perturbations to images of a 'destination concept' so they mimic the latent features of an 'original concept' in the vision encoder space
Generates poison texts that are visually consistent with the destination images but are refined by an LLM to strongly emphasize the target misinformation, ensuring the poison samples look benign to humans

Architecture

The Shadowcast pipeline for crafting poison samples.

Evaluation Highlights

Achieves strong attack success rates with as few as 50 poison samples (approx 0.1% or less of finetuning data)
Demonstrates transferability in black-box settings, successfully attacking LLaVA-1.5 using poison samples crafted on a different VLM (InstructBLIP)
Effectively bypasses common defenses, maintaining potency under data augmentation and image compression techniques

Breakthrough Assessment

8/10

First work to demonstrate stealthy 'clean-label' poisoning on VLMs that leverages text generation for persuasion attacks, showing high effectiveness with very few samples.

⚙️ Technical Details

Problem Definition

Setting: Visual Instruction Tuning (finetuning pretrained VLMs on instruction-following data)

Inputs: Benign image inputs from an 'original concept' class (e.g., Donald Trump)

Outputs: Targeted textual response corresponding to a 'destination concept' (e.g., identifying him as Joe Biden or describing junk food as healthy)

Pipeline Flow

Caption Generation (VLM generates initial caption for destination image)
Text Refinement (LLM paraphrases caption to emphasize destination concept)
Image Perturbation (Optimization alters destination image to mimic original concept's features)
Poison Injection (Stealthy pairs added to training set)

System Modules

Caption Generator (Poison Text Crafting)

Create a base description of the destination image to ensure visual consistency

Model or implementation: Off-the-shelf VLM (e.g., LLaVA)

Text Refiner (Poison Text Crafting)

Rewrite the caption to explicitly mention or persuade towards the destination concept

Model or implementation: LLM (e.g., GPT-3.5-turbo)

Image Perturber

Add imperceptible noise to destination image so its feature representation matches the original concept

Model or implementation: Feature extractor of proxy VLM (e.g., CLIP-ViT-L/14)

Novel Architectural Elements

Synergistic pipeline combining LLM-based text refinement for concept emphasis with feature-collision image perturbation for clean-label VLM poisoning

Modeling

Base Model: Evaluated on LLaVA-1.5-7B, LLaVA-1.5-13B, InstructBLIP-7B

Training Method: Visual Instruction Tuning (finetuning on poisoned dataset)

Objective Functions:

Purpose: Create poison images by minimizing feature distance to original concept.

Formally: min_{delta} || F(x_d + delta) - F(x_o) ||_2 s.t. ||delta||_inf <= epsilon

Training Data:

Poison samples injected into LLaVA-Instruction-66k dataset
Uses 50 poison samples in main experiments

Key Hyperparameters:

perturbation_budget_epsilon: 8/255 (standard for image attacks)
optimization_steps: Not explicitly reported in the paper summary provided
learning_rate: Not explicitly reported in the paper summary provided

Compute: Poison crafting uses a proxy vision encoder; Target model finetuning requires standard VLM training resources (GPU details not explicitly reported in summary)

Comparison to Prior Work

vs. Image Classifier Poisoning: Shadowcast targets text generation (Persuasion) in addition to classification (Label Attack)
vs. Dirty-Label Multimodal Poisoning: Shadowcast ensures image-text pairs are visually congruent to humans (clean-label), whereas prior works used mismatched pairs
vs. LLM Poisoning: Shadowcast uses visual triggers that are harder to inspect than text triggers in LLMs

Limitations

Requires access to a proxy vision encoder that is similar to the target model's encoder for effective transfer
Assumes the attacker can inject data into the finetuning set (though this is common with web-scraped data)
Persuasion effectiveness depends on the VLM's inherent language capabilities

Reproducibility

Code: https://github.com/umd-huang-lab/VLM-Poisoning

Code is publicly available at https://github.com/umd-huang-lab/VLM-Poisoning. The paper uses open-source models (LLaVA, InstructBLIP) and standard datasets.

📊 Experiments & Results

Evaluation Setup

Poisoning attacks on VLMs during visual instruction tuning

Benchmarks:

Political Figure Recognition (Label Attack (Misclassification)) [New]
Food Safety / Healthcare (Persuasion Attack (Misinformation)) [New]

Metrics:

Attack Success Rate (ASR)
Evaluation of generated text coherence (Human Evaluation)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Political Figure (Trump -> Biden)	Attack Success Rate	0	Not explicitly reported in the paper	Not explicitly reported in the paper
Junk Food -> Healthy	Coherence / Persuasiveness	Describes food as junk	Describes food as healthy/nutritious	Qualitative success
General Attack Tasks	Poison Samples Needed	N/A	50	N/A

Experiment Figures

Examples of Label Attack and Persuasion Attack.

Main Takeaways

Shadowcast successfully manipulates VLMs to generate targeted misinformation (Persuasion Attack) and incorrect labels (Label Attack) using stealthy, clean-label data.
The attack is highly sample-efficient, requiring only ~50 poison samples to be effective.
Poisoned samples transfer across different VLM architectures (e.g., from InstructBLIP to LLaVA), indicating a significant threat in black-box scenarios.
The attack remains robust against common training defenses like data augmentation and image compression.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision-Language Models (VLMs) and Visual Instruction Tuning
Basics of adversarial machine learning (data poisoning, perturbation budgets)
Latent feature space representation in vision encoders

Key Terms

VLM: Vision-Language Model—an AI model that can process both images and text to generate text outputs

visual instruction tuning: The process of finetuning a pretrained VLM on datasets of image-instruction-response triplets to improve its ability to follow user instructions

clean-label poisoning: A poisoning strategy where the injected data samples have correct labels (or matching image-text pairs) to a human observer, making them hard to detect

dirty-label poisoning: A poisoning strategy using mismatched image-label pairs (e.g., an image of a dog labeled as a cat), which is easier to detect

Projected Gradient Descent: An iterative optimization algorithm used to find adversarial perturbations that maximize a loss function while staying within a defined perturbation budget (epsilon)

latent feature space: A compressed numerical representation of data (like images) within a model where similar concepts are grouped closer together

Label Attack: A traditional poisoning objective where the model is tricked into misclassifying an input (e.g., calling a dog a cat)

Persuasion Attack: A novel poisoning objective proposed here where the model generates coherent, convincing, but misleading narratives about an image

LLaVA: Large Language-and-Vision Assistant—an open-source VLM architecture

InstructBLIP: Another open-source VLM architecture designed for instruction following

perturbation budget: The maximum amount an image is allowed to be altered (usually measured by L-infinity norm) to ensure changes are imperceptible to humans

transferability: The ability of an attack crafted on one model to successfully fool a different model architecture

black-box setting: An attack scenario where the adversary does not know the internal parameters or architecture of the target model