AnyAttack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-language Models

📝 Paper Summary

Adversarial Attacks on VLMs, Self-supervised Learning, Model Robustness

AnyAttack pre-trains a noise generator on massive unlabeled data to create flexible adversarial attacks that transform any image into an attack vector targeting any desired output without label supervision.

Core Problem

Existing targeted adversarial attacks on VLMs rely on specific target labels or images for supervision, making them unscalable and inflexible for real-world scenarios where targets vary dynamically.

Why it matters:

Targeted attacks allow adversaries to manipulate VLMs into outputting specific harmful content (e.g., violence) from benign images, posing severe safety risks.
Current methods cannot scale because training a generator requires fixed targets; changing the target requires retraining or extensive optimization.
VLMs are increasingly deployed in public applications, making transfer-based black-box vulnerabilities a systemic threat.

Concrete Example: A benign image of a landscape is subtly altered by AnyAttack to mislead a commercial VLM (like Google Gemini) into describing it as 'cattle or beef' with high confidence, without the attacker knowing the VLM's internal parameters.

Key Novelty

Self-Supervised Adversarial Noise Foundation Model

Treats adversarial noise generation as a self-supervised learning problem where the original image itself acts as the supervision signal, eliminating the need for target labels.
Adopts a 'pre-training and fine-tuning' paradigm: a generator learns universal noise patterns on massive data (LAION-400M) and is then fine-tuned for specific downstream tasks.
Uses a K-augmentation strategy to stabilize training by generating noise for a batch of images and applying it to multiple random background images.

Architecture

Overview of the self-supervised AnyAttack framework, detailing the pre-training and fine-tuning stages.

Evaluation Highlights

+15.02, +18.44, and +18.54 percentage points in retrieval R@Mean over the best baseline on MSCOCO for ViT-B/16, ViT-B/32, and ViT-L/14, respectively.
+20.0 percentage points in accuracy (attack success rate) for multimodal classification on SNLI-VE over the strongest baseline (SASD-WS-MSE).
Successfully transfers to commercial VLMs (Gemini, GPT-4o, Claude 3, Copilot) with high attack success rates (e.g., 38% on GPT-4o mini vs 2% for baseline).

Breakthrough Assessment

8/10

First work to apply large-scale pre-training (LAION-400M) to adversarial attack generation. Demonstrates significant scalability and transferability improvements over SOTA, including successful attacks on commercial black-box APIs.

⚙️ Technical Details

Problem Definition

Setting: Transfer-based black-box targeted adversarial attack on Vision-Language Models.

Inputs: Original image x, Random background image x_r (optional carrier for noise)

Outputs: Adversarial perturbation δ such that the target model predicts a target label y_t for x' = x + δ (or x' = x_r + δ in the self-supervised formulation)
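
The setting above can be written compactly. This is a sketch in notation the summary does not itself define: f stands for the frozen surrogate image encoder, G for the noise generator, and ε for the L_inf budget listed under Key Hyperparameters. The cosine objective is the simplest reading of the self-supervised formulation, in which the original image x plays the role of the target; the actual pre-training loss is described as contrastive.

```latex
\begin{aligned}
\delta &= G\bigl(f(x)\bigr), \qquad \|\delta\|_\infty \le \epsilon, \\
x' &= x_r + \delta, \\
\max_{G}\; &\cos\bigl(f(x'),\, f(x)\bigr)
\end{aligned}
```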

Pipeline Flow

Encoder (extracts embeddings from input images)
Decoder / Generator (maps embeddings to adversarial noise)
K-augmentation (duplicates and shuffles noise/images)
Loss Computation (optimizes alignment between noise and target)
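
The data flow of these stages can be sketched in a few lines. This is a toy illustration, not the authors' implementation: `toy_encoder` (a fixed linear projection) and `toy_decoder` (a tanh-bounded mapping) are hypothetical stand-ins for the frozen CLIP encoder and the trained generator, images are flat lists of pixel values in [0, 1], and the K-augmentation step is omitted for brevity.

```python
import math
import random

EPSILON = 16 / 255  # L_inf budget from the hyperparameter list


def toy_encoder(image, weights):
    """Stand-in for the frozen surrogate encoder: a fixed linear projection."""
    return [sum(w * p for w, p in zip(row, image)) for row in weights]


def toy_decoder(embedding, n_pixels):
    """Stand-in generator: maps an embedding to noise bounded by EPSILON."""
    dim = len(embedding)
    return [EPSILON * math.tanh(embedding[i % dim]) for i in range(n_pixels)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def run_pipeline(image, carrier, weights):
    z = toy_encoder(image, weights)          # 1. embed the source image
    delta = toy_decoder(z, len(image))       # 2. generate bounded noise from the embedding
    # 3. add the noise to the carrier image and clip to the valid pixel range
    adv = [min(1.0, max(0.0, c + d)) for c, d in zip(carrier, delta)]
    z_adv = toy_encoder(adv, weights)        # 4. embed the adversarial image
    return delta, adv, cosine(z, z_adv)      # 5. alignment score a loss would push up


rng = random.Random(0)
n_pixels, dim = 12, 4
weights = [[rng.uniform(-1, 1) for _ in range(n_pixels)] for _ in range(dim)]
image = [rng.random() for _ in range(n_pixels)]
carrier = [rng.random() for _ in range(n_pixels)]
delta, adv, score = run_pipeline(image, carrier, weights)
```

With an untrained stand-in decoder the alignment score is arbitrary; training is what would drive it toward 1.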

System Modules

Encoder

Extracts visual embeddings z from input images x to serve as the 'content' the noise should represent

Model or implementation: CLIP ViT-B/32 (frozen)

Decoder

Generates adversarial perturbation δ conditioned on the input embedding z

Model or implementation: Decoder network (architecture not fully detailed in summary, initialized from scratch)

K-Augmentation

Increases diversity by creating multiple shuffled versions of original images to be combined with generated noise

Model or implementation: Algorithmic operation
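
One plausible reading of K-augmentation is sketched below: each generated noise is paired with K carrier images drawn from freshly shuffled copies of the batch. The function names and the exact pairing scheme are assumptions made for illustration; the summary only states that noise and images are duplicated and shuffled.

```python
import random


def k_augment(images, noises, k, rng):
    """Pair every noise with K carrier images from shuffled copies of the batch.

    Returns (carrier_index, noise_index) pairs. The noise index doubles as the
    index of the source image whose embedding serves as the target.
    """
    assert len(images) == len(noises)
    n = len(images)
    pairs = []
    for _ in range(k):
        order = list(range(n))
        rng.shuffle(order)  # a fresh permutation of the batch for each copy
        pairs.extend((carrier, target) for target, carrier in enumerate(order))
    return pairs


def apply_pairs(images, noises, pairs):
    """Build adversarial images: carrier pixels plus noise, clipped to [0, 1]."""
    return [
        [min(1.0, max(0.0, p + d)) for p, d in zip(images[c], noises[t])]
        for c, t in pairs
    ]


rng = random.Random(0)
images = [[rng.random() for _ in range(4)] for _ in range(3)]
noises = [[rng.uniform(-16 / 255, 16 / 255) for _ in range(4)] for _ in range(3)]
pairs = k_augment(images, noises, k=5, rng=rng)
adversarial = apply_pairs(images, noises, pairs)
```

A batch of 3 images with K = 5 yields 15 adversarial samples. A random permutation can occasionally pair an image with its own noise, which this sketch does not filter out.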

Novel Architectural Elements

Self-supervised 'Pre-training + Fine-tuning' pipeline for adversarial noise generation
Use of the original image as its own target supervision (generating noise that mimics the source image's features when added to a random image)
K-augmentation strategy for stabilizing self-supervised noise training

Modeling

Base Model: CLIP ViT-B/32 (Surrogate)

Training Method: Self-supervised training with contrastive loss

Objective Functions:

Purpose: Pre-training loss to maximize similarity between adversarial noise and original image embeddings.

Formally: Contrastive loss maximizing cosine similarity of positive pairs (z_i, z_adv_i) and minimizing negatives.
Purpose: Fine-tuning loss for retrieval tasks.

Formally: Bidirectional contrastive loss (L_Bi) ensuring retrieval in both image-to-text and text-to-image directions.
Purpose: Fine-tuning loss for general tasks.

Formally: Cosine similarity loss (L_Cos) aligning adversarial embedding directly with target embedding.
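
The three objectives can be illustrated with a dependency-free sketch. The InfoNCE-style form of the contrastive loss and the equal weighting of the two retrieval directions are assumptions; the summary does not give the exact formulas. Embeddings are plain lists of floats, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce(queries, keys, temperature):
    """Mean cross-entropy over queries, with the same-index key as the positive."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max to stabilise the log-sum-exp
        log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_norm - logits[i]
    return total / len(queries)


def bidirectional_loss(image_emb, text_emb, temperature):
    """Average of the image-to-text and text-to-image contrastive terms."""
    return 0.5 * (info_nce(image_emb, text_emb, temperature)
                  + info_nce(text_emb, image_emb, temperature))


def cosine_loss(adv_emb, target_emb):
    """One minus the mean cosine similarity of matched pairs."""
    sims = [cosine(a, t) for a, t in zip(adv_emb, target_emb)]
    return 1.0 - sum(sims) / len(sims)


targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # positives deliberately mismatched

loss_aligned = info_nce(targets, aligned, temperature=0.07)
loss_shuffled = info_nce(targets, shuffled, temperature=0.07)
```

The aligned batch gives a much lower contrastive loss than the mismatched one, which is the behaviour the pre-training objective relies on.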

Adaptation: Fine-tuning on downstream datasets (MSCOCO, Flickr30K, SNLI-VE)

Training Data:

Pre-training: LAION-400M
Fine-tuning: MSCOCO, Flickr30K, SNLI-VE (top 1000 images)

Key Hyperparameters:

learning_rate: 1e-4
batch_size: 600 per GPU
optimizer: AdamW
pre_training_steps: 520,000
fine_tuning_epochs: 20
epsilon: 16/255 (L_inf constraint)
initial_temperature: 1
final_temperature: 0.07
K_augmentation_factor: 5
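
For reference, the listed values can be collected in a small configuration object. The class and field names are hypothetical, the [0, 1] pixel scale for epsilon is an assumption, and the global batch size assumes plain data parallelism across GPUs. `project_linf` shows what the L_inf constraint means for a noise vector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnyAttackConfig:
    # Values copied from the hyperparameter list; names are illustrative.
    learning_rate: float = 1e-4
    batch_size_per_gpu: int = 600
    optimizer: str = "AdamW"
    pre_training_steps: int = 520_000
    fine_tuning_epochs: int = 20
    epsilon: float = 16 / 255  # L_inf budget, assuming pixels scaled to [0, 1]
    initial_temperature: float = 1.0
    final_temperature: float = 0.07
    k_augmentation_factor: int = 5

    def global_batch_size(self, num_gpus: int) -> int:
        """Images per optimizer step, assuming plain data parallelism."""
        return self.batch_size_per_gpu * num_gpus


def project_linf(delta, epsilon):
    """Clamp every noise value into [-epsilon, epsilon]."""
    return [max(-epsilon, min(epsilon, d)) for d in delta]


cfg = AnyAttackConfig()
clamped = project_linf([0.2, -0.5, 0.01], cfg.epsilon)
```

Under the data-parallel assumption, `cfg.global_batch_size(3)` gives 1800 images per step for a three-GPU run.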

Compute: Pre-training: 3 NVIDIA A100 80GB GPUs. Fine-tuning/Generation: Single NVIDIA A100 80GB.

Comparison to Prior Work

vs. AttackVLM: AnyAttack is self-supervised and does not require text guidance or text-to-image models.
vs. SU/SASD-WS: AnyAttack uses a pre-training paradigm on massive data rather than optimization on specific datasets; does not require target labels during training.
vs. Co-Attack [not cited in paper comparison table]: Co-Attack focuses on untargeted attacks, whereas AnyAttack targets specific outcomes.
vs. AdvCLIP [not cited in paper comparison table]: AdvCLIP uses topological perturbations for contrastive learning; AnyAttack focuses on generative noise foundation models.

Limitations

Requires fine-tuning for optimal performance on specific downstream tasks (though pre-trained model provides a strong base).
Evaluation on commercial VLMs is limited to API access which may change over time.
Computational cost of pre-training on LAION-400M is significant compared to purely optimization-based attacks.

Reproducibility

Code: https://jiamingzhang94.github.io/anyattack/

Publicly available code (URL provided). Uses open datasets (LAION-400M, MSCOCO, etc.). Surrogate models (CLIP) are open source. Commercial VLM APIs used for evaluation are closed source.

📊 Experiments & Results

Evaluation Setup

Transfer-based black-box attack evaluation across multiple VLMs and tasks.

Benchmarks:

MSCOCO (Image-Text Retrieval & Image Captioning)
Flickr30K (Image-Text Retrieval)
SNLI-VE (Multimodal Classification)

Metrics:

Attack Success Rate (ASR)
Recall@K (R@1, R@5, R@10)
Captioning metrics: SPICE, BLEU-4, METEOR, ROUGE-L, CIDEr
Statistical methodology: Not explicitly reported in the paper
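
Recall@K can be computed from the rank at which the target item is retrieved for each query. The sketch below takes R@Mean to be the average of R@1, R@5, and R@10, which is a common convention but an assumption here; the ranks are made up for illustration.

```python
def recall_at_k(ranks, k):
    """Percentage of queries whose target item is ranked within the top k (1-based)."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def recall_summary(ranks, ks=(1, 5, 10)):
    """Recall at each cutoff plus their mean."""
    scores = {f"R@{k}": recall_at_k(ranks, k) for k in ks}
    scores["R@Mean"] = sum(scores.values()) / len(ks)
    return scores


# Hypothetical ranks of the target item for eight queries.
ranks = [1, 3, 7, 12, 1, 5, 30, 2]
summary = recall_summary(ranks)
```

For these ranks, R@1 is 25.0, R@5 is 62.5, and R@10 is 75.0.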

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Image-Text Retrieval results on MSCOCO showing superiority of AnyAttack variants over baselines across different model architectures.
MSCOCO (ViT-B/16)	R@Mean	20.86	35.88	+15.02
MSCOCO (ViT-B/32)	R@Mean	15.22	33.66	+18.44
MSCOCO (ViT-L/14)	R@Mean	14.64	33.18	+18.54
Multimodal Classification results on SNLI-VE demonstrating high attack success rate.
SNLI-VE	Accuracy (ASR)	24.8	44.8	+20.0
Transferability to commercial VLMs (ASR: % of responses labeled highly/partially relevant to target).
Commercial APIs	ASR (Google Gemini)	5	31	+26
Commercial APIs	ASR (OpenAI GPT-4o mini)	28	38	+10
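
The commercial-API ASR defined above reduces to a simple ratio over labeled responses. The label strings and the sample annotations below are hypothetical; only the rule that highly or partially relevant responses count as successes comes from the summary.

```python
def attack_success_rate(labels, success_labels=("highly_relevant", "partially_relevant")):
    """Percentage of responses whose relevance label counts as a success."""
    if not labels:
        raise ValueError("need at least one labeled response")
    hits = sum(1 for label in labels if label in success_labels)
    return 100.0 * hits / len(labels)


# Hypothetical annotations for ten model responses.
labels = [
    "highly_relevant", "irrelevant", "partially_relevant", "irrelevant",
    "irrelevant", "highly_relevant", "irrelevant", "irrelevant",
    "partially_relevant", "irrelevant",
]
asr = attack_success_rate(labels)
```

Four of the ten sample responses count as successes, so `asr` is 40.0.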

Experiment Figures

Ablation study on MSCOCO image-text retrieval comparing training strategies (scratch vs pre-trained) and loss functions.

Efficiency comparison (Memory vs Time) for generating adversarial images.

Main Takeaways

AnyAttack consistently outperforms existing targeted attack methods across retrieval, classification, and captioning tasks.
The 'pre-training + fine-tuning' paradigm is highly effective, allowing the generator to learn robust noise patterns that transfer well.
Auxiliary models significantly boost attack transferability (e.g., +15.8% improvement on ViT-L/14 retrieval).
The method exposes systemic vulnerabilities in major commercial VLMs (Gemini, GPT-4o), which are susceptible to these transferred attacks.

📚 Prerequisite Knowledge

Prerequisites

Adversarial examples / perturbations
Vision-Language Models (CLIP, BLIP)
Self-supervised learning (Contrastive learning)
Transfer-based attacks

Key Terms

VLMs: Vision-Language Models—models that process both images and text to perform tasks like captioning or retrieval.

Adversarial Noise: Subtle, often imperceptible perturbations added to an image to mislead a machine learning model.

Surrogate Model: A model accessible to the attacker used to generate adversarial examples that are expected to transfer to the unknown target model.

Transferability: The ability of an adversarial example generated on one model (surrogate) to successfully fool a different model (target).

LAION-400M: A massive open dataset containing 400 million image-text pairs, used here for pre-training the noise generator.

K-augmentation: A strategy proposed in this paper where adversarial noise and images are duplicated and shuffled to increase training diversity.

Contrastive Loss: A loss function that pulls positive pairs (similar representations) together and pushes negative pairs apart in the embedding space.

Cosine Similarity: A metric measuring the cosine of the angle between two vectors, used here to align adversarial embeddings with target embeddings.

Bi-directional Loss: A retrieval-specific objective enforcing that the adversarial image retrieves the target text AND the target text retrieves the adversarial image.