Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness

📝 Paper Summary

Adversarial Robustness, Vision-Language Models (VLMs)

The paper introduces Double Visual Defense, a method that integrates adversarial training into both web-scale CLIP pre-training and LLaVA visual instruction tuning to significantly improve VLM robustness without sacrificing clean performance.

Core Problem

Current Vision-Language Models (VLMs) are highly vulnerable to adversarial visual attacks that cause them to misclassify or hallucinate. Existing defenses rely on lightweight, post-hoc fine-tuning of pre-trained encoders, which leads to overfitting and degrades performance on clean (non-attacked) data.

Why it matters:

Adversarial attacks can force VLMs to propagate misinformation, defraud users, or bypass safety guardrails in automated decision-making systems.
Post-hoc defenses (like TeCoA and FARE) sacrifice the zero-shot generalization capabilities that make models like CLIP useful in the first place.
Ensuring VLM safety is critical as they are increasingly deployed in public-facing applications.

Concrete Example: Under a targeted attack where an image is perturbed to force a specific output, a standard LLaVA model might be tricked into outputting a phishing link or misinformation. Existing defenses might block this but then fail to correctly caption a normal, unperturbed image of a car.

Key Novelty

Double Visual Defense (Adversarial Pre-training + Adversarial Instruction Tuning)

Integrates adversarial training into the massive web-scale pre-training phase of CLIP (creating Δ-CLIP), rather than just fine-tuning afterwards, preventing the 'catastrophic forgetting' of clean performance seen in prior work.
Introduces 'Adversarial Visual Instruction Tuning' for the LLaVA stage, where the language model is trained to predict correct tokens despite adversarial perturbations applied to the input images.

Evaluation Highlights

Δ-CLIP improves robust accuracy on Stanford Cars by roughly 70 percentage points over prior robust CLIP models (TeCoA, FARE) while maintaining clean performance.
Δ²-LLaVA improves robustness by ~30 CIDEr points on Image Captioning (COCO) and ~20 accuracy points on Visual Question Answering (VQAv2) compared to prior art.
Δ²-LLaVA-8 achieves a low Attack Success Rate (ASR) of 3.3% under strong targeted attacks (epsilon=16/255), compared with 51.7% for the baseline.

Breakthrough Assessment

9/10

This work fundamentally shifts VLM defense from post-hoc patches to foundational training. It achieves a rare 'free lunch': state-of-the-art robustness with negligible loss (and sometimes gains) in clean performance/reasoning.

⚙️ Technical Details

Problem Definition

Setting: Defending open-set Vision-Language Models against visual adversarial perturbations during both zero-shot classification and autoregressive generation.

Inputs: Image x, possibly carrying an adversarial perturbation delta, and a text prompt/instruction y.

Outputs: Predicted text response or classification label.

Pipeline Flow

Stage 1: Adversarial CLIP Pre-training (Δ-CLIP)
Stage 2: Adversarial Visual Instruction Tuning (Δ²-LLaVA)

System Modules

Vision Encoder (Δ-CLIP)

Extract robust visual features from images, resistant to perturbations

Model or implementation: ViT-H/14 (trained from scratch on DataComp-1B)

Language Decoder (LLaVA)

Generate text response based on visual features and text instruction

Model or implementation: Vicuna-v1.5-7B (implied by LLaVA-1.5 recipe)

Novel Architectural Elements

Integration of adversarial training loop directly into the multi-stage web-scale pre-training pipeline (DataComp-1B scale).
Adversarial Visual Instruction Tuning: applying PGD attacks to input images during the instruction-tuning phase of LLaVA, forcing the LLM to decode correctly despite visual noise.

Modeling

Base Model: CLIP ViT-H/14 (Visual Encoder) + LLaVA-1.5 (Vicuna-7B base)

Training Method: Adversarial Training (PGD-based)

Objective Functions:

Purpose: Robust Contrastive Learning (CLIP).

Formally: Standard CLIP contrastive loss, minimized over encoder weights theta on images perturbed by the loss-maximizing delta within the epsilon budget: min_theta max_{||delta||_inf <= eps} L_contrastive(f_theta(x+delta), text).

Purpose: Robust Autoregressive Generation (LLaVA).

Formally: Standard language modeling loss, minimized over language-model parameters phi on images perturbed in the same way: min_phi max_{||delta||_inf <= eps} L_LM(response | f_theta(x+delta), instruction).
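
The maximization over delta in the objectives above is usually approximated with a few PGD steps. The following is a minimal sketch of an L-infinity PGD attack, not the paper's implementation: the toy logistic loss stands in for the CLIP or language-modeling loss, and the step-size heuristic, random start, and all function names are assumptions made for illustration. Only the epsilon value (4/255) and the step count (PGD-3) mirror settings listed in this summary.

```python
import math
import random


def loss_and_input_grad(w, x, y):
    """Logistic loss of a toy linear classifier and its gradient w.r.t. the input x.

    Hypothetical stand-in for the real training loss; y is +1 or -1.
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = math.log1p(math.exp(-margin))
    coeff = -y / (1.0 + math.exp(margin))  # derivative of the loss w.r.t. (w . x)
    return loss, [coeff * wi for wi in w]


def sign(value):
    return (value > 0) - (value < 0)


def pgd_linf(w, x, y, eps, num_steps, seed=0):
    """Approximate the max over ||delta||_inf <= eps with signed-gradient ascent."""
    rng = random.Random(seed)
    step_size = 2.5 * eps / num_steps  # common heuristic, not taken from the paper
    # Random start inside the eps-ball around the clean input.
    x_adv = [xi + rng.uniform(-eps, eps) for xi in x]
    for _ in range(num_steps):
        _, grad = loss_and_input_grad(w, x_adv, y)
        # Ascent step: move each pixel in the direction that increases the loss.
        x_adv = [xa + step_size * sign(g) for xa, g in zip(x_adv, grad)]
        # Project back into the eps-ball, then into the valid pixel range [0, 1].
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
        x_adv = [min(max(xa, 0.0), 1.0) for xa in x_adv]
    return x_adv
```

In adversarial training, the model parameters would then be updated on the loss evaluated at the returned perturbed input rather than at the clean one.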

Adaptation: LoRA used for LLaVA tuning; Full training for CLIP

Trainable Parameters: CLIP: Full weights (from scratch). LLaVA: LoRA adapters + Vision Encoder (partial LR).
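
To make the LoRA adaptation concrete, here is a small sketch of a linear layer with a low-rank adapter. It is illustrative only: the class name, initialization scale, rank, and alpha are assumptions, and the summary does not state which LoRA settings the authors used.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # pre-trained weights, kept frozen
        self.scale = alpha / rank
        # A gets a small random init; B starts at zero so the adapter is a no-op at first.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        # Only A and B are trained; the frozen weight is excluded.
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)
```

For a weight of shape out_dim x in_dim, the adapter trains rank * (in_dim + out_dim) values instead of out_dim * in_dim, which is where the parameter savings come from when the rank is small.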

Training Data:

Pre-training: DataComp-1B (1:1 mix of synthetic/web captions)
Instruction Tuning: LLaVA-1.5 dataset mixture

Key Hyperparameters:

clip_stages: Stage 1: 112x112 (PGD-2, eps=4/255); Stage 2: 224x224 (PGD-3, eps=4/255); Stage 3: 336x336 (PGD-4, eps=8/255)
llava_attacks: PGD-3 (eps=4/255) for model Δ²-LLaVA-4; PGD-5 (eps=8/255) for model Δ²-LLaVA-8
vision_encoder_lr_ratio: 1/20 (during LLaVA tuning)
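
The schedule above can be written down as a small configuration object. This is a sketch for readability, not the authors' configuration format: the class, field, and key names are hypothetical, and the base learning rate in the usage note is an assumed placeholder. The numeric values are copied from the hyperparameters listed here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttackStage:
    name: str
    resolution: Optional[int]  # input side length in pixels; None where not listed
    pgd_steps: int
    eps: float                 # L-infinity budget on a [0, 1] pixel scale


# Three-stage adversarial CLIP pre-training schedule.
CLIP_STAGES = [
    AttackStage("clip_stage_1", 112, 2, 4 / 255),
    AttackStage("clip_stage_2", 224, 3, 4 / 255),
    AttackStage("clip_stage_3", 336, 4, 8 / 255),
]

# Attack settings used during adversarial visual instruction tuning.
LLAVA_ATTACKS = {
    "delta2_llava_4": AttackStage("llava_eps4", None, 3, 4 / 255),
    "delta2_llava_8": AttackStage("llava_eps8", None, 5, 8 / 255),
}

VISION_ENCODER_LR_RATIO = 1 / 20


def vision_encoder_lr(base_lr):
    """Learning rate for the vision encoder, scaled down relative to the base rate."""
    return base_lr * VISION_ENCODER_LR_RATIO
```

As a usage note, with an assumed base learning rate of 2e-4 the vision encoder would be updated at 1e-5, which reflects the 1/20 ratio rather than any value confirmed by the paper.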

Compute: CLIP: ~1 week on TPU v4-512 pod. LLaVA: ~1.5 days on 4x8xA5000 GPUs.

Comparison to Prior Work

vs. TeCoA/FARE: Δ-CLIP trains from scratch on web-scale data (DataComp-1B) rather than fine-tuning on ImageNet, preventing overfitting and preserving zero-shot ability.
vs. Standard LLaVA: Incorporates adversarial instruction tuning (Double Defense) rather than just using a robust encoder.
vs. Robust-CLIP-based LLaVA (Naive combination): The paper shows that plugging a robust CLIP encoder into LLaVA beats the baseline but remains inferior to the full Δ² approach, which adds adversarial instruction tuning as a second defense layer.

Limitations

High computational cost for pre-training (TPU v4-512 pod for a week) compared to lightweight fine-tuning methods.
Adversarial visual instruction tuning resulted in slightly lower clean performance scores on MME-Perception compared to using Δ-CLIP alone (trade-off).
Requires access to massive pre-training datasets (DataComp-1B) to replicate the full pipeline.

Reproducibility

Code: https://doublevisualdefense.github.io/

Code and model weights will be released at https://doublevisualdefense.github.io/. Hyperparameters for attacks (PGD steps, epsilon) and training stages are detailed. Uses public DataComp-1B and LLaVA-1.5 datasets.

📊 Experiments & Results

Evaluation Setup

Zero-shot classification (CLIP), Image Captioning, Visual Question Answering, and Targeted Attack Defense.

Benchmarks:

ImageNet-1k / ImageNet-A / ObjectNet (Zero-shot Image Classification)
COCO / Flickr30k (Image Captioning)
VQAv2 / TextVQA / VizWiz (Visual Question Answering)
POPE / MME-Perception (Hallucination & Perception Benchmark)

Metrics:

Top-1 Accuracy (Clean & Robust)
CIDEr (Captioning)
VQA Accuracy
Attack Success Rate (ASR)
POPE F1 Score

Statistical methodology: Not explicitly reported in the paper
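
Two of the metrics listed above, robust accuracy and Attack Success Rate, reduce to simple counting. The sketch below shows one plausible way to compute them. The function names are hypothetical, and treating a targeted attack as successful when the target string appears in the model output is an assumption, since the summary does not spell out the paper's exact success criterion.

```python
def robust_accuracy(predictions_under_attack, labels):
    """Percentage of examples still classified correctly after the attack."""
    if len(predictions_under_attack) != len(labels) or not labels:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p == y for p, y in zip(predictions_under_attack, labels))
    return 100.0 * correct / len(labels)


def attack_success_rate(outputs, targets):
    """Percentage of attacked outputs that contain the attacker's target string."""
    if len(outputs) != len(targets) or not outputs:
        raise ValueError("need two non-empty sequences of equal length")
    # Assumed criterion: case-insensitive substring match against the target text.
    hits = sum(t.lower() in o.lower() for o, t in zip(outputs, targets))
    return 100.0 * hits / len(outputs)
```

Lower is better for ASR and higher is better for robust accuracy, so the two numbers move in opposite directions as a defense improves.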

Key Results

Benchmark	Metric	Baseline	This Paper	Change
CLIP Robustness: Δ-CLIP vastly outperforms post-hoc methods (TeCoA, FARE) on robustness while maintaining clean accuracy.
ImageNet-1k	Robust Accuracy (APGD-100)	46.1	66.5	+20.4
Stanford Cars	Robust Accuracy (APGD-100)	11.1	86.8	+75.7
LLaVA Robustness: The double defense (Δ²-LLaVA) provides superior protection on downstream VQA and Captioning tasks.
COCO Captioning	Robust CIDEr (Attack eps=4/255)	58.2	89.5	+31.3
VQAv2	Robust Accuracy (Attack eps=8/255)	39.4	61.6	+22.2
Targeted Attacks: Δ²-LLaVA effectively neutralizes attempts to force specific malicious outputs.
Targeted Attack (COCO subset)	Attack Success Rate (ASR, eps=16/255)	51.7	3.3	-48.4
Hallucination & Helpfulness: Unlike prior robust models, Δ²-LLaVA preserves reasoning capabilities and reduces hallucinations.
POPE	F1 Score (Hallucination)	78.0	86.1	+8.1
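
The last column of the table is the "This Paper" value minus the baseline. As a quick consistency check, the snippet below recomputes it from the other two columns. The numbers are copied from the rows above, while the variable and function names are illustrative.

```python
# (benchmark, baseline, this paper, reported change), copied from the table rows.
ROWS = [
    ("ImageNet-1k robust accuracy", 46.1, 66.5, 20.4),
    ("Stanford Cars robust accuracy", 11.1, 86.8, 75.7),
    ("COCO robust CIDEr", 58.2, 89.5, 31.3),
    ("VQAv2 robust accuracy", 39.4, 61.6, 22.2),
    ("Targeted attack ASR", 51.7, 3.3, -48.4),
    ("POPE F1", 78.0, 86.1, 8.1),
]


def change(baseline, ours):
    """Absolute difference, rounded to the table's one-decimal precision."""
    return round(ours - baseline, 1)


def mismatches(rows):
    """Return the benchmarks whose reported change disagrees with the recomputed one."""
    return [name for name, base, ours, reported in rows if change(base, ours) != reported]
```

A negative change is the desired direction for ASR, whereas the other rows improve when the change is positive.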

Experiment Figures

A radar chart comparing the relative robustness improvement of Δ-CLIP vs TeCoA and FARE across multiple datasets (Stanford Cars, EuroSAT, etc.).

Main Takeaways

Adversarial pre-training from scratch (Δ-CLIP) is far superior to post-hoc adversarial fine-tuning for preserving zero-shot generalization capabilities.
Post-hoc methods (TeCoA, FARE) severely overfit to ImageNet, causing massive robustness drops on out-of-distribution datasets (e.g., Stanford Cars).
The 'Double Defense' strategy (adversarial pre-training + adversarial instruction tuning) yields the highest robustness against strong attacks (e.g., epsilon=16/255).
Contrary to the common trade-off, this robustification method results in models that hallucinate *less* and reason *better* than prior robust baselines, matching standard clean model utility.

📚 Prerequisite Knowledge

Prerequisites

Understanding of CLIP (Contrastive Language-Image Pre-training)
Familiarity with LLaVA (Large Language-and-Vision Assistant) architecture
Basic knowledge of adversarial attacks (PGD) and adversarial training

Key Terms

CLIP: Contrastive Language-Image Pre-training—a model trained to match images and text in a shared embedding space.

LLaVA: Large Language-and-Vision Assistant—a VLM that connects a CLIP vision encoder to a Large Language Model (LLM) for instruction following.

Adversarial Training: A defense method where models are trained on attacked (perturbed) examples to learn invariance to those attacks.

PGD: Projected Gradient Descent—an iterative method for generating strong adversarial examples by maximizing loss within a perturbation constraint.

Zero-shot: The ability of a model to perform a task (like classification) without having seen explicit examples of that specific task during training.

CIDEr: Consensus-based Image Description Evaluation—a metric used to evaluate the quality of image captions by comparing them to human references.

ASR: Attack Success Rate—the percentage of adversarial attacks that successfully fool the model into producing a target (incorrect) output.

Hallucination: When a model generates output that is factually incorrect or irrelevant to the input (e.g., describing objects not present in the image).

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and trains rank-decomposition matrices.

TeCoA: Text-Guided Contrastive Adversarial Training—a prior method for robustifying CLIP via post-hoc fine-tuning.

FARE: Fine-tuning for Adversarially Robust Embeddings, another prior method for robustifying CLIP via unsupervised adversarial fine-tuning.
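
To tie the "CLIP" and "Zero-shot" terms together, the sketch below shows how zero-shot classification works once image and text embeddings share a space: the predicted class is the one whose text embedding has the highest cosine similarity with the image embedding. The embeddings here are hand-made toy vectors and the function names are hypothetical. A real pipeline would obtain the embeddings from the CLIP image and text encoders.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the best-matching class name and the per-class cosine similarities."""
    image = normalize(image_embedding)
    scores = {
        name: sum(a * b for a, b in zip(image, normalize(text)))
        for name, text in class_text_embeddings.items()
    }
    return max(scores, key=scores.get), scores
```

An adversarial perturbation succeeds against this classifier when it moves the image embedding closer to the wrong text embedding, which is why robustness of the vision encoder matters for zero-shot use.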