
Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness

Z Wang, C Xie, B Bartoldson, B Kailkhura
Lawrence Livermore National Laboratory
arXiv, January 2025
MM Pretraining Factuality

📝 Paper Summary

Adversarial Robustness, Vision-Language Models (VLMs)
The paper introduces Double Visual Defense, a method that integrates adversarial training into both web-scale CLIP pre-training and LLaVA visual instruction tuning to significantly improve VLM robustness without sacrificing clean performance.
Core Problem
Current Vision-Language Models (VLMs) are highly vulnerable to adversarial visual attacks that cause them to misclassify or hallucinate. Existing defenses rely on lightweight, post-hoc fine-tuning of pre-trained encoders, which leads to overfitting and degrades performance on clean (non-attacked) data.
Why it matters:
  • Adversarial attacks can force VLMs to propagate misinformation, defraud users, or bypass safety guardrails in automated decision-making systems.
  • Post-hoc defenses (like TeCoA and FARE) sacrifice the zero-shot generalization capabilities that make models like CLIP useful in the first place.
  • Ensuring VLM safety is critical as they are increasingly deployed in public-facing applications.
Concrete Example: Under a targeted attack where an image is perturbed to force a specific output, a standard LLaVA model might be tricked into outputting a phishing link or misinformation. Existing defenses might block this but then fail to correctly caption a normal, unperturbed image of a car.
Key Novelty
Double Visual Defense (Adversarial Pre-training + Adversarial Instruction Tuning)
  • Integrates adversarial training into the massive web-scale pre-training phase of CLIP (creating Δ-CLIP), rather than just fine-tuning afterwards, preventing the 'catastrophic forgetting' of clean performance seen in prior work.
  • Introduces 'Adversarial Visual Instruction Tuning' for the LLaVA stage, where the language model is trained to predict correct tokens despite adversarial perturbations to the image embeddings.
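Both stages rest on the same adversarial-training idea: craft a worst-case perturbation inside a small budget, then update the model on the perturbed input. The following is a minimal sketch of that loop, assuming an L-infinity PGD attack on a toy linear classifier in pure Python. The model, the budget `eps`, the step size `alpha`, the step count, and the learning rate are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import math

def margin(w, b, x):
    """Linear score w.x + b of the toy classifier."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def logistic_loss(w, b, x, y):
    """Numerically stable softplus(-y * margin) for a label y in {-1, +1}."""
    z = -y * margin(w, b, x)
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def loss_scale(w, b, x, y):
    """Shared factor -y * sigmoid(-y * margin) appearing in all gradients."""
    z = -y * margin(w, b, x)
    return -y / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def pgd_linf(w, b, x, y, eps, alpha, steps):
    """Inner maximization: ascend the loss, staying within eps of x and in [0, 1]."""
    x_adv = list(x)
    for _ in range(steps):
        scale = loss_scale(w, b, x_adv, y)
        grad_x = [scale * wi for wi in w]  # d(loss)/dx
        x_adv = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad_x)]
        # Project back onto the eps-ball around x, then onto the valid pixel range.
        x_adv = [min(max(xa, xi - eps, 0.0), xi + eps, 1.0)
                 for xa, xi in zip(x_adv, x)]
    return x_adv

def adversarial_training_step(w, b, x, y, eps, alpha, steps, lr):
    """Outer minimization: one gradient-descent step on the adversarial example."""
    x_adv = pgd_linf(w, b, x, y, eps, alpha, steps)
    scale = loss_scale(w, b, x_adv, y)
    new_w = [wi - lr * scale * xa for wi, xa in zip(w, x_adv)]
    new_b = b - lr * scale
    return new_w, new_b, x_adv

# Toy data (hypothetical values chosen only for illustration).
w, b = [1.0, -2.0, 0.5], 0.5
x, y = [0.5, 0.5, 0.5], 1
eps, alpha, steps, lr = 4 / 255, 1 / 255, 10, 0.1

new_w, new_b, x_adv = adversarial_training_step(w, b, x, y, eps, alpha, steps, lr)
print("clean loss:", logistic_loss(w, b, x, y))
print("adversarial loss before update:", logistic_loss(w, b, x_adv, y))
print("adversarial loss after update:", logistic_loss(new_w, new_b, x_adv, y))
```

In a real VLM the toy classifier would be replaced by the CLIP contrastive objective or the LLaVA next-token loss, and the gradients would come from automatic differentiation rather than a closed form.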
Evaluation Highlights
  • Δ-CLIP achieves ~70% absolute robustness improvement on Stanford Cars compared to prior robust CLIP models (TeCoA, FARE) while maintaining clean performance.
  • Δ²-LLaVA improves robustness by ~30% on Image Captioning (COCO) and ~20% on Visual Question Answering (VQAv2) compared to prior art.
  • Δ²-LLaVA-8 achieves an Attack Success Rate (ASR) of only 3.3% under strong targeted attacks (ε = 16/255), whereas baselines are attacked successfully far more often.
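For readers unfamiliar with the metric, the sketch below shows one common way to compute a targeted Attack Success Rate: the fraction of attacked inputs whose model output contains the attacker's target string. The case-insensitive substring criterion and the sample strings are assumptions made for illustration; this summary does not state the paper's exact matching rule.

```python
def attack_success_rate(outputs, target):
    """Fraction of model outputs that contain the attacker's target string.

    Assumes a case-insensitive substring match counts as a successful attack.
    """
    if not outputs:
        raise ValueError("outputs must be non-empty")
    needle = target.lower()
    hits = sum(needle in out.lower() for out in outputs)
    return hits / len(outputs)

# Hypothetical captions produced for four attacked images.
sample_outputs = [
    "A red car parked on the street.",
    "TARGET STRING",
    "Two dogs playing in a park.",
    "A plate of food on a table.",
]
asr = attack_success_rate(sample_outputs, "target string")
print(f"ASR: {asr:.1%}")
```

A lower ASR means the attacker succeeds less often, so it is the defender's figure of merit here.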
Breakthrough Assessment
9/10
This work fundamentally shifts VLM defense from post-hoc patches to foundational training. It achieves a rare 'free lunch': state-of-the-art robustness with negligible loss (and sometimes gains) in clean performance/reasoning.