
Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models

Christian Schlarmann, N. Singh, Francesco Croce, Matthias Hein
University of Tübingen
International Conference on Machine Learning (2024)
MM Factuality Benchmark

📝 Paper Summary

Adversarial Robustness, Vision-Language Models (LVLMs), Foundation Model Alignment
FARE creates robust CLIP vision encoders by fine-tuning them so that their embeddings of adversarial examples match the original frozen model's embeddings of the corresponding clean images, preserving zero-shot capabilities without retraining downstream LVLMs.
Core Problem
Large Vision-Language Models (LVLMs) are highly vulnerable to adversarial attacks on images, but existing robust fine-tuning methods (like TeCoA) degrade clean performance and distort the embedding space.
Why it matters:
  • Malicious actors can use imperceptible perturbations to spread misinformation, jailbreak models, or defraud users of commercial LVLMs
  • Current supervised adversarial defenses optimize for cosine similarity on specific datasets (ImageNet), which misaligns feature magnitudes and hurts zero-shot generalization on unseen tasks
  • Retraining entire LVLMs to accommodate a new robust encoder is computationally expensive and impractical
Concrete Example: When the original CLIP model processes an adversarial image, it misclassifies the image (e.g., labeling a panda as a gibbon). A TeCoA-trained robust CLIP might resist the attack, but it changes the embedding scale so drastically that, when the encoder is plugged into LLaVA, the language model outputs nonsense even for clean images.
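The scale issue described above can be illustrated with a minimal sketch. It assumes nothing about CLIP itself: the two-dimensional vectors are made-up stand-ins for embeddings, and the helper names are hypothetical. It only shows that cosine similarity cannot see a change in embedding norm, while L2 distance can.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; invariant to rescaling either vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def l2_distance(u, v):
    """Euclidean distance; sensitive to both direction and magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical embedding from the original encoder (toy values, not real CLIP output).
original = [3.0, 4.0]
# Same direction, ten times the norm: a stand-in for a drifted embedding scale.
rescaled = [10 * x for x in original]

cos = cosine_similarity(original, rescaled)  # direction is unchanged
dist = l2_distance(original, rescaled)       # magnitude change is visible here
print(f"cosine similarity: {cos:.3f}, L2 distance: {dist:.3f}")
```

A loss built only on cosine similarity would treat these two embeddings as equivalent, even though a downstream projection layer trained on the original scale would receive a very different input.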
Key Novelty
Fine-tuning for Adversarially Robust Embeddings (FARE)
  • Uses an unsupervised feature-consistency loss that trains the robust encoder to produce embeddings for adversarial images that are as close as possible (in L2 distance) to the embeddings the *original frozen* encoder produces for the corresponding clean images
  • Aligns both direction (cosine similarity) and magnitude of the embedding, making the robust encoder a drop-in replacement for LVLMs without needing to retrain the language model or projection layers
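The two points above can be sketched on a toy problem. This is an illustration under stated assumptions, not the authors' implementation: the encoders are small linear maps (`W_org`, `W_ft`) rather than CLIP vision transformers, the inner attack is a sign-gradient PGD loop inside an l-infinity ball (a common choice; the summary does not specify the attack), and `eps`, `step`, and all function names are hypothetical. Only the inner maximization and the loss are shown; the outer fine-tuning step that updates the encoder is omitted.

```python
import random

def matvec(W, x):
    """Apply a linear 'encoder' W to input x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sq_l2(u, v):
    """Squared L2 distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def sign(g):
    return (g > 0) - (g < 0)

def fare_loss(W_ft, z, target):
    """Distance between the fine-tuned embedding of z and the frozen clean target."""
    return sq_l2(matvec(W_ft, z), target)

def pgd_attack(W_ft, x, target, eps, step, n_steps, rng):
    """Search the l-inf ball around x for an input that maximizes fare_loss."""
    # Random start: at z == x the gradient can be exactly zero.
    z = [xi + rng.uniform(-eps, eps) for xi in x]
    for _ in range(n_steps):
        residual = [a - b for a, b in zip(matvec(W_ft, z), target)]
        # Gradient of ||W z - t||^2 with respect to z is 2 W^T (W z - t).
        grad = [2 * sum(W_ft[i][j] * residual[i] for i in range(len(W_ft)))
                for j in range(len(z))]
        z = [zj + step * sign(g) for zj, g in zip(z, grad)]
        # Project back into the ball of radius eps around the clean input.
        z = [min(max(zj, xj - eps), xj + eps) for zj, xj in zip(z, x)]
    return z

W_org = [[1.0, 0.5], [-0.5, 1.0]]   # frozen original encoder (toy)
W_ft = [row[:] for row in W_org]    # fine-tuned copy, initialized from the original
x = [0.2, 0.7]                      # clean input (toy)
eps = 0.1                           # toy perturbation budget

target = matvec(W_org, x)           # clean embedding from the frozen encoder
z_adv = pgd_attack(W_ft, x, target, eps, step=0.025, n_steps=10,
                   rng=random.Random(0))

clean_loss = fare_loss(W_ft, x, target)
adv_loss = fare_loss(W_ft, z_adv, target)
print(f"clean loss: {clean_loss:.4f}, adversarial loss: {adv_loss:.4f}")
```

Because the target comes from the frozen encoder on the clean image, no labels or text prompts are needed, which is what makes the objective unsupervised. Fine-tuning would then update `W_ft` to shrink `adv_loss` while the target stays fixed.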
Architecture
[Figure: Eq. (3), the conceptual training objective for FARE]
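The figure itself is not reproduced here. Based on the description in this summary, the objective can be written roughly as follows. The symbols are assumptions for illustration and may differ from the paper's notation: \phi_{\mathrm{Org}} is the frozen original encoder, \phi the encoder being fine-tuned, x_i the clean training images, and \varepsilon an l-infinity perturbation budget.

```latex
L_{\mathrm{FARE}}(\phi, x)
  = \max_{\|z - x\|_\infty \le \varepsilon}
    \left\| \phi(z) - \phi_{\mathrm{Org}}(x) \right\|_2^2,
\qquad
\phi_{\mathrm{FT}}
  = \operatorname*{arg\,min}_{\phi} \sum_{i=1}^{n} L_{\mathrm{FARE}}(\phi, x_i)
```

The inner maximum searches for the worst-case perturbed image z, and the outer minimum fine-tunes the encoder so that even this worst case stays close to the original clean embedding.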
Evaluation Highlights
  • Achieves robustness against targeted imperceptible attacks on LLaVA and OpenFlamingo while preserving clean performance significantly better than supervised baselines (TeCoA)
  • Requires only about 0.2% of the computational cost of the original CLIP training (2 fine-tuning epochs vs. 32 pre-training epochs)
  • Generalizes to downstream tasks (Captioning, VQA) in zero-shot settings without task-specific fine-tuning
Breakthrough Assessment
8/10
Provides a practical, low-cost solution to a major security vulnerability in foundation models. The insight to align with the frozen embedding space rather than ground-truth labels addresses the clean-accuracy trade-off elegantly.