Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models

📝 Paper Summary

Adversarial Robustness · Vision-Language Models (LVLMs) · Foundation Model Alignment

FARE creates robust CLIP vision encoders by fine-tuning them so that their embeddings of adversarial images match the original frozen model's embeddings of the corresponding clean images, preserving zero-shot capabilities without retraining downstream LVLMs.

Core Problem

Large Vision-Language Models (LVLMs) are highly vulnerable to adversarial attacks on images, but existing robust fine-tuning methods (like TeCoA) degrade clean performance and distort the embedding space.

Why it matters:

Malicious actors can use imperceptible perturbations to spread misinformation, jailbreak models, or defraud users of commercial LVLMs
Current supervised adversarial defenses optimize for cosine similarity on specific datasets (ImageNet), which misaligns feature magnitudes and hurts zero-shot generalization on unseen tasks
Retraining entire LVLMs to accommodate a new robust encoder is computationally expensive and impractical

Concrete Example: When an original CLIP model processes an adversarial image, it misclassifies it (e.g., classifying a panda as a gibbon). A TeCoA-trained robust CLIP might resist the attack but changes the embedding scale so drastically that when plugged into LLaVA, the language model outputs nonsense for clean images.

Key Novelty

Fine-tuning for Adversarially Robust Embeddings (FARE)

Uses an unsupervised feature-consistency loss that trains the robust encoder to produce embeddings for adversarial images that are as close as possible (in L2 distance) to the embeddings the *original frozen* encoder produces for the corresponding clean images
Aligns both direction (cosine similarity) and magnitude of the embedding, making the robust encoder a drop-in replacement for LVLMs without needing to retrain the language model or projection layers
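To illustrate why an L2 objective constrains magnitude while a cosine objective does not, here is a minimal sketch with made-up two-dimensional vectors (the values are hypothetical, not taken from the paper). An embedding that points in the same direction as the reference but has a tenth of its norm is a perfect match under cosine similarity, yet it is far away in L2 distance.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; ignores vector length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def squared_l2(u, v):
    """Squared Euclidean distance; sensitive to both direction and length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Hypothetical embeddings, for illustration only.
reference = [3.0, 4.0]   # stands in for the frozen encoder's output (norm 5)
rescaled = [0.3, 0.4]    # same direction, ten times smaller norm

cos = cosine_similarity(reference, rescaled)
dist = squared_l2(reference, rescaled)
print(f"cosine similarity: {cos:.3f}, squared L2 distance: {dist:.2f}")
```

A downstream projection layer that was trained on the reference scale would receive a very different input in the second case, even though a cosine-based loss reports no error.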

Architecture

[Figure: the conceptual training objective for FARE]

Evaluation Highlights

Achieves robustness against targeted imperceptible attacks on LLaVA and OpenFlamingo while preserving clean performance significantly better than supervised baselines (TeCoA)
Requires only 0.2% of the computational cost of the original CLIP training (2 epochs of fine-tuning on ImageNet vs 32 epochs of pre-training on CLIP's much larger original dataset)
Generalizes to downstream tasks (Captioning, VQA) in zero-shot settings without task-specific fine-tuning

Breakthrough Assessment

8/10

Provides a practical, low-cost solution to a major security vulnerability in foundation models. The insight to align with the frozen embedding space rather than ground-truth labels elegantly addresses the trade-off between robustness and clean accuracy.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot adversarial robustness for Vision-Language Models

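Inputs: Clean image x, or an adversarially perturbed version z of it (z stays within an L-infinity ball of radius ε around x)

Outputs: Robust image embedding ϕ_FT(z) that stays close to the original encoder's embedding of the clean image, ϕ_Org(x)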

Pipeline Flow

Input Image (Clean or Adversarial)
Robust CLIP Vision Encoder (FARE-tuned)
Original Frozen CLIP Vision Encoder (Reference)
Loss Calculation (Training) OR LVLM Integration (Inference)

System Modules

Robust CLIP Vision Encoder

Encodes input images (possibly perturbed) into embeddings robust to attacks

Model or implementation: ViT-L/14 (CLIP architecture)

Original CLIP Vision Encoder

Provides the 'ground truth' target embedding for the robust encoder to match

Model or implementation: ViT-L/14 (Frozen)

Novel Architectural Elements

Teacher-Student configuration where the 'Teacher' is the same architecture but frozen (original CLIP) and the 'Student' is adversarially fine-tuned to match the Teacher's feature space via L2 loss
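A minimal sketch of this teacher-student setup, assuming toy linear maps in place of the ViT-L/14 encoders (the matrices, inputs, and learning rate are hypothetical). The student starts as a copy of the teacher, only the student's weights are updated, and a single gradient step on the squared L2 loss pulls the student's embedding of a perturbed input back toward the teacher's embedding of the clean input.

```python
def matvec(W, v):
    """Apply a toy linear 'encoder' W to an input vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def fare_loss(W_student, W_teacher, z, x):
    """Squared L2 distance between student(z) and teacher(x)."""
    s = matvec(W_student, z)
    t = matvec(W_teacher, x)
    return sum((a - b) ** 2 for a, b in zip(s, t))

def student_step(W_student, W_teacher, z, x, lr):
    """One gradient-descent step on the student; the teacher is never modified."""
    s = matvec(W_student, z)
    t = matvec(W_teacher, x)
    residual = [a - b for a, b in zip(s, t)]
    # d/dW ||W z - t||^2 = 2 * residual * z^T (outer product)
    return [[w - lr * 2.0 * r * zj for w, zj in zip(row, z)]
            for row, r in zip(W_student, residual)]

# Hypothetical toy values, for illustration only.
W_teacher = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # frozen original encoder
W_student = [row[:] for row in W_teacher]          # student initialised from the teacher
x = [0.5, 0.2, 0.8]                                # clean input
z = [0.52, 0.18, 0.82]                             # perturbed input near x

loss_before = fare_loss(W_student, W_teacher, z, x)
W_student = student_step(W_student, W_teacher, z, x, lr=0.1)
loss_after = fare_loss(W_student, W_teacher, z, x)
print(f"loss before: {loss_before:.6f}, loss after: {loss_after:.6f}")
```

In actual fine-tuning the update would run over mini-batches with a deep network and an optimizer; the sketch only shows which model is trained and which provides the target.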

Modeling

Base Model: CLIP ViT-L/14

Training Method: Unsupervised Adversarial Fine-Tuning (FARE)

Objective Functions:

Purpose: Force the embedding of the adversarial image to be close to the embedding of the clean image from the original model.

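Formally: L_FARE(x) = max_{||z - x||_∞ ≤ ε} || ϕ_FT(z) - ϕ_Org(x) ||^2_2, where ϕ_FT is the fine-tuned encoder, ϕ_Org is the original frozen encoder, x is the clean image, and z is the adversarial image; the inner maximization is approximated with PGD.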
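The adversarial image z in this objective is found by searching the ε-ball around x for the point whose embedding lies farthest from the frozen encoder's embedding of x. Below is a hedged sketch of that search using PGD with signed gradient steps on a toy linear encoder. The radius and step count follow the hyperparameters listed in this summary; the step size, the random start, the toy weights, and the three-pixel "image" are assumptions made for illustration.

```python
import random

def matvec(W, v):
    """Apply a toy linear 'encoder' W to an input vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def embedding_loss(W_student, W_teacher, z, x):
    """Squared L2 distance between student(z) and teacher(x)."""
    return sum((s - t) ** 2
               for s, t in zip(matvec(W_student, z), matvec(W_teacher, x)))

def clip(value, low, high):
    return max(low, min(high, value))

def random_start(x, eps, rng):
    """Random point in the L-infinity ball around x, kept in the pixel range [0, 1]."""
    return [clip(xi + rng.uniform(-eps, eps), 0.0, 1.0) for xi in x]

def pgd(W_student, W_teacher, x, z, eps, steps, step_size):
    """Signed gradient ascent on the embedding loss, projected onto the eps-ball."""
    target = matvec(W_teacher, x)
    for _ in range(steps):
        residual = [s - t for s, t in zip(matvec(W_student, z), target)]
        # Gradient of ||W z - target||^2 with respect to z is 2 * W^T residual.
        grad = [2.0 * sum(row[j] * r for row, r in zip(W_student, residual))
                for j in range(len(z))]
        stepped = [zj + step_size * ((g > 0) - (g < 0)) for zj, g in zip(z, grad)]
        # Project back onto the eps-ball around x, then onto valid pixel values.
        z = [clip(clip(sj, xi - eps, xi + eps), 0.0, 1.0)
             for sj, xi in zip(stepped, x)]
    return z

EPS = 4 / 255          # perturbation radius listed in the summary
STEPS = 10             # PGD steps listed in the summary
STEP_SIZE = EPS / 4    # assumption: the step size is not stated in the summary

W_teacher = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # toy frozen encoder
W_student = [row[:] for row in W_teacher]          # student before fine-tuning
x = [0.5, 0.2, 0.8]                                # toy "image" with 3 pixels

rng = random.Random(0)
z_start = random_start(x, EPS, rng)
z_adv = pgd(W_student, W_teacher, x, z_start, EPS, STEPS, STEP_SIZE)

loss_start = embedding_loss(W_student, W_teacher, z_start, x)
loss_adv = embedding_loss(W_student, W_teacher, z_adv, x)
print(f"loss at random start: {loss_start:.6f}, loss after PGD: {loss_adv:.6f}")
```

The random start matters here because the student begins as a copy of the teacher, so the loss and its gradient are exactly zero at z = x. With real encoders the gradient would come from automatic differentiation rather than the closed form used above.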

Adaptation: Fine-tuning of vision encoder weights

Training Data:

ImageNet dataset used as unlabeled image source

Key Hyperparameters:

perturbation_radius_epsilon: 4/255 and 2/255
pgd_steps: 10
epochs: 2

Compute: 0.2% of the cost of training the original CLIP model

Comparison to Prior Work

vs. TeCoA: FARE is unsupervised (no text/labels needed) and uses L2 feature matching instead of cross-entropy/cosine similarity. FARE preserves embedding magnitude, allowing drop-in replacement in LVLMs without retraining projection layers.

Limitations

Still requires a fine-tuning phase (though short)
Robustness is bounded by the defined threat model (L-infinity norm)
Does not address text-modality attacks

Reproducibility

Code and models stated as available on GitHub (URL not in text). Training uses standard ImageNet dataset without labels. Adversarial attacks use PGD.

📊 Experiments & Results

Evaluation Setup

Evaluation of robust CLIP encoders plugged into frozen LVLMs (OpenFlamingo, LLaVA) and zero-shot classification

Benchmarks:

ImageNet (Zero-shot Classification)
COCO (Image Captioning)
Flickr30k (Image Captioning)
VQAv2 (Visual Question Answering)
TextVQA (Visual Question Answering)

Metrics:

Zero-shot Accuracy
CIDEr score (Captioning)
VQA Accuracy
Attack Success Rate (ASR)
Statistical methodology: Not explicitly reported in the paper
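As a rough illustration of the Attack Success Rate metric for targeted attacks, the sketch below counts an attack as successful when the model output contains the attacker's target string. This summary does not spell out the exact success criterion, so the substring match, the function name, and the example strings are all assumptions.

```python
def attack_success_rate(outputs, target):
    """Fraction of model outputs that contain the target string (case-insensitive)."""
    if not outputs:
        return 0.0
    hits = sum(target.lower() in out.lower() for out in outputs)
    return hits / len(outputs)

# Hypothetical LVLM outputs on four attacked images, for illustration only.
outputs = [
    "A cat sitting on a sofa.",
    "Please visit example.invalid to claim your prize.",
    "Two people riding bicycles along a river.",
    "A plate of pasta on a wooden table.",
]
target = "visit example.invalid"

asr = attack_success_rate(outputs, target)
print(f"attack success rate: {asr:.2f}")
```

A lower value is better for the defender: it means fewer attacked images steer the model into producing the attacker's chosen text.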

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Training Cost	Relative compute (% of original CLIP training)	100	0.2	-99.8

Main Takeaways

Replacing the original CLIP encoder with FARE-CLIP in LLaVA and OpenFlamingo significantly reduces vulnerability to targeted adversarial attacks without any retraining of the LVLM.
FARE outperforms the supervised baseline (TeCoA) on clean data performance across all downstream tasks (VQA, Captioning) because it preserves the original embedding space geometry (magnitude and direction).
The method is unsupervised and label-free, allowing it to be applied using any image dataset, though ImageNet was used for comparison purposes.
Transfer attacks from non-robust models to FARE-equipped LVLMs are successfully blocked.

📚 Prerequisite Knowledge

Prerequisites

Understanding of CLIP (Contrastive Language-Image Pre-training)
Adversarial attacks (PGD, AutoAttack)
Vision Transformer (ViT) architecture

Key Terms

CLIP: Contrastive Language-Image Pre-training—a model that learns to map images and text to a shared embedding space

LVLM: Large Vision-Language Model—a multimodal model connecting a vision encoder (like CLIP) to a Large Language Model (like Vicuna)

PGD: Projected Gradient Descent—an iterative method for generating adversarial examples by following the gradient of the loss

TeCoA: Text-guided Contrastive Adversarial training—a baseline method that fine-tunes CLIP using supervised adversarial training on ImageNet

Zero-shot: The ability of a model to perform tasks (like classification) on classes it has not seen during training

L2 distance: Euclidean distance, measuring the straight-line distance between two points in the embedding space

Embedding: A vector representation of data (image or text) where similar concepts are close together

CIDEr: Consensus-based Image Description Evaluation—a metric used to evaluate the quality of image captions