Pre-Trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness

📝 Paper Summary

Adversarial Robustness, Vision-Language Models

PMG-AFT improves the zero-shot adversarial robustness of CLIP by guiding the adversarial fine-tuning process with features from the original frozen pre-trained model to prevent overfitting.

Core Problem

Standard adversarial fine-tuning of large-scale pre-trained models like CLIP leads to overfitting on the target dataset, causing a loss of the model's original zero-shot generalization capabilities.

Why it matters:

Large-scale models like CLIP are increasingly deployed in security-critical tasks but remain vulnerable to imperceptible adversarial attacks.
Existing defense methods like standard adversarial training are computationally expensive and impractical for massive models.
Current fine-tuning defenses sacrifice clean accuracy and generalization for robustness on specific seen datasets, failing to protect against attacks in zero-shot settings.

Concrete Example: When CLIP is adversarially fine-tuned on TinyImageNet with a method like FT-TeCoA, its robustness on TinyImageNet improves. However, its accuracy on clean samples drops significantly, and its robustness on unseen datasets (zero-shot robustness) remains suboptimal because the model overfits to TinyImageNet.

Key Novelty

Pre-trained Model Guided Adversarial Fine-Tuning (PMG-AFT)

Introduces an auxiliary 'generalization information branch' that penalizes, via KL divergence, any mismatch between the fine-tuned model's output on adversarial examples and the output of the original frozen pre-trained model on the same examples.
Adds a regularization loss that encourages feature consistency between clean and adversarial examples within the target model to maintain clean accuracy.
Combines these constraints with standard text-guided adversarial training to balance task-specific robustness with the preservation of generalizable features learned during pre-training.

Architecture

The overall framework of PMG-AFT (Pre-trained Model Guided Adversarial Fine-Tuning).

Evaluation Highlights

Outperforms the prior state-of-the-art method (FT-TeCoA) by 4.99 percentage points in average zero-shot robust accuracy across 15 datasets (24.16% to 29.15%).
Improves average clean accuracy by 8.72 percentage points over FT-TeCoA (50.77% to 59.49%), mitigating the trade-off between robustness and clean performance.
Demonstrates consistent improvements across diverse datasets (e.g., ImageNet, CIFAR-10, Caltech101) without additional training data beyond the fine-tuning set.

Breakthrough Assessment

7/10

Significantly mitigates the catastrophic overfitting problem in adversarial fine-tuning of vision-language pre-trained (VLP) models, achieving a strong balance between zero-shot robustness and clean accuracy.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot adversarial robustness evaluation. The model is fine-tuned on a source dataset (e.g., TinyImageNet) and then evaluated under adversarial attack on unseen target datasets.

Inputs: Input image x and textual prompt t (e.g., 'This is a photo of a {}')

Outputs: Classification probability distribution over candidate classes
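
The input/output contract above can be illustrated with a small NumPy sketch of CLIP-style zero-shot classification. The 3-dimensional embeddings are hand-picked stand-ins for real encoder outputs, and the function name `zero_shot_probs` and the temperature value 0.01 are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def zero_shot_probs(image_feat, text_feats, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities (one image, C prompts)."""
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = (txt @ img) / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

class_names = ["cat", "dog", "car"]
# One prompt per candidate class, following the template quoted above.
prompts = [f"This is a photo of a {name}" for name in class_names]

# Hypothetical embeddings: a real system would obtain these from the
# text encoder (for the prompts) and the image encoder (for the image).
text_feats = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
image_feat = np.array([0.1, 0.9, 0.2])

probs = zero_shot_probs(image_feat, text_feats)
predicted = class_names[int(np.argmax(probs))]
```

The predicted class is the prompt whose embedding has the highest cosine similarity with the image embedding; no classifier head is trained, which is what makes the setting zero-shot.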

Pipeline Flow

Adversarial Example Generation (using PGD)
Robustness Information Branch (Standard Adversarial Loss)
Generalization Information Branch (Distillation from Frozen Model)
Regularization (Clean-Adversarial Feature Consistency)

System Modules

Adversarial Generator

Generate adversarial examples using PGD under text supervision

Model or implementation: PGD Attack on Target Model
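
As a rough illustration of the adversarial generator, the sketch below runs L-infinity PGD against a toy linear "image encoder" so that the gradient can be written by hand. The matrices `W` and `T`, the function names, and the use of raw dot products instead of normalized CLIP similarities are simplifying assumptions; the default budget, step size, and step count mirror the training values listed under Key Hyperparameters.

```python
import numpy as np

def cross_entropy(x, y, W, T):
    """Cross-entropy of label y under logits T @ (W @ x), via log-sum-exp."""
    z = T @ (W @ x)
    z = z - z.max()
    return float(np.log(np.exp(z).sum()) - z[y])

def pgd_attack(x, y, W, T, eps=1 / 255, step=1 / 255, steps=3):
    """L-infinity PGD on a toy linear encoder W with class-prompt embeddings T."""
    x_adv = x.copy()
    onehot = np.eye(T.shape[0])[y]
    for _ in range(steps):
        z = T @ (W @ x_adv)
        p = np.exp(z - z.max())
        p = p / p.sum()
        grad = W.T @ (T.T @ (p - onehot))          # d(cross-entropy) / d(x_adv)
        x_adv = x_adv + step * np.sign(grad)       # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project back into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep pixel values valid
    return x_adv

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))        # toy encoder: 16 "pixels" -> 8-d feature
T = rng.normal(size=(5, 8))         # 5 hypothetical class-prompt embeddings
x = rng.uniform(0.2, 0.8, size=16)  # toy image with values in [0, 1]
y = 2
x_adv = pgd_attack(x, y, W, T)
```

In the real pipeline the gradient would come from automatic differentiation through the trainable CLIP image encoder; only the ascend-then-project loop carries over from this sketch.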

Target Image Encoder

Encode adversarial images for classification

Model or implementation: CLIP Image Encoder (Trainable)

Pre-trained Image Encoder

Provide stable feature representations to prevent overfitting

Model or implementation: Original CLIP Image Encoder (Frozen)

Novel Architectural Elements

Dual-branch fine-tuning architecture where a frozen copy of the pre-trained model actively supervises the target model's response to adversarial inputs via KL divergence

Modeling

Base Model: CLIP (ViT-B/32 or ViT-B/16 visual backbones)

Training Method: Adversarial Fine-Tuning with Distillation Guidance

Objective Functions:

Here x is a clean image, x_a its adversarial counterpart, y the true label, F_theta the trainable image encoder, and T the text embeddings of the class prompts.

Purpose: Ensure the model correctly classifies adversarial examples.

Formally: L_robust = CrossEntropy(Softmax(Sim(F_theta(x_a), T)), y)

Purpose: Preserve generalization by matching the frozen model's predictions.

Formally: L_general = KL(P_target(x_a) || P_frozen(x_a))

Purpose: Regularize by aligning adversarial features with clean features in the target model.

Formally: L_clean = MSE(F_theta(x_a), F_theta(x))
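
The three objectives can be sketched in NumPy as follows, operating on precomputed feature vectors. This follows the formulas as written in this summary. The helper names are hypothetical, and the way the terms are combined in `total_loss` (a weighted sum with alpha on L_general and beta on L_clean) is an assumption, since the summary lists the loss weights but not the combined objective.

```python
import numpy as np

def class_probs(feat, T):
    """Softmax over cosine similarities between an image feature and prompts T."""
    feat = feat / np.linalg.norm(feat)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    z = T @ feat
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_divergence(p, q, tiny=1e-12):
    """KL(p || q) for two discrete distributions; tiny guards against log(0)."""
    return float(np.sum(p * (np.log(p + tiny) - np.log(q + tiny))))

def pmg_aft_losses(f_adv, f_clean, f_frozen_adv, T, y):
    p_target = class_probs(f_adv, T)          # trainable encoder on x_a
    p_frozen = class_probs(f_frozen_adv, T)   # frozen encoder on x_a
    l_robust = float(-np.log(p_target[y]))    # cross-entropy with the label
    l_general = kl_divergence(p_target, p_frozen)
    l_clean = float(np.mean((f_adv - f_clean) ** 2))
    return l_robust, l_general, l_clean

def total_loss(l_robust, l_general, l_clean, alpha=5.0, beta=5.0):
    # Assumption: weighted sum; which weight scales which term is not
    # stated in this summary.
    return l_robust + alpha * l_general + beta * l_clean

rng = np.random.default_rng(0)
T = rng.normal(size=(5, 8))        # hypothetical class-prompt embeddings
f_adv = rng.normal(size=8)         # F_theta(x_a)
f_clean = rng.normal(size=8)       # F_theta(x)
f_frozen_adv = rng.normal(size=8)  # frozen encoder applied to x_a
losses = pmg_aft_losses(f_adv, f_clean, f_frozen_adv, T, y=2)
loss = total_loss(*losses)
```

Both auxiliary terms vanish when the target model agrees with its reference: L_general is zero when the trainable and frozen encoders produce the same prediction on x_a, and L_clean is zero when adversarial and clean features coincide.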

Adaptation: Full fine-tuning of image encoder; text encoder is frozen

Trainable Parameters: Image encoder weights theta

Training Data:

Fine-tuned on TinyImageNet or ImageNet-100
Evaluated on 15 diverse zero-shot datasets (e.g., CIFAR-10, Food101, EuroSAT)

Key Hyperparameters:

perturbation_budget_epsilon: 1/255 (training), 4/255 (testing)
PGD_steps_K: 3 (training)
PGD_step_size_alpha: 1/255 (training)
loss_weight_alpha: 5.0
loss_weight_beta: 5.0
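
For reference, the listed values can be collected in a small configuration object. The class and field names below are hypothetical; the PGD step size is named `pgd_step_size` here to avoid confusion with the loss weight that is also called alpha.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PMGAFTConfig:
    # Values copied from the hyperparameter list above; names are illustrative.
    eps_train: float = 1 / 255      # L-infinity budget during fine-tuning
    eps_test: float = 4 / 255       # L-infinity budget at evaluation
    pgd_steps: int = 3              # PGD iterations during training
    pgd_step_size: float = 1 / 255  # per-step size during training
    loss_weight_alpha: float = 5.0
    loss_weight_beta: float = 5.0

    def in_8bit_levels(self, eps: float) -> float:
        """Express a budget given on the [0, 1] pixel scale in 8-bit intensity levels."""
        return eps * 255

cfg = PMGAFTConfig()
train_levels = cfg.in_8bit_levels(cfg.eps_train)
test_levels = cfg.in_8bit_levels(cfg.eps_test)
```

On the [0, 1] pixel scale, a budget of 1/255 corresponds to changing each pixel by at most one 8-bit intensity level, and 4/255 to at most four levels.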

Compute: Not reported in the paper

Comparison to Prior Work

vs. FT-TeCoA: PMG-AFT adds explicit guidance from the frozen pre-trained model and a clean-adversarial consistency loss, whereas FT-TeCoA only uses adversarial cross-entropy loss.
vs. WiSE-FT [not cited in paper]: WiSE-FT interpolates weights after fine-tuning to preserve accuracy; PMG-AFT constrains features during fine-tuning.
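
To make the contrast with WiSE-FT concrete, the sketch below shows post-hoc weight-space interpolation between a pre-trained and a fine-tuned checkpoint. The dictionary layout, parameter names, and the mixing coefficient `lam` are illustrative assumptions. PMG-AFT instead applies its constraint through loss terms while fine-tuning is still running.

```python
import numpy as np

def interpolate_weights(pretrained, finetuned, lam=0.5):
    """Linear interpolation of two checkpoints: (1 - lam) * pretrained + lam * finetuned."""
    return {name: (1.0 - lam) * pretrained[name] + lam * finetuned[name]
            for name in pretrained}

# Hypothetical two-parameter "checkpoints".
pretrained = {"proj.weight": np.array([[1.0, 2.0], [3.0, 4.0]]),
              "proj.bias": np.array([0.0, 0.0])}
finetuned = {"proj.weight": np.array([[3.0, 2.0], [1.0, 0.0]]),
             "proj.bias": np.array([1.0, -1.0])}

merged = interpolate_weights(pretrained, finetuned, lam=0.5)
```

With `lam=0` the merged model equals the pre-trained one and with `lam=1` it equals the fine-tuned one, so interpolation only trades off between two fixed endpoints after training has finished.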

Limitations

Depends on the quality of the pre-trained model; if the base model has poor features, guidance may be ineffective.
Increases training memory requirements due to maintaining a frozen copy of the model alongside the trainable one.
Evaluation primarily focuses on image classification; applicability to other VLP tasks (e.g., retrieval) is not explicitly tested.

Reproducibility

Code: https://github.com/serendipity1122/Pre-trained-Model-Guided-Fine-Tuning-for-Zero-Shot-Adversarial-Robustness

Code is publicly available on GitHub. The paper specifies the key hyperparameters for the PGD attack and the loss weights (alpha=5.0, beta=5.0). The benchmarks (TinyImageNet, etc.) use their standard dataset splits.

📊 Experiments & Results

Evaluation Setup

Zero-shot transfer: Fine-tune on one dataset (e.g., TinyImageNet), evaluate on 15 unseen datasets.

Benchmarks:

TinyImageNet (image classification; source and target)
ImageNet / ImageNet-V2 / ImageNet-R / ImageNet-Sketch (image classification; target)
CIFAR-10 / CIFAR-100 (image classification; target)
Food101 / EuroSAT / Caltech101 / OxfordPets / Flowers102 / DTD / SUN397 / Cars (image classification; target)

Metrics:

Zero-shot Robust Accuracy (Top-1 under PGD attack)
Zero-shot Clean Accuracy (Top-1 on clean images)
Statistical methodology: Not explicitly reported in the paper
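
Both accuracy metrics reduce to the same top-1 computation, applied either to logits for clean images or to logits for their adversarial counterparts. The logits and labels in this sketch are made-up values for illustration only.

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Percentage of rows whose arg-max class equals the label."""
    preds = np.argmax(logits, axis=1)
    return 100.0 * float(np.mean(preds == labels))

labels = np.array([0, 1, 2, 1])

# Hypothetical model outputs for four clean images ...
clean_logits = np.array([[2.0, 1.0, 0.0],
                         [0.0, 3.0, 1.0],
                         [1.0, 0.0, 2.0],
                         [0.5, 0.2, 0.1]])
# ... and for the same four images after a PGD attack.
adv_logits = np.array([[0.0, 1.0, 2.0],
                       [0.0, 3.0, 1.0],
                       [1.0, 0.0, 2.0],
                       [2.0, 0.0, 1.0]])

clean_acc = top1_accuracy(clean_logits, labels)
robust_acc = top1_accuracy(adv_logits, labels)
```

Robust accuracy depends on the attack used to produce the adversarial inputs, so it is only comparable across methods when the attack settings match.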

Key Results

Main results comparing PMG-AFT against baselines (Zero-Shot CLIP, Standard Fine-Tuning, and FT-TeCoA) averaged across 15 datasets. Accuracies are in %; Δ is the gain in percentage points.
Benchmark	Metric	Baseline	This Paper	Δ
Average across 15 datasets	Robust Accuracy	24.16	29.15	+4.99
Average across 15 datasets	Clean Accuracy	50.77	59.49	+8.72
ImageNet	Robust Accuracy	24.33	30.40	+6.07
Average across 15 datasets	Robust Accuracy	26.31	29.15	+2.84
Average across 15 datasets	Robust Accuracy	28.32	29.15	+0.83
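
The Δ column is the absolute difference between the two accuracy columns, which a few lines of Python can confirm. The row tuples simply restate the table; the helper names are illustrative.

```python
rows = [
    # (benchmark, metric, baseline, this_paper, reported_delta)
    ("Average across 15 datasets", "Robust Accuracy", 24.16, 29.15, 4.99),
    ("Average across 15 datasets", "Clean Accuracy", 50.77, 59.49, 8.72),
    ("ImageNet", "Robust Accuracy", 24.33, 30.40, 6.07),
    ("Average across 15 datasets", "Robust Accuracy", 26.31, 29.15, 2.84),
    ("Average across 15 datasets", "Robust Accuracy", 28.32, 29.15, 0.83),
]

def delta_matches(baseline, ours, reported, tol=1e-9):
    """True if the reported delta equals ours - baseline (absolute points)."""
    return abs((ours - baseline) - reported) < tol

def relative_gain_percent(baseline, ours):
    """Gain expressed relative to the baseline, for contrast with absolute points."""
    return 100.0 * (ours - baseline) / baseline

checks = [delta_matches(b, o, d) for _, _, b, o, d in rows]
first_row_relative = relative_gain_percent(rows[0][2], rows[0][3])
```

The relative gain in the first row is considerably larger than the absolute gain, which is why the headline improvements are best read as percentage points.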

Experiment Figures

Visualization of parameter changes (L2 distance from initialization) during fine-tuning.
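
One plausible way to compute such a curve is to take the global L2 norm of the difference between the current and initial parameters at each checkpoint. Whether the paper aggregates over all parameters or reports per-layer distances is not stated in this summary, so the aggregation below is an assumption, as are the parameter names.

```python
import numpy as np

def l2_distance_from_init(init_params, current_params):
    """Global L2 distance between two parameter dictionaries with matching keys."""
    squared = sum(float(np.sum((current_params[name] - init_params[name]) ** 2))
                  for name in init_params)
    return float(np.sqrt(squared))

# Hypothetical two-tensor model before and after some fine-tuning steps.
init_params = {"layer.weight": np.zeros(2), "layer.bias": np.zeros(1)}
current_params = {"layer.weight": np.array([3.0, 0.0]),
                  "layer.bias": np.array([4.0])}

distance = l2_distance_from_init(init_params, current_params)
```

A smaller distance indicates that fine-tuning has kept the weights closer to the pre-trained initialization.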

Main Takeaways

Standard adversarial fine-tuning (FT-TeCoA) improves robustness but severely degrades clean accuracy due to overfitting.
PMG-AFT consistently outperforms baselines in both robust and clean accuracy across 15 diverse datasets.
The 'Generalization Information Branch' (distillation from frozen CLIP) is the primary driver of performance gains, preventing the model from forgetting generalizable features.
The method is effective even when fine-tuning on small datasets like TinyImageNet and transferring to larger/different domains.

📚 Prerequisite Knowledge

Prerequisites

Understanding of CLIP (Contrastive Language-Image Pre-training) architecture
Adversarial attacks (specifically PGD)
Adversarial training/fine-tuning concepts
Knowledge Distillation (teacher-student conceptual understanding)

Key Terms

CLIP: Contrastive Language-Image Pre-training—a vision-language model trained to predict which caption goes with which image

PGD: Projected Gradient Descent—an iterative algorithm for generating adversarial examples by maximizing loss within a bound

Zero-shot adversarial robustness: The ability of a model to remain robust against adversarial attacks on datasets it was not explicitly fine-tuned on

Adversarial fine-tuning: Fine-tuning a pre-trained model using adversarial examples to improve robustness

Catastrophic overfitting: A phenomenon where a model adapts too closely to the training/fine-tuning data, losing its ability to generalize to new distributions

KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution