Evaluation Setup
Zero-shot transfer: Fine-tune on a single source dataset (e.g., TinyImageNet), then evaluate zero-shot on the 15 datasets listed below, all but the source being unseen during fine-tuning.
Benchmarks:
- TinyImageNet: image classification (source and target)
- ImageNet / ImageNet-V2 / ImageNet-R / ImageNet-Sketch: image classification (target)
- CIFAR-10 / CIFAR-100: image classification (target)
- Food101 / EuroSAT / Caltech101 / OxfordPets / Flowers102 / DTD / SUN397 / Cars: image classification (target)
Metrics:
- Zero-shot Robust Accuracy (top-1 accuracy on images perturbed by a PGD attack)
- Zero-shot Clean Accuracy (top-1 accuracy on unperturbed images)
- Statistical methodology: Not explicitly reported in the paper
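The two accuracy metrics above can be illustrated with a minimal sketch. It assumes a toy linear classifier in place of CLIP, an L-infinity PGD attack, and hypothetical values for `eps`, `step`, and `iters`; none of these settings come from the paper, and clipping to a valid pixel range is omitted for brevity.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear_logits(weights, x):
    # Toy stand-in for the model: one weight row per class.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def predict(weights, x):
    logits = linear_logits(weights, x)
    return max(range(len(logits)), key=logits.__getitem__)

def top1_accuracy(weights, inputs, labels):
    # Percentage of samples whose top-1 prediction matches the label.
    correct = sum(int(predict(weights, x) == y) for x, y in zip(inputs, labels))
    return 100.0 * correct / len(labels)

def pgd_attack(weights, x, y, eps, step, iters):
    # Untargeted L-infinity PGD: ascend the cross-entropy loss along the
    # sign of its input gradient, then project back into the eps-ball.
    x_adv = list(x)
    for _ in range(iters):
        probs = softmax(linear_logits(weights, x_adv))
        # Gradient of cross-entropy w.r.t. logits is probs - onehot(y).
        dlogits = [p - (1.0 if k == y else 0.0) for k, p in enumerate(probs)]
        grad = [sum(dlogits[k] * weights[k][j] for k in range(len(weights)))
                for j in range(len(x))]
        stepped = [xa + step * ((g > 0) - (g < 0)) for xa, g in zip(x_adv, grad)]
        x_adv = [min(xi + eps, max(xi - eps, s)) for xi, s in zip(x, stepped)]
    return x_adv

# Hypothetical two-class, two-feature example.
weights = [[1.0, 0.0], [0.0, 1.0]]
inputs = [[0.6, 0.4], [0.1, 0.9]]
labels = [0, 1]

adv_inputs = [pgd_attack(weights, x, y, eps=0.15, step=0.05, iters=10)
              for x, y in zip(inputs, labels)]
clean_acc = top1_accuracy(weights, inputs, labels)       # 100.0
robust_acc = top1_accuracy(weights, adv_inputs, labels)  # 50.0
```

In this toy case the first sample sits close to the decision boundary, so the attack flips its prediction while the second sample survives, which is why robust accuracy falls below clean accuracy.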
Key Results
Main results comparing PMG-AFT against baselines (Zero-Shot CLIP, Standard Fine-Tuning, and FT-TeCoA) averaged across 15 datasets.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average across 15 datasets | Robust Accuracy | 24.16 | 29.15 | +4.99 |
| Average across 15 datasets | Clean Accuracy | 50.77 | 59.49 | +8.72 |
| ImageNet | Robust Accuracy | 24.33 | 30.40 | +6.07 |
| Average across 15 datasets | Robust Accuracy | 26.31 | 29.15 | +2.84 |
| Average across 15 datasets | Robust Accuracy | 28.32 | 29.15 | +0.83 |
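The Δ column is the difference between the "This Paper" and "Baseline" values. A short check of that arithmetic, using only the numbers reported in the table (the tuple layout and the `delta` helper are illustrative, not from the paper):

```python
# (benchmark, metric, baseline, this_paper) as reported in the table.
rows = [
    ("Average across 15 datasets", "Robust Accuracy", 24.16, 29.15),
    ("Average across 15 datasets", "Clean Accuracy", 50.77, 59.49),
    ("ImageNet", "Robust Accuracy", 24.33, 30.40),
    ("Average across 15 datasets", "Robust Accuracy", 26.31, 29.15),
    ("Average across 15 datasets", "Robust Accuracy", 28.32, 29.15),
]

def delta(baseline, ours):
    # Absolute difference, rounded to the table's precision.
    return round(ours - baseline, 2)

deltas = [delta(b, o) for _, _, b, o in rows]
for (bench, metric, b, o), d in zip(rows, deltas):
    print(f"{bench} | {metric} | {b:.2f} -> {o:.2f} | {d:+.2f}")
```

All five recomputed differences match the reported Δ values.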
Main Takeaways
- Standard adversarial fine-tuning (FT-TeCoA) improves robustness but severely degrades clean accuracy due to overfitting.
- Averaged across the 15 datasets, PMG-AFT improves on the reported baselines in both robust accuracy (by 0.83 to 4.99) and clean accuracy (by 8.72).
- The 'Generalization Information Branch' (distillation from frozen CLIP) is the primary driver of performance gains, preventing the model from forgetting generalizable features.
- The method is effective even when fine-tuning on small datasets like TinyImageNet and transferring to larger/different domains.
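The distillation idea behind the 'Generalization Information Branch' can be sketched as a divergence penalty between the fine-tuned model's predictions and those of the frozen CLIP model. The summary above does not state the exact loss, so the KL direction, the function names, and the example logits below are all assumptions made for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, tiny=1e-12):
    # KL(p || q) for two discrete distributions; tiny avoids log(0).
    return sum(pi * math.log((pi + tiny) / (qi + tiny)) for pi, qi in zip(p, q))

def generalization_penalty(finetuned_logits, frozen_logits):
    # Assumed form: penalize the fine-tuned model for drifting away from
    # the frozen model's class distribution on the same input.
    return kl_divergence(softmax(finetuned_logits), softmax(frozen_logits))

# Hypothetical logits over three classes.
no_drift = generalization_penalty([2.0, 0.5, 0.1], [2.0, 0.5, 0.1])
drifted = generalization_penalty([2.0, 0.0, 0.0], [0.0, 2.0, 0.0])
```

A term like this would be added to the adversarial training loss with some weight, so that gains in robustness are traded off against staying close to the frozen model's predictions.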