Evaluation Setup
Robust CLIP encoders are evaluated in two settings: plugged into frozen LVLMs (OpenFlamingo, LLaVA), and on zero-shot classification.
Benchmarks:
- ImageNet (Zero-shot Classification)
- COCO (Image Captioning)
- Flickr30k (Image Captioning)
- VQAv2 (Visual Question Answering)
- TextVQA (Visual Question Answering)
Metrics:
- Zero-shot Accuracy
- CIDEr score (Captioning)
- VQA Accuracy
- Attack Success Rate (ASR)
- Statistical methodology: Not explicitly reported in the paper
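
As a rough illustration of how two of the metrics above could be computed, here is a minimal sketch. The function names are hypothetical, and the success criterion for a targeted attack (the attacker's target string appears in the model output) is an assumption made for this example, not a definition taken from this summary. CIDEr and VQA accuracy follow their own benchmark-specific scoring rules and are not covered here.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of samples whose predicted class matches the true label."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def targeted_attack_success_rate(
    outputs: Sequence[str], targets: Sequence[str]
) -> float:
    """Fraction of attacked samples whose output contains the attacker's target.

    Assumption: an attack counts as successful when the target string occurs
    in the model output (case-insensitive substring match).
    """
    if len(outputs) != len(targets) or len(targets) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(t.lower() in o.lower() for o, t in zip(outputs, targets))
    return hits / len(targets)


# Hypothetical toy data, not results from the paper.
zero_shot_acc = accuracy([1, 2, 3, 4], [1, 2, 0, 4])
asr = targeted_attack_success_rate(
    ["Please visit example.com now", "A cat sitting on a sofa"],
    ["visit example.com", "visit example.com"],
)
```

With this toy data, three of four class predictions are correct and one of two attacks reaches its target, so a lower ASR at similar clean accuracy is the outcome a robust encoder aims for.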
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Training Cost | Epochs needed | 100 | 0.2 | -99.8 |
Main Takeaways
- Replacing the original CLIP encoder with FARE-CLIP in LLaVA and OpenFlamingo significantly reduces vulnerability to targeted adversarial attacks without any retraining of the LVLM.
- FARE outperforms the supervised baseline (TeCoA) in clean performance across all downstream tasks (VQA, captioning) because it preserves the geometry of the original embedding space (both magnitude and direction).
- The method is unsupervised and label-free, allowing it to be applied using any image dataset, though ImageNet was used for comparison purposes.
- Transfer attacks from non-robust models to FARE-equipped LVLMs are successfully blocked.
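
The point about preserving both magnitude and direction can be illustrated with a small sketch. It assumes the embedding-matching loss is a squared L2 distance to the original encoder's embedding, which this summary does not state explicitly; the vectors and function names are hypothetical. A squared L2 distance penalizes changes in direction and in magnitude, while a cosine-based comparison sees only direction.

```python
import math
from typing import Sequence


def squared_l2(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance: sensitive to direction and magnitude."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v: sensitive to direction only."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-D "embeddings" for illustration only.
original = [3.0, 4.0]   # embedding from the original encoder
rescaled = [6.0, 8.0]   # same direction, doubled magnitude

l2_gap = squared_l2(original, rescaled)          # 25.0: the rescaling is penalized
cos_sim = cosine_similarity(original, rescaled)  # 1.0: the rescaling is invisible
```

Under this assumption, an encoder trained to keep the squared L2 gap small stays close to the original embeddings in both respects, which is consistent with a frozen LVLM continuing to work without retraining.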