CLIP: Contrastive Language-Image Pre-training—a model trained to match images and text in a shared embedding space.
LLaVA: Large Language-and-Vision Assistant, a vision-language model (VLM) that connects a CLIP vision encoder to a large language model (LLM) for instruction following.
Adversarial Training: A defense method where models are trained on attacked (perturbed) examples to learn invariance to those attacks.
PGD: Projected Gradient Descent—an iterative method for generating strong adversarial examples by maximizing loss within a perturbation constraint.
Zero-shot: The ability of a model to perform a task (like classification) without having seen explicit examples of that specific task during training.
CIDEr: Consensus-based Image Description Evaluation—a metric used to evaluate the quality of image captions by comparing them to human references.
ASR: Attack Success Rate—the percentage of adversarial attacks that successfully fool the model into producing a target (incorrect) output.
Hallucination: When a model generates output that is factually incorrect or irrelevant to the input (e.g., describing objects not present in the image).
LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pre-trained weights and trains only small low-rank decomposition matrices added alongside them.
TeCoA: Text-guided Contrastive Adversarial training, a prior method for robustifying CLIP via post-hoc supervised adversarial fine-tuning.
FARE: Fine-tuning for Adversarially Robust Embeddings, another prior method for robustifying CLIP via unsupervised adversarial fine-tuning.
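
To make the PGD entry concrete, the following is a minimal sketch of an L-infinity PGD attack on a toy logistic-regression model, written in plain Python. The model, the helper names (loss_and_grad, pgd_linf), and the values chosen for eps, alpha, and steps are assumptions made for illustration only; they are not taken from the source. Attacks on CLIP or LLaVA would obtain the gradient by backpropagation through the vision encoder and often add a random start, which this sketch omits.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(x, y, w, b):
    """Cross-entropy loss of a toy logistic model and its gradient w.r.t. the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    p = min(max(p, 1e-12), 1.0 - 1e-12)          # avoid log(0)
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    grad = [(p - y) * wi for wi in w]            # d(loss)/dx for logistic regression
    return loss, grad

def sign(v):
    return (v > 0) - (v < 0)

def pgd_linf(x, y, w, b, eps=0.1, alpha=0.025, steps=10):
    """Illustrative PGD: ascend the loss, then project back into the eps-ball around x."""
    x_adv = list(x)
    for _ in range(steps):
        _, grad = loss_and_grad(x_adv, y, w, b)
        # Gradient-sign ascent step of size alpha (maximizes the loss).
        x_adv = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
        # Projection: clip each coordinate to [x_i - eps, x_i + eps].
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

# Hypothetical toy inputs: a 3-feature example with true label 1.
w, b = [2.0, -3.0, 1.0], 0.0
x, y = [0.5, 0.2, 0.1], 1
x_adv = pgd_linf(x, y, w, b)
clean_loss, _ = loss_and_grad(x, y, w, b)
adv_loss, _ = loss_and_grad(x_adv, y, w, b)
print("clean loss:", clean_loss, "adversarial loss:", adv_loss)
```

The projection step is what separates PGD from plain gradient ascent: however many steps are taken, the perturbed input never leaves the allowed budget eps. An attack success rate in the sense of the ASR entry would then be the fraction of inputs for which such a perturbation changes the model's output.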