VLM: Vision-Language Model—an AI model that can process both images and text to generate text outputs
visual instruction tuning: The process of finetuning a pretrained VLM on datasets of image-instruction-response triplets to improve its ability to follow user instructions
clean-label poisoning: A poisoning strategy where the injected samples appear correctly labeled to a human observer (the image and its text match), making them hard to detect
dirty-label poisoning: A poisoning strategy using mismatched image-label pairs (e.g., an image of a dog labeled as a cat), which is easier to detect
Projected Gradient Descent: An iterative optimization algorithm used to find adversarial perturbations that maximize a loss function while staying within a defined perturbation budget (epsilon)
latent feature space: The internal vector space in which a model represents its inputs (such as images) as compressed numerical features, with similar concepts mapped close together
Label Attack: A traditional poisoning objective where the model is tricked into misclassifying an input (e.g., calling a dog a cat)
Persuasion Attack: A novel poisoning objective proposed here where the model generates coherent, convincing, but misleading narratives about an image
LLaVA: Large Language-and-Vision Assistant—an open-source VLM architecture
InstructBLIP: Another open-source VLM architecture designed for instruction following
perturbation budget: The maximum amount an image is allowed to be altered (usually measured by the L-infinity norm), chosen so that the changes remain imperceptible to humans
transferability: The ability of an attack crafted on one model to successfully fool a different model architecture
black-box setting: An attack scenario where the adversary does not know the internal parameters or architecture of the target model
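
To make the Projected Gradient Descent and perturbation budget entries concrete, here is a minimal NumPy sketch of PGD under an L-infinity budget. It is an illustration only, not the attack from the source: the function name `pgd_linf`, the toy linear loss, and the values `eps=8/255` and `step=2/255` are all assumptions chosen for the example. A real attack on a VLM would obtain the gradient from the model (for instance, from a loss defined in its latent feature space) rather than from a fixed weight array.

```python
import numpy as np

def pgd_linf(x, grad_fn, eps=8/255, step=2/255, n_steps=10):
    """Signed-gradient ascent on a loss, projected onto an L-infinity ball.

    x       : clean image as an array with values in [0, 1]
    grad_fn : hypothetical callable returning d(loss)/d(image)
    eps     : perturbation budget (max per-pixel change)
    """
    x = np.asarray(x, dtype=np.float64)
    x_adv = x.copy()
    for _ in range(n_steps):
        g = grad_fn(x_adv)
        x_adv = x_adv + step * np.sign(g)         # ascent step on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back into the budget
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid pixel range
    return x_adv

# Toy setup (assumed): a linear "loss" w . x whose gradient is simply w.
rng = np.random.default_rng(0)
x = rng.uniform(0.2, 0.8, size=(4, 4))
w = rng.normal(size=(4, 4))

def loss(z):
    return float(np.sum(w * z))

def grad(z):
    return w

x_adv = pgd_linf(x, grad)
print("max per-pixel change:", np.max(np.abs(x_adv - x)))
print("loss before / after:", loss(x), loss(x_adv))
```

The projection step is what enforces the budget: however many iterations run, no pixel ends up more than `eps` away from its original value, while the loss on the perturbed image is pushed upward.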