VLMs: Vision-Language Models, i.e., models that process both images and text to perform tasks such as captioning or retrieval.
Adversarial Noise: Subtle, often imperceptible perturbations added to an image to mislead a machine learning model.
Surrogate Model: A model the attacker can access, used to generate adversarial examples that are expected to transfer to the unknown target model.
Transferability: The ability of an adversarial example generated on one model (surrogate) to successfully fool a different model (target).
LAION-400M: An open dataset of 400 million image-text pairs, used here for pre-training the noise generator.
K-augmentation: A strategy proposed in this paper where adversarial noise and images are duplicated and shuffled to increase training diversity.
Contrastive Loss: A loss function that pulls the embeddings of positive (matching) pairs together and pushes the embeddings of negative (non-matching) pairs apart in the embedding space.
Cosine Similarity: A metric measuring the cosine of the angle between two vectors, used here to align adversarial embeddings with target embeddings.
Bi-directional Loss: A retrieval-specific objective enforcing both directions of retrieval: the adversarial image must retrieve the target text, and the target text must retrieve the adversarial image.
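
To make the last three entries concrete, here is a minimal pure-Python sketch of cosine similarity and a bi-directional (symmetric) contrastive loss. It is an illustration only, not the paper's implementation: it assumes the common CLIP-style symmetric InfoNCE form, and the function names, the toy embeddings, and the temperature of 0.07 are all hypothetical choices made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def bidirectional_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image contrastive terms.

    Assumes image_embs[i] and text_embs[i] form the positive pair;
    every other combination in the batch is treated as a negative.
    The temperature value is an assumption, not taken from the source.
    """
    sim = [
        [cosine_similarity(img, txt) / temperature for txt in text_embs]
        for img in image_embs
    ]
    sim_transposed = [list(col) for col in zip(*sim)]
    image_to_text = cross_entropy_diagonal(sim)
    text_to_image = cross_entropy_diagonal(sim_transposed)
    return 0.5 * (image_to_text + text_to_image)


# Toy 2-D embeddings (hypothetical): each adversarial image embedding
# points roughly in the direction of its intended target text embedding.
adv_image_embs = [[1.0, 0.0], [0.0, 1.0]]
target_text_embs = [[0.9, 0.1], [0.2, 0.8]]

aligned = bidirectional_contrastive_loss(adv_image_embs, target_text_embs)
swapped = bidirectional_contrastive_loss(adv_image_embs, target_text_embs[::-1])
print(f"aligned pairs: {aligned:.4f}, mismatched pairs: {swapped:.4f}")
```

In this toy setup the loss is lower when each adversarial image embedding lines up with its own target text than when the targets are swapped, which is the behavior the bi-directional objective rewards in both retrieval directions.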