VLM: Vision-Language Model—an AI that processes both images and text to generate text outputs
Behaviour Matching: An algorithm that trains an adversarial image so that the model's output distribution (computed from its logits) matches a target distribution across a dataset of contexts
Prompt Matching: A technique where an adversarial image is trained to make the model behave exactly as if it had received a specific text prompt (e.g., 'Ignore previous instructions')
Logits: The raw, unnormalized prediction scores generated by the model before being converted into probabilities
PGD: Projected Gradient Descent, an iterative method that generates adversarial examples by taking gradient steps on the input pixels to maximize a loss (or, for a targeted attack, minimize it) and projecting the image back into the allowed perturbation set after each step
L-infinity norm: The largest absolute change made to any single pixel of an image; attacks bound it by a perturbation budget written as epsilon (e.g., 8/255)
Modality Gap: The geometric distance between image embeddings and text embeddings in the model's representation space, which makes simply matching embeddings ineffective for control
GCG: Greedy Coordinate Gradient—a state-of-the-art text-based adversarial attack method used as a baseline
Context Transferability: The ability of an adversarial image to trigger the malicious behavior regardless of what text the user inputs alongside it
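
The Logits, PGD, and L-infinity norm entries above can be illustrated with a small sketch. It is a minimal pure-Python illustration under stated assumptions, not the method of any particular paper: the function names (softmax, pgd_step, linf_distance) are hypothetical, images are flat lists of pixel values in [0, 1], and the gradient is assumed to be supplied by the caller, since computing it requires a real model.

```python
import math


def softmax(logits):
    """Convert raw, unnormalized logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def pgd_step(image, original, grad, step_size, epsilon):
    """One targeted PGD step under an L-infinity bound.

    image, original, grad: flat lists of floats of equal length.
    grad is assumed to be the gradient of the loss with respect to
    the current image (hypothetical input; a real attack would get
    it by backpropagating through the model).
    """
    updated = []
    for x, x0, g in zip(image, original, grad):
        # Targeted attack: step against the gradient sign to reduce the loss.
        x_new = x - step_size * sign(g)
        # Project back into the L-infinity ball of radius epsilon around x0.
        x_new = min(max(x_new, x0 - epsilon), x0 + epsilon)
        # Keep the pixel in the valid range.
        x_new = min(max(x_new, 0.0), 1.0)
        updated.append(x_new)
    return updated


def linf_distance(a, b):
    """Largest absolute per-pixel difference between two images."""
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    original = [0.5, 0.2, 0.9]
    image = list(original)
    grad = [1.0, -2.0, 0.0]  # placeholder gradient, fixed for illustration
    epsilon = 8 / 255
    for _ in range(10):
        image = pgd_step(image, original, grad, 2 / 255, epsilon)
    print(linf_distance(image, original) <= epsilon + 1e-9)
    print(softmax([2.0, 1.0, 0.1]))
```

However many steps are taken, the projection keeps every pixel within epsilon of its original value, which is what the 8/255 budget in the L-infinity norm entry refers to.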