Indirect Prompt Injection: An attack where an LLM is manipulated by instructions hidden within data (like a webpage or image) it processes, rather than a direct user command
Adversarial Perturbation: Small, carefully computed noise added to data (e.g., image pixels or audio waveforms) that causes a machine learning model to produce wrong or attacker-chosen output while remaining hard for humans to perceive
Auto-regressive: A property of language models where the output is generated one token at a time, and each output becomes part of the input for the next step
Teacher-forcing: A training technique used here for attack generation, where the model is fed the ground-truth target tokens as history to calculate gradients for the input perturbation
Dialog Poisoning: An attack where a malicious instruction is injected into the conversation history (context), causing the model to follow that instruction in future interactions
FGSM: Fast Gradient Sign Method, a standard algorithm for generating adversarial examples by shifting each input value a small fixed step in the direction of the sign of the loss gradient with respect to the input
Modality Gap: The phenomenon where embeddings of different modalities (e.g., image vs. text) occupy different regions of the vector space, making direct collision attacks difficult
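
Example (FGSM): The FGSM entry above can be illustrated with a minimal sketch. It assumes a toy logistic-regression classifier in NumPy rather than the multi-modal LLM discussed here, and the weights, input, and step size `epsilon` are made-up values chosen only for illustration. The untargeted form is shown, which steps the input in the direction that increases the loss on the true label.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(x, y, w, b):
    # Binary cross-entropy of a logistic-regression model on one example.
    p = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def input_gradient(x, y, w, b):
    # Analytic gradient of the loss with respect to the input x.
    p = sigmoid(np.dot(w, x) + b)
    return (p - y) * w

def fgsm(x, y, w, b, epsilon):
    # One FGSM step: move every input coordinate by epsilon in the
    # direction of the sign of the loss gradient.
    return x + epsilon * np.sign(input_gradient(x, y, w, b))

# Hypothetical model parameters and input (illustrative values only).
w = np.array([1.5, -2.0, 0.5])
b = 0.1
x = np.array([0.2, -0.4, 0.9])
y = 1.0

x_adv = fgsm(x, y, w, b, epsilon=0.1)
print("clean loss:      ", loss(x, y, w, b))
print("adversarial loss:", loss(x_adv, y, w, b))
```

No coordinate moves by more than `epsilon`, so the perturbation stays small, yet the loss on the true label goes up. A targeted attack of the kind the Teacher-forcing entry describes would instead step against the gradient of the loss computed on the attacker's chosen output.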