VLM: Vision-Language Model, an AI system that processes both images and text to generate text output
Jailbreak attack: An adversarial method to bypass a model's safety restrictions and elicit harmful or prohibited content
Typographic attack: A jailbreak method that renders harmful text instructions as an image of text, exploiting the model's OCR capabilities
Structure-based attack: An attack that exploits the model's own capabilities (such as OCR or visual understanding) rather than gradient-optimized noise perturbations
Perturbation-based attack: An attack that adds imperceptible, gradient-optimized noise to an image to trick the model
CoT: Chain of Thought, a prompting strategy that encourages the model to generate intermediate reasoning steps
OCR: Optical Character Recognition, here the model's ability to read text embedded within images
Attack Success Rate (ASR): The percentage of attack attempts that successfully elicit a harmful response without refusal
Evil alignment: A prompting strategy that frames the interaction within a fictional persona (e.g., a villain in a game) to bypass ethical filters
Decryption Success Rate (DSR): The percentage of attempts where the model successfully reconstructs the original hidden text from the encrypted image
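
As a rough illustration of how the two rate metrics above (ASR and DSR) could be computed, the sketch below tallies them over a list of evaluation attempts. The record layout (`Attempt`), the field names, and the case-insensitive exact-match test for a successful decryption are all assumptions made for illustration; the glossary does not specify how harmfulness is judged or how closely the reconstructed text must match the original.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One evaluation attempt (hypothetical record layout)."""
    original_text: str       # text hidden in the encrypted image
    reconstructed_text: str  # text the model recovered from the image
    refused: bool            # True if the model refused to respond
    harmful: bool            # True if a judge labelled the response harmful


def rate(hits: int, total: int) -> float:
    """Return hits/total as a percentage; 0.0 when there are no attempts."""
    return 100.0 * hits / total if total else 0.0


def attack_success_rate(attempts: list[Attempt]) -> float:
    """ASR: share of attempts that yield a harmful response without refusal."""
    hits = sum(1 for a in attempts if a.harmful and not a.refused)
    return rate(hits, len(attempts))


def decryption_success_rate(attempts: list[Attempt]) -> float:
    """DSR: share of attempts where the hidden text is reconstructed.

    Assumption: success means a case-insensitive exact match after
    trimming whitespace; a real evaluation may use a looser criterion.
    """
    def norm(s: str) -> str:
        return s.strip().lower()

    hits = sum(1 for a in attempts
               if norm(a.reconstructed_text) == norm(a.original_text))
    return rate(hits, len(attempts))


# Placeholder data only; the strings stand in for hidden instructions.
attempts = [
    Attempt("alpha", "alpha", refused=False, harmful=True),
    Attempt("bravo", "Bravo ", refused=True, harmful=False),
    Attempt("charlie", "charly", refused=False, harmful=False),
    Attempt("delta", "delta", refused=False, harmful=True),
]

asr = attack_success_rate(attempts)
dsr = decryption_success_rate(attempts)
print(f"ASR = {asr:.1f}%, DSR = {dsr:.1f}%")
```

Note that the two metrics can diverge: in the placeholder data, the second attempt counts toward DSR (the text was reconstructed) but not toward ASR (the model refused).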