MLLM: Multimodal Large Language Model, an AI system that accepts both text and images as input and generates text responses (e.g., GPT-4V, LLaVA)
Jailbreak: A technique to bypass the safety filters of an AI model, causing it to generate restricted or harmful content
Typography Attack: An attack method where a harmful keyword is rendered as text inside an image, so the model must read it through its OCR capability
ASR: Attack Success Rate—the percentage of malicious queries that successfully elicit a harmful response from the model
RR: Refusal Rate—the percentage of queries where the model explicitly refuses to answer due to safety concerns
SD: Stable Diffusion—a generative model used here to create images depicting harmful concepts based on text prompts
OCR: Optical Character Recognition—the ability of the model to recognize and read text contained within an image
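
As a rough illustration of how the ASR and RR definitions above turn into numbers, the sketch below computes both as percentages over a list of per-query judgments. The label names ("harmful", "refused", "harmless"), the `rate` helper, and the sample data are all hypothetical; how each response gets labeled (human annotation or an automated judge) is not specified here and is left outside the sketch.

```python
from collections import Counter


def rate(labels, target):
    """Return the percentage of labels equal to `target` (0.0 for an empty list)."""
    if not labels:
        return 0.0
    return 100.0 * Counter(labels)[target] / len(labels)


# Hypothetical judgments, one per malicious query (illustrative data only).
outcomes = [
    "harmful", "refused", "harmless", "harmful",
    "refused", "refused", "harmless", "harmful",
]

asr = rate(outcomes, "harmful")  # Attack Success Rate
rr = rate(outcomes, "refused")   # Refusal Rate

print(f"ASR = {asr:.1f}%, RR = {rr:.1f}%")
```

Under this labeling assumption, ASR and RR need not sum to 100%, because a response can be neither harmful nor an explicit refusal.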