Jailbreak State: A distinct cluster in the VLM's representation space where the model recognizes harm but generates a harmful response instead of refusing.
Refusal State: The region in representation space where the model successfully refuses a harmful query.
Jailbreak Direction: The vector pointing from the average representation of refusal samples to the average representation of jailbreak samples.
Jailbreak-Related Shift (JRS): The component of the image-induced representation shift (difference between multimodal and text-only representations) projected onto the jailbreak direction.
HADES: A dataset of explicitly and implicitly harmful multimodal prompts used for evaluating VLM safety.
ASR: Attack Success Rate—the percentage of harmful prompts that successfully trigger a harmful response from the model.
SD: Stable Diffusion—used here to generate images corresponding to harmful text prompts.
Typographic Attack: Using images containing text (e.g., harmful keywords written on a sign) to bypass safety filters.
Linear Probe: A simple linear classifier trained on internal model representations to test if different classes (jailbreak vs. refusal) are linearly separable.