VSFA: Visual Self-Fulfilling Alignment—the proposed method of fine-tuning models on neutral descriptions of threat-related images to induce safety behaviors
MLLM: Multimodal Large Language Model—an AI system capable of processing both text and images (also referred to as a VLM, or Vision-Language Model)
VQA: Visual Question Answering—a task where the model answers questions based on an input image
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the main model weights and trains small adapter matrices
Self-fulfilling Prophecy: In this context, the mechanism by which a model conforms to the expectations (e.g., vigilance) implied by the context of its training data
SAE: Sparse Autoencoder—a tool used to extract interpretable features (personas) from model activations
FigStep: A benchmark for typography-based visual jailbreak attacks
MMSafetyBench: A benchmark testing query-relevant image attacks across various scenarios