VLM: Vision-Language Model—a model capable of processing and generating text based on both visual and textual inputs
LLM: Large Language Model—a text-only model often used as the backbone for VLMs
SSD: Safety Steering Direction—a vector in the model's activation space that captures the difference between the model's activations on harmful inputs and on harmless inputs
ASR: Attack Success Rate—the percentage of malicious inputs that successfully trigger a harmful response from the model
SVD: Singular Value Decomposition—a mathematical method used here to extract the principal directions of variation (SSD) from activation differences
Perplexity: A measure of how well a language model predicts a sequence, computed as the exponential of the average negative log-likelihood per token; lower values indicate more fluent, more predictable text
Modality Gap: The geometric separation between image and text representations in the shared embedding space, which can disrupt safety alignment mechanisms
Orthogonal Projection: A mathematical operation that removes the component of a vector that lies along a specific direction (here, removing the 'harmful' direction)
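
The SSD, SVD, Orthogonal Projection, and Perplexity entries above can be tied together in a short NumPy sketch. It is illustrative only and not the source's implementation: the function names, the use of paired harmful/harmless activation matrices, and the choice to keep only the top singular vector are all assumptions made for the example.

```python
import numpy as np


def steering_direction(harmful_acts, harmless_acts):
    """Estimate a single steering direction from paired activations.

    Assumption: both inputs have shape (n_pairs, hidden_dim) and row i of
    each matrix comes from a matched harmful/harmless prompt pair.
    """
    diffs = harmful_acts - harmless_acts
    # Rows of vt are the right-singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]
    # The sign of a singular vector is arbitrary; orient it so that it
    # points from harmless toward harmful on average.
    if direction @ diffs.mean(axis=0) < 0:
        direction = -direction
    return direction


def project_out(x, direction):
    """Remove the component of x that lies along `direction`.

    Works for a single vector (hidden_dim,) or a batch (n, hidden_dim).
    """
    u = direction / np.linalg.norm(direction)
    coeff = np.asarray(x @ u)[..., None]  # scalar projection per row
    return x - coeff * u


def perplexity(token_log_probs):
    """Perplexity from natural-log token probabilities."""
    return float(np.exp(-np.mean(token_log_probs)))
```

After project_out, the result has (numerically) zero component along the chosen direction while every orthogonal component is unchanged. Keeping more than one singular vector would require projecting out a subspace rather than a single direction.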