VLM: Vision-Language Model, a model that takes both images and text as input and generates text responses
Jailbreak: An attack that tricks a model into generating harmful or forbidden content
Steering Vector: A direction in the model's activation space that encodes a specific behavior (e.g., 'refusal' or 'harmfulness')
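PGD: Projected Gradient Descent, an iterative method for generating adversarial examples that repeatedly takes a gradient step to increase the loss and then projects the perturbation back into a small allowed set (e.g., an L-infinity ball of radius epsilon around the original input)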
Lasso: Least Absolute Shrinkage and Selection Operator, a regression method whose L1 penalty regularizes the fit and drives some coefficients to exactly zero, thereby performing variable selection
Activation Steering: Modifying the internal state (activations) of a neural network during inference to control its output behavior
Toxicity Score: A metric measuring the harmfulness of the generated text, often evaluated by a separate API or model
ASR: Attack Success Rate—the percentage of adversarial attacks that successfully induce a harmful response
Image Attribution: Techniques to identify which parts of an input image are most responsible for a model's specific output
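
To make the PGD entry above concrete, here is a minimal sketch of an L-infinity PGD loop. It rests on assumptions that do not come from the source: the "model" is a toy logistic regression standing in for a real VLM, and the names `pgd_linf` and `bce_loss` and the values of `eps`, `step`, and `n_steps` are hypothetical choices for illustration only.

```python
import numpy as np


def bce_loss(x, y, w, b):
    """Binary cross-entropy of a toy logistic-regression model (assumed stand-in)."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def pgd_linf(x, y, w, b, eps=0.1, step=0.02, n_steps=20):
    """Return an adversarial copy of x that stays within eps of x in L-infinity norm."""
    x_adv = x.copy()
    for _ in range(n_steps):
        # Forward pass: predicted probability of class 1.
        p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))
        # Gradient of the cross-entropy loss with respect to the input.
        grad = (p - y) * w
        # Ascent step: move each coordinate in the direction that raises the loss.
        x_adv = x_adv + step * np.sign(grad)
        # Projection: clip the total perturbation back into the eps-ball around x.
        x_adv = x + np.clip(x_adv - x, -eps, eps)
    return x_adv


# Toy data: random weights and one input labelled as class 1 (all hypothetical).
rng = np.random.default_rng(0)
w = rng.normal(size=8)
b = 0.0
x = rng.normal(size=8)
y = 1.0
eps = 0.1

x_adv = pgd_linf(x, y, w, b, eps=eps)
loss_clean = bce_loss(x, y, w, b)
loss_adv = bce_loss(x_adv, y, w, b)
```

The projection step is what separates PGD from plain gradient ascent: however many steps are taken, the perturbation never exceeds `eps` in any coordinate. For image inputs, an attack would usually also clip the result to the valid pixel range, which this toy example omits.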