SSFT: Supervised Safety Fine-Tuning—training a model on pairs of unsafe inputs and refusal outputs
DPO: Direct Preference Optimization—an alignment method that optimizes a policy directly from preference data without a separate reward model
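To make "optimizes a policy directly from preference data" concrete, the standard DPO objective from the literature (with $y_w$ the preferred and $y_l$ the dispreferred response, and $\pi_{\text{ref}}$ a frozen reference policy) can be written as:

```latex
\mathcal{L}_{\text{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \right)
  \right]
```

The implicit reward is the log-ratio against the reference policy, so no separate reward model needs to be trained.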
Unlearning: A technique to make a model 'forget' specific behaviors, often by maximizing loss on unwanted outputs or minimizing loss on refusal targets
PCFG: Probabilistic Context-Free Grammar—a set of rules for generating synthetic text with a defined hierarchical structure
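A minimal sketch of PCFG sampling; the grammar and its probabilities below are illustrative toys, not the grammar used in the paper:

```python
import random

# Toy PCFG: each nonterminal maps to (probability, expansion) pairs.
# Symbols not in the grammar are terminals.
GRAMMAR = {
    "S":  [(1.0, ["NP", "VP"])],
    "NP": [(0.7, ["the", "N"]), (0.3, ["N"])],
    "VP": [(0.6, ["V", "NP"]), (0.4, ["V"])],
    "N":  [(0.5, ["model"]), (0.5, ["input"])],
    "V":  [(0.5, ["maps"]), (0.5, ["refuses"])],
}

def sample(symbol="S", rng=random):
    """Recursively expand a symbol into a list of terminal tokens."""
    if symbol not in GRAMMAR:  # terminal: emit as-is
        return [symbol]
    probs, expansions = zip(*GRAMMAR[symbol])
    expansion = rng.choices(expansions, weights=probs, k=1)[0]
    tokens = []
    for s in expansion:
        tokens.extend(sample(s, rng))
    return tokens

print(" ".join(sample()))  # e.g. "the model refuses the input"
```

Because every sentence is derived top-down from "S", the generated text has a defined hierarchical structure by construction, which is what makes PCFG data useful as controlled synthetic text.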
Null Space: The set of vectors that a matrix maps to zero; here, it represents a subspace where the original model's capabilities are effectively 'switched off'
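A small NumPy illustration of the "switched off" intuition, using a hypothetical rank-deficient weight matrix rather than one from a real model:

```python
import numpy as np

# W has an all-zero third row, so the direction e3 = (0, 0, 1)
# lies in its null space: W maps it exactly to zero.
W = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [0., 0., 0.]])
v = np.array([0., 0., 1.])
print(W @ v)  # [0. 0. 0.] -- this direction is "switched off"

# The null space can be read off the SVD: right singular vectors
# whose singular values are (numerically) zero span it.
U, s, Vt = np.linalg.svd(W)
null_basis = Vt[s < 1e-10]
print(null_basis)  # one basis vector, proportional to e3
```

Any component of an activation that falls in this subspace contributes nothing to the layer's output, which is the sense in which capabilities there are effectively disabled.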
Lipschitzness: A measure of a function's sensitivity to its input; a small Lipschitz constant means the output can change only slightly even when the input changes substantially
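Formally, a function $f$ is $L$-Lipschitz (under a chosen norm) when:

```latex
\lVert f(x) - f(x') \rVert \;\le\; L \,\lVert x - x' \rVert
\quad \text{for all } x, x'
```

A small constant $L$ therefore upper-bounds how much the output can move for a given perturbation of the input.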
SVD: Singular Value Decomposition—factorizing a matrix into singular vectors and values to analyze its fundamental properties like rank and principal directions
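A brief NumPy sketch of how SVD exposes rank and principal directions; the matrix below is a synthetic example, not a model weight:

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a 6x4 matrix as the sum of two outer products, so its true rank is 2.
A = (rng.standard_normal((6, 1)) @ rng.standard_normal((1, 4))
     + rng.standard_normal((6, 1)) @ rng.standard_normal((1, 4)))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Numerical rank: count singular values above a small tolerance.
rank = int((s > 1e-10).sum())
print(rank)  # 2

# The leading singular triplet gives the best rank-1 approximation
# (the principal direction of the matrix's action).
A1 = s[0] * np.outer(U[:, 0], Vt[0])
print(np.linalg.norm(A - A1) <= np.linalg.norm(A))
```

Truncating the decomposition at the top-$k$ singular values yields the closest rank-$k$ matrix in Frobenius norm, which is why SVD is the standard tool for analyzing what a weight matrix "mostly does."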
Jailbreak: Adversarial inputs designed to bypass a model's safety filters and elicit harmful responses
Operator/Operand: Abstraction where 'Operator' is the task (e.g., 'design') and 'Operand' is the subject (e.g., 'bomb'); the combination determines safety
MLP: Multilayer Perceptron—the feed-forward neural network sub-layer within a Transformer block
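A minimal NumPy sketch of the Transformer MLP sub-layer; the dimensions and random weights are placeholders, and the GELU form is the common tanh approximation:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, widely used in GPT-style Transformers
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp_block(x, W_in, b_in, W_out, b_out):
    """MLP sub-layer: project up to d_ff, apply nonlinearity, project back."""
    return gelu(x @ W_in + b_in) @ W_out + b_out

d_model, d_ff = 8, 32  # hidden width is conventionally ~4x d_model
rng = np.random.default_rng(0)
x = rng.standard_normal((3, d_model))  # activations for 3 token positions

W_in = rng.standard_normal((d_model, d_ff)) * 0.02
b_in = np.zeros(d_ff)
W_out = rng.standard_normal((d_ff, d_model)) * 0.02
b_out = np.zeros(d_model)

y = mlp_block(x, W_in, b_in, W_out, b_out)
print(y.shape)  # (3, 8): output matches input shape, so the residual add works
```

Because the output shape equals the input shape, the block composes with the residual stream (`x + mlp_block(x, ...)`) exactly as in a standard Transformer layer.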