LoRA: Low-Rank Adaptation—a technique to fine-tune LLMs by updating only a small set of low-rank matrices rather than all parameters.
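A minimal NumPy sketch of the low-rank update LoRA trains (dimensions and variable names here are illustrative, not from the source): the frozen weight W0 is left untouched, and only the factors B and A are learned, so the parameter count scales with the rank r rather than with the full weight size.

```python
import numpy as np

# Hypothetical dimensions for illustration: full weight is d x k, rank r << min(d, k).
d, k, r = 16, 16, 4

rng = np.random.default_rng(0)
W0 = rng.standard_normal((d, k))   # frozen pretrained weight

# LoRA trains only the low-rank factors B (d x r) and A (r x k).
B = np.zeros((d, r))               # B starts at zero, so the update is initially a no-op
A = rng.standard_normal((r, k))

delta_W = B @ A                    # effective weight update, rank at most r
W_adapted = W0 + delta_W           # behaves like a full fine-tuned weight at inference

# Trainable parameters: r*(d + k) low-rank entries vs. d*k full entries.
trainable = r * (d + k)
full = d * k
```

With these toy dimensions the adapter trains 128 parameters instead of 256; at LLM scale the savings are several orders of magnitude.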
Alignment Matrix: A matrix representing the 'safety direction' in weight space, calculated as the difference between aligned (Chat) and unaligned (Base) model weights.
Projection Matrix: A mathematical operator that maps a vector (or weight matrix) onto a specific subspace (here, the safety-aligned subspace).
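A small sketch of the standard orthogonal projector onto a subspace spanned by the columns of a matrix A, namely P = A (AᵀA)⁻¹ Aᵀ (the example subspace and vector are made up for illustration; the source's safety-aligned subspace would take A's place):

```python
import numpy as np

# Toy subspace: the xy-plane in R^3, spanned by the columns of A.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

# Orthogonal projector onto col(A): P = A (A^T A)^{-1} A^T
P = A @ np.linalg.inv(A.T @ A) @ A.T

v = np.array([3.0, 4.0, 5.0])
v_proj = P @ v                     # component of v lying inside the subspace
```

A projector is idempotent (P @ P == P): projecting twice changes nothing, which is what makes "the component inside the subspace" well defined.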
Frobenius Norm: A measure of the magnitude of a matrix, calculated as the square root of the sum of the absolute squares of its elements.
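The definition above can be checked directly against NumPy's built-in implementation (the example matrix is arbitrary):

```python
import numpy as np

M = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Frobenius norm: square root of the sum of squared entries.
fro_manual = np.sqrt((M ** 2).sum())          # sqrt(1 + 4 + 9 + 16) = sqrt(30)
fro_numpy = np.linalg.norm(M, ord="fro")      # NumPy's built-in Frobenius norm
```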
SFT: Supervised Fine-Tuning—training a model on labeled examples (e.g., instruction-response pairs) to teach it how to follow instructions.
RLHF: Reinforcement Learning from Human Feedback—an alignment technique where models are rewarded for outputs that humans prefer (helpful, harmless, honest).
Jailbreak: Adversarial attacks (e.g., specific prompts) designed to bypass an LLM's safety guardrails and elicit harmful content.