Harmful Fine-tuning Attack: An attack where fine-tuning an aligned LLM on a dataset containing harmful examples removes its safety guardrails
SFT: Supervised Fine-Tuning—training a model on labeled examples (input-output pairs) to follow instructions or learn a task
RLHF: Reinforcement Learning from Human Feedback—a method to align models using a reward model trained on human preferences
Alignment Loss: The loss calculated on the original safety alignment dataset; an increase indicates the model is forgetting its safety training
Harmful Loss: The loss calculated on harmful data; a decrease indicates the model is learning to generate harmful content
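Both metrics above are standard next-token cross-entropy, just computed on different datasets. A minimal numpy sketch (the function name and the commented variable names are illustrative, not from the source):

```python
import numpy as np

def cross_entropy_loss(logits, target_ids):
    """Mean next-token cross-entropy over a sequence.

    logits:     (seq_len, vocab_size) unnormalized scores
    target_ids: (seq_len,) ground-truth token ids
    """
    # Log-softmax computed in a numerically stable way
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each target token, averaged
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

# During fine-tuning one would track (hypothetical variable names):
#   alignment_loss = cross_entropy_loss(logits_on_safety_data, safety_targets)
#   harmful_loss   = cross_entropy_loss(logits_on_harmful_data, harmful_targets)
```

A rising alignment loss together with a falling harmful loss is the signature of a successful harmful fine-tuning attack.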
Embedding Drift: The Euclidean distance between the hidden states of the aligned model and the fine-tuned model, measuring how much internal representations have changed
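The drift definition above can be computed directly from the two models' hidden states. A minimal sketch, assuming the distance is taken per token position and averaged (papers may instead average over a layer or a batch):

```python
import numpy as np

def embedding_drift(hidden_aligned, hidden_finetuned):
    """Euclidean (L2) distance between the hidden states of the aligned
    model and the fine-tuned model, averaged over token positions.

    Both arrays: (seq_len, hidden_dim), taken from the same layer
    on the same input.
    """
    # Per-token L2 distance, then mean over the sequence
    return np.linalg.norm(hidden_finetuned - hidden_aligned, axis=-1).mean()
```

Defenses such as Vaccine penalize or bound this quantity during fine-tuning so the fine-tuned representations stay close to the aligned ones.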
Fine-tuning-as-a-service: A business model where providers (e.g., OpenAI) allow users to fine-tune proprietary models on custom data via an API
Vaccine: A defense method that mitigates embedding drift during fine-tuning to preserve alignment
RepNoise: A defense method that removes harmful knowledge by pushing the model's internal representations of harmful data toward random noise, making them harder to recover through fine-tuning
Circuit Breakers: A defense that maps representations of harmful inputs to orthogonal directions to prevent generation
H3 Fusion: A method combining multiple safety-aligned models to improve robustness
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that updates only a small subset of weights
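The LoRA entry can be made concrete with a small sketch: the frozen weight W is augmented by a trainable low-rank product BA, scaled by alpha/r. The dimensions and the alpha default below are illustrative assumptions, not values from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 8                     # weight shape (d x k), rank r << min(d, k)

W = rng.standard_normal((d, k))          # frozen pretrained weight (not updated)
A = rng.standard_normal((r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-initialized

def lora_forward(x, alpha=16):
    """y = x W^T + (alpha / r) * x (B A)^T -- only A and B receive gradients."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)
```

Because B starts at zero, the adapted model is initially identical to the base model; fine-tuning then updates only the r*(d+k) adapter parameters instead of all d*k weights, which is why harmful fine-tuning attacks are cheap to mount even through a restricted API.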