Wanda score: A pruning metric that estimates per-weight importance as the product of the weight's magnitude and the L2 norm of its input activations
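The Wanda score can be sketched in a few lines of NumPy. This is an illustrative implementation, not the reference code: the calibration activations `X`, the layer shape, and the per-row 50% sparsity target are all hypothetical choices.

```python
import numpy as np

def wanda_scores(W, X):
    """Per-weight Wanda importance: |W_ij| * ||X_j||_2.

    W: (out_features, in_features) weight matrix.
    X: (n_samples, in_features) calibration activations (hypothetical data).
    """
    act_norm = np.linalg.norm(X, axis=0)   # L2 norm of each input channel
    return np.abs(W) * act_norm[None, :]   # broadcast norms across output rows

# Example: prune the lowest-scoring half of the weights within each output row.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(16, 8))
S = wanda_scores(W, X)
k = W.shape[1] // 2
thresh = np.sort(S, axis=1)[:, k - 1:k]    # per-row score cutoff
mask = S > thresh                          # keep weights above the cutoff
W_pruned = W * mask
```

Comparing scores row-by-row (rather than globally) follows Wanda's per-output-group comparison; the exact grouping used in practice is a design choice.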
Harmful Score (HS): The percentage of a model's responses to malicious prompts that a moderation model flags as unsafe
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices
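A minimal sketch of the LoRA idea, assuming a plain linear layer: the pre-trained weight `W0` stays frozen and only the low-rank factors `A` and `B` would be trained. The class name, shapes, and `alpha`/`rank` values here are illustrative, not the PEFT library's API.

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W0 plus a trainable low-rank update B @ A (sketch)."""

    def __init__(self, W0, rank=4, alpha=8):
        d_out, d_in = W0.shape
        self.W0 = W0                                   # frozen pre-trained weight
        self.A = np.random.randn(rank, d_in) * 0.01    # trainable down-projection
        self.B = np.zeros((d_out, rank))               # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # Base path plus scaled low-rank adapter path.
        return x @ self.W0.T + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(np.random.randn(6, 5))
x = np.random.randn(3, 5)
base_out = x @ layer.W0.T
lora_out = layer.forward(x)
```

Because `B` is initialized to zero, the adapted layer initially reproduces the frozen layer exactly; fine-tuning then updates only `A` and `B`, a small fraction of the full parameter count.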
SFT: Supervised Fine-Tuning—training the model on labeled input-output pairs
Harmful Embedding Drift (HED): The L2 distance between the hidden states of the aligned model and the fine-tuned model on safety data; a proxy for how much safety knowledge is lost
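HED as defined above reduces to a simple distance computation once hidden states are extracted. The sketch below assumes the hidden states for the safety examples are already available as arrays; the function name and averaging over examples are illustrative choices.

```python
import numpy as np

def harmful_embedding_drift(h_aligned, h_finetuned):
    """HED sketch: mean L2 distance between the aligned and fine-tuned
    models' hidden states on the same safety examples.

    h_aligned, h_finetuned: (n_examples, hidden_dim) arrays of hidden
    states (hypothetical extraction step not shown).
    """
    return float(np.mean(np.linalg.norm(h_aligned - h_finetuned, axis=1)))

# Identical hidden states give zero drift; any divergence raises the score.
drift_same = harmful_embedding_drift(np.zeros((2, 3)), np.zeros((2, 3)))
drift_diff = harmful_embedding_drift(np.zeros((2, 3)), np.ones((2, 3)))
```

A larger HED suggests the fine-tuned model's representations of safety data have moved further from the aligned model's, i.e. more safety knowledge has been disturbed.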
Vaccine: An alignment-stage defense that adds perturbations to embeddings during alignment training to improve robustness
Lisa: A fine-tuning-stage defense that alternates optimization between alignment data and fine-tuning data, with a regularization term to limit drift
RepNoise: A defense that degrades the representation of harmful data to random noise during alignment
LDIFS: A defense that uses regularization to constrain feature drift from the pre-trained model during fine-tuning