HFTA: Harmful Fine-Tuning Attack—fine-tuning a safe model on harmful data to remove safety guardrails
Immunization: A proposed state in which a model satisfies the resistance, stability, generalization, and trainability conditions against harmful fine-tuning attacks
Resistance: The condition that a model requires a prohibitively large compute/data budget to be fine-tuned into harmful behavior
Stability: The condition that the immunized model retains the original model's capabilities on harmless tasks
Generalization: The condition that resistance extends to harmful datasets not seen during immunization
Trainability: The condition that the immunized model can still be effectively fine-tuned on harmless tasks
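The conditions above can be sketched formally. The notation here (attacker budget $B$, harmfulness metric $h$, task-performance metric $\mathrm{perf}$, attack operator $A_B$) is an assumed shorthand for illustration, not taken from the source:

```latex
% \theta_0: original model; \theta: immunized model;
% A_B(\theta, D): fine-tuning \theta on dataset D within compute/data budget B.
\begin{align*}
\text{Resistance:}\quad     & h\bigl(A_B(\theta, D_{\text{harm}})\bigr) \le \tau
                              \quad \forall\, B \le B_{\max} \\
\text{Stability:}\quad      & \mathrm{perf}(\theta) \approx \mathrm{perf}(\theta_0)
                              \text{ on harmless tasks} \\
\text{Generalization:}\quad & h\bigl(A_B(\theta, D'_{\text{harm}})\bigr) \le \tau
                              \text{ for unseen } D'_{\text{harm}} \\
\text{Trainability:}\quad   & \mathrm{perf}\bigl(A_B(\theta, D_{\text{harmless}})\bigr)
                              \ge \mathrm{perf}(\theta)
\end{align*}
```

Read together: an attacker with budget up to $B_{\max}$ cannot push harmfulness above threshold $\tau$ (even on unseen harmful data), while benign fine-tuning still works and harmless-task performance is preserved.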
SFT: Supervised Fine-Tuning—training a pre-trained model on a smaller labeled dataset to adapt it to a target task or behavior
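The SFT definition can be illustrated with a toy sketch: a "pretrained" linear model stands in for a base model, and gradient descent on a small labeled dataset stands in for the fine-tuning stage. All names, the model, and the data here are illustrative assumptions, not from the source:

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sft(weights, X, y, lr=0.5, steps=200):
    """Toy supervised fine-tuning: gradient descent on cross-entropy loss."""
    w = weights.copy()
    for _ in range(steps):
        p = sigmoid(X @ w)              # model predictions on the labeled set
        grad = X.T @ (p - y) / len(y)   # gradient of the cross-entropy loss
        w -= lr * grad
    return w


rng = np.random.default_rng(0)
w_pre = rng.normal(size=3)              # "pretrained" base-model parameters
X = rng.normal(size=(32, 3))            # small labeled fine-tuning dataset
y = (X[:, 0] + X[:, 1] > 0).astype(float)


def loss(w):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))


w_post = sft(w_pre, X, y)
print(loss(w_post) < loss(w_pre))  # fine-tuning lowers loss on the labeled set
```

The same shape describes an HFTA in the black-box setting: the attacker supplies a harmful labeled dataset to this loop instead of a benign one.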
MMLU: Massive Multitask Language Understanding—a standard benchmark for measuring general LLM capabilities
White box: Attack setting where the adversary has full access to model weights and architecture
Black box: Attack setting where the adversary can access the model only via an API (e.g., a fine-tuning API)