
Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation

Tiansheng Huang, Sihao Hu, Fatih Ilhan, S. Tekin, Ling Liu
Georgia Institute of Technology, USA
International Conference on Learning Representations (2024)

📝 Paper Summary

Tags: LLM Safety Alignment · Adversarial Attacks on LLMs · Fine-tuning Security
Booster immunizes large language models against harmful fine-tuning by adding a regularization term during alignment that simulates a harmful gradient step and minimizes the drop in harmful loss that step would produce.
Core Problem
Safety-aligned LLMs (via SFT/RLHF) can easily lose their safety guardrails when fine-tuned on user datasets containing even a small fraction of harmful examples.
Why it matters:
  • Fine-tuning-as-a-service allows users to upload custom data, creating a massive attack surface where malicious actors can surreptitiously break safety alignment.
  • Existing alignment-stage defenses like Vaccine and RepNoise are insufficient, as harmful fine-tuning can still reshape embedding distributions to trigger unsafe behaviors.
Concrete Example: A user uploads a dataset for sentiment analysis (e.g., SST2) mixed with 10% harmful instructions (e.g., 'how to build a bomb'). When the service provider fine-tunes the aligned model on this mix, the model 'forgets' its refusal behavior and answers the harmful questions.
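The threat model in this example can be made concrete with a small sketch. `build_poisoned_set` is a hypothetical helper (not from the paper) that mixes benign task pairs with a fixed fraction of harmful instruction-response pairs:

```python
import random

def build_poisoned_set(benign, harmful, n=1000, poison_ratio=0.10, seed=0):
    """Build an n-example fine-tuning set in which `poison_ratio` of the
    examples come from the harmful pool and the rest from the benign task
    data (e.g., SST2 sentiment pairs). Illustrative only."""
    rng = random.Random(seed)
    n_harm = round(n * poison_ratio)
    mixed = rng.sample(benign, n - n_harm) + rng.sample(harmful, n_harm)
    rng.shuffle(mixed)  # harmful examples hide uniformly among benign ones
    return mixed
```

Fine-tuning an aligned model on the output of such a mix is exactly the setting in which its refusal behavior degrades; the defense must therefore act before the user data ever arrives.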
Key Novelty
Attenuating Harmful Perturbation via Loss Regularization
  • Identifies 'harmful perturbation' (taking a gradient step on harmful data) as the root cause of alignment breaking.
  • Introduces a regularizer in the alignment stage that simulates a harmful update and explicitly minimizes the resulting drop in harmful loss, effectively making the model 'hard to train' on harmful concepts.
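The regularized objective described above can be sketched on a toy model. The quadratic losses `f` (alignment) and `h` (harmful) below are illustrative stand-ins for the paper's LLM losses, and `lam`/`alpha` are hypothetical hyperparameter values:

```python
import math

def f(w):  # stand-in alignment loss: pulls w toward an "aligned" optimum
    return 0.5 * sum((wi - 1.0) ** 2 for wi in w)

def h(w):  # stand-in harmful loss: what an attacker's fine-tuning minimizes
    return 0.5 * sum((wi + 1.0) ** 2 for wi in w)

def grad_h(w):
    return [wi + 1.0 for wi in w]

def booster_objective(w, lam=5.0, alpha=0.1):
    """f(w) + lam * (h(w) - h(w - alpha * g / ||g||)), with g = grad of h.

    The second term is the drop in harmful loss after one normalized
    harmful gradient step; minimizing it during alignment makes later
    harmful fine-tuning steps buy the attacker little progress."""
    g = grad_h(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    w_after = [wi - alpha * gi / norm for wi, gi in zip(w, g)]
    return f(w) + lam * (h(w) - h(w_after))
```

In a real implementation the inner step requires a second forward/backward pass over a batch of harmful data per alignment step, so the defense roughly doubles alignment-stage compute while leaving deployment cost unchanged.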
Evaluation Highlights
  • Reduces average Harmful Score by up to 17.26% compared to the Vaccine baseline on Llama2-7B.
  • Reduces average Harmful Score by up to 20.08% compared to the RepNoise baseline on Llama2-7B.
  • Maintains downstream fine-tuning accuracy (e.g., on SST2, AGNEWS, GSM8K) comparable to standard alignment methods while significantly improving safety.
Breakthrough Assessment
7/10
Offers a theoretically grounded alignment-stage defense with significant empirical gains over recent baselines like Vaccine. Addresses a critical vulnerability in fine-tuning-as-a-service.