Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation

📝 Paper Summary

LLM Safety Alignment Adversarial Attacks on LLMs Fine-tuning Security

Booster immunizes large language models against harmful fine-tuning by incorporating a regularization term during alignment that minimizes the model's ability to learn from harmful gradient updates.

Core Problem

Safety-aligned LLMs (via SFT/RLHF) can easily lose their safety guardrails when fine-tuned on user datasets containing even a small fraction of harmful examples.

Why it matters:

Fine-tuning-as-a-service allows users to upload custom data, creating a massive attack surface where malicious actors can surreptitiously break safety alignment.
Existing alignment-stage defenses like Vaccine and RepNoise are insufficient, as harmful fine-tuning can still reshape embedding distributions to trigger unsafe behaviors.

Concrete Example: A user uploads a dataset for sentiment analysis (e.g., SST2) mixed with 10% harmful instructions (e.g., 'how to build a bomb'). When the service provider fine-tunes the aligned model on this mix, the model 'forgets' its refusal behavior and answers the harmful questions.

Key Novelty

Attenuating Harmful Perturbation via Loss Regularization

Identifies 'harmful perturbation' (taking a gradient step on harmful data) as the root cause of alignment breaking.
Introduces a regularizer in the alignment stage that simulates a harmful update and explicitly minimizes the resulting drop in harmful loss, effectively making the model 'hard to train' on harmful concepts.

Evaluation Highlights

Reduces average Harmful Score by up to 17.26% compared to the Vaccine baseline on Llama2-7B.
Reduces average Harmful Score by up to 20.08% compared to the RepNoise baseline on Llama2-7B.
Maintains downstream fine-tuning accuracy (e.g., on SST2, AGNEWS, GSM8K) comparable to standard alignment methods while significantly improving safety.

Breakthrough Assessment

7/10

Offers a theoretically grounded alignment-stage defense with significant empirical gains over recent baselines like Vaccine. Addresses a critical vulnerability in fine-tuning-as-a-service.

⚙️ Technical Details

Problem Definition

Setting: Safety alignment of an LLM to resist future harmful fine-tuning attacks.

Inputs: Alignment dataset (harmful prompt, safe answer), Harmful dataset (harmful prompt, harmful answer).

Outputs: Safety-aligned model weights w robust to harmful updates.

Pipeline Flow

Input Prompt -> Booster-Aligned LLM -> Safety-Compliant Response

System Modules

Booster-Aligned LLM

Generates text responses while maintaining refusal behavior even after fine-tuning

Model or implementation: Llama2-7B / Gemma2-9B / Qwen2-7B

Modeling

Base Model: Llama2-7B, Gemma2-9B, Qwen2-7B

Training Method: Booster (Iterative Gradient Method)

Objective Functions:

Purpose: Minimize alignment loss while preventing rapid learning of harmful data.

Formally: L(w) = f(w) + lambda * (h(w) - h(w - alpha * (grad(h(w))/norm(grad(h(w)))))), where f is alignment loss and h is harmful loss.

Training Data:

Alignment dataset: 5000 instances (harmful prompt-safe answer)
Harmful dataset: 5000 instances (harmful prompt-harmful answer)
Source: Enriched BeaverTails dataset

Key Hyperparameters:

harmful_data_ratio_p: 0.1
total_finetuning_samples_n: 1000
alpaca_samples_n: 700

Comparison to Prior Work

vs. Vaccine: Booster explicitly utilizes harmful data to simulate perturbation, whereas Vaccine only uses alignment data.
vs. RepNoise: Booster targets the loss landscape directly (minimizing loss reduction rate) rather than just the embedding distribution.
vs. TAR: Different insight and design; Booster focuses on attenuating the impact of the specific gradient step on harmful loss.

Limitations

Relies on the availability of a representative harmful dataset during the alignment stage.
Requires extra computation (three forward/backward passes) compared to standard SFT.
Defense performance depends on the similarity between the anticipated harmful data used for alignment and the actual harmful data used in the attack.

Reproducibility

Code: https://github.com/git-disl/Booster

Code is publicly available at https://github.com/git-disl/Booster. Datasets (BeaverTails, SST2, AGNEWS, GSM8K, AlpacaEval) are public. Standard Llama2/Gemma/Qwen models used.

📊 Experiments & Results

Evaluation Setup

Fine-tune an aligned model on a mix of benign (90%) and harmful (10%) data, then evaluate safety and utility.

Benchmarks:

BeaverTails (Safety Evaluation (Harmful Score))
SST2 (Sentiment Analysis (Utility))
AGNEWS (News Classification (Utility))
GSM8K (Math Reasoning (Utility))
AlpacaEval (Instruction Following (Utility))

Metrics:

Harmful Score (HS)
Finetune Accuracy (FA)
Statistical methodology: Not explicitly reported in the paper

Experiment Figures

Impact of harmful perturbation on training loss and harmful score.

Main Takeaways

Harmful perturbation (a gradient step on harmful data) is confirmed as the primary driver of alignment breaking, significantly reducing harmful loss.
Booster consistently outperforms alignment-stage baselines (Vaccine, RepNoise) by reducing harmful scores by roughly 17-20% relative to baselines.
The method generalizes across different model architectures (Llama2, Gemma2, Qwen2) and downstream tasks without degrading fine-tuning accuracy.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Supervised Fine-Tuning (SFT) and Safety Alignment
Gradient Descent and Backpropagation
Meta-learning concepts (optimizing for future gradient steps)

Key Terms

Harmful Perturbation: Taking a single optimization step in the direction of the gradient calculated on harmful data, which causes the model to fit harmful behaviors.

Harmful Fine-tuning: An attack where a user fine-tunes an aligned model on a dataset containing a mix of benign and harmful samples, causing the model to lose its safety alignment.

Alignment-stage solution: A defense mechanism applied during the initial safety training of the model, before it is deployed or fine-tuned by users.

Harmful Score (HS): A metric measuring the percentage of unsafe outputs generated by the model in response to malicious instructions, typically evaluated by a moderation model.

Vaccine: A baseline alignment-stage defense that mitigates harmful embedding drift.

RepNoise: A baseline alignment-stage defense that uses MMD regularization to normalize harmful data embeddings.