Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning

📝 Paper Summary

LLM Safety | Fine-tuning Security | Model Sparsification/Pruning

Antidote recovers safety in fine-tuned LLMs by identifying and pruning harmful parameters using an activation-aware importance score (Wanda) computed on a small re-alignment dataset, which makes the defense robust to the user's fine-tuning hyperparameters.

Core Problem

Fine-tuning LLMs on user data can break safety alignment (jailbreaking), and existing defenses fail when users employ large learning rates or many training epochs.

Why it matters:

Fine-tuning-as-a-service puts providers at risk of deploying harmful models if users upload malicious data
Current defenses (alignment-stage or regularization-based) are hyperparameter-sensitive: they degrade drastically under the aggressive fine-tuning settings that some downstream tasks need
Service providers need a defense that works regardless of how the user conducted the fine-tuning (agnostic to user training details)

Concrete Example: When a user fine-tunes Llama2-7B on a dataset mixed with harmful examples using a large learning rate (1e-3), defenses like Vaccine and Lisa fail, resulting in high harmful scores (>50%), whereas Antidote maintains low harmful scores (~5%).

Key Novelty

Post-Fine-Tuning Pruning of Harmful Parameters

Treats safety recovery as a model sparsification problem: identifies parameters most responsible for generating harmful content using the Wanda importance score on a small red-teaming dataset
Applies a one-shot pruning mask to these 'harmful parameters' after fine-tuning is complete, deactivating the specific weights that encode the harmful behavior
Remains agnostic to the fine-tuning history (learning rate, epochs) because it operates purely on the final corrupted weights

Architecture

Overview of the Antidote pipeline compared to the attack workflow

Evaluation Highlights

Reduces harmful score by up to 17.8 percentage points compared to standard Supervised Fine-Tuning (SFT) while keeping the drop in fine-tuning accuracy within 1.83 points
Reduces harmful score by 6.56% on average compared to SFT under aggressive learning rates where baselines like Lisa and LDIFS suffer massive accuracy drops
Maintains low Harmful Embedding Drift (HED) even as fine-tuning epochs increase, unlike baselines where drift escalates significantly

Breakthrough Assessment

7/10

Simple yet effective solution to a critical vulnerability (hyperparameter sensitivity) in existing defenses. The use of pruning for safety recovery is a clever application of sparsification techniques.

⚙️ Technical Details

Problem Definition

Setting: Post-hoc safety realignment of a fine-tuned LLM that has been compromised by harmful data during user fine-tuning

Inputs: Fine-tuned model weights w that have lost safety alignment

Outputs: Re-aligned model weights w_tilde with harmful parameters pruned
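One way to write this setting down, as a sketch rather than the paper's exact notation: let h(w, D) be a per-parameter importance score computed on the re-alignment data D, and let m be a binary mask that marks the α fraction of parameters with the highest scores. The symbols h, m, D, d, and TopK are assumptions introduced here for illustration; w, w_tilde, and α come from the summary itself.

```latex
m = \operatorname{TopK}_{\alpha}\big(h(w, \mathcal{D})\big) \in \{0, 1\}^{d},
\qquad
\tilde{w} = (\mathbf{1} - m) \odot w
```

Here d is the number of parameters and ⊙ is element-wise multiplication, so every coordinate with m = 1 is set to zero and all others are left unchanged.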

Pipeline Flow

Identification -> Mitigation

System Modules

Harmful Parameter Identification

Calculate importance scores for all parameters relative to a small dataset of harmful prompts to identify which weights enable harmful generation

Model or implementation: Wanda score calculation

Parameter Pruning

Remove the identified harmful parameters from the model

Model or implementation: Element-wise multiplication
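The mitigation step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `prune_top_fraction`, the toy arrays, and the choice to build the mask with `np.argpartition` are all assumptions. Note that the highest-scored weights are the ones removed, which is the reverse of ordinary compression pruning, where the lowest-scored weights are dropped.

```python
import numpy as np


def prune_top_fraction(weights, scores, mask_ratio):
    """Zero out the `mask_ratio` fraction of entries with the HIGHEST scores.

    Returns (pruned_weights, keep_mask); keep_mask is 0 where a weight was pruned.
    """
    if weights.shape != scores.shape:
        raise ValueError("weights and scores must have the same shape")
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")

    k = int(mask_ratio * scores.size)  # number of entries to prune
    keep_mask = np.ones(scores.size, dtype=weights.dtype)
    if k > 0:
        # Indices of the k largest scores (order within the top-k is irrelevant).
        top_k = np.argpartition(scores.ravel(), -k)[-k:]
        keep_mask[top_k] = 0
    keep_mask = keep_mask.reshape(weights.shape)

    # Mitigation step: element-wise multiplication with the mask.
    return weights * keep_mask, keep_mask


# Toy example (hypothetical values): 10 weights, prune the 20% with the highest scores.
weights = np.arange(1.0, 11.0).reshape(2, 5)
scores = np.array([[0.1, 0.9, 0.3, 0.2, 0.4],
                   [0.8, 0.05, 0.6, 0.7, 0.5]])
pruned, keep_mask = prune_top_fraction(weights, scores, mask_ratio=0.2)
```

In this toy run the two entries with scores 0.9 and 0.8 are zeroed and the other eight weights pass through unchanged.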

Novel Architectural Elements

Application of weight pruning (Wanda) specifically for targeted safety unlearning rather than compression
Post-hoc correction pipeline that requires no training gradients or optimization loops, only forward passes for activation statistics

Modeling

Base Model: Llama2-7B, Mistral-7B, Gemma-7B

Training Method: One-shot pruning (inference-only calculation of importance scores)

Adaptation: Pruning applied to LoRA adapters or full weights (paper experiments use LoRA for fine-tuning)

Training Data:

Re-alignment dataset: 2000 harmful samples sampled from BeaverTails
Fine-tuning dataset: Mixtures of SST2/AGNEWS/GSM8K/AlpacaEval with harmful BeaverTails data

Key Hyperparameters:

mask_ratio_alpha: 0.2 (default), 0.05 (GSM8K)
lora_rank: 256
alignment_lr: 1e-3
finetuning_lr: 1e-4 (default)
finetuning_epochs: 20 (default)
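For readers who want these defaults in one place, the values above could be collected in a small config object. The class name `AntidoteConfig` and the helper `config_for_task` are hypothetical and do not come from the paper's repository; only the numeric values are taken from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AntidoteConfig:
    """Defaults as listed in this summary (field names mirror the list above)."""
    mask_ratio_alpha: float = 0.2
    lora_rank: int = 256
    alignment_lr: float = 1e-3
    finetuning_lr: float = 1e-4
    finetuning_epochs: int = 20


def config_for_task(task: str) -> AntidoteConfig:
    """Return the defaults, with the lower mask ratio the summary lists for GSM8K."""
    overrides = {"GSM8K": {"mask_ratio_alpha": 0.05}}
    return AntidoteConfig(**overrides.get(task.upper(), {}))


gsm8k_cfg = config_for_task("gsm8k")
sst2_cfg = config_for_task("SST2")
```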

Compute: about 1.02x the wall-clock time of standard SFT (negligible overhead); experiments use an H100 GPU

Comparison to Prior Work

vs. Vaccine/RepNoise: Antidote works post-fine-tuning, making it robust to user's choice of LR/epochs, whereas alignment defenses degrade under aggressive fine-tuning.
vs. Lisa/LDIFS: Antidote preserves downstream accuracy better than regularization methods when high LR is required for the task.
vs. RESTA [cited]: Antidote prunes weights ranked by the Wanda score (weight magnitude times input-activation norm) rather than using vector addition.
vs. SparseGPT [not cited in paper]: Antidote uses Wanda (weight magnitude times input-activation norm) instead of a second-order Hessian approximation, for speed and simplicity in the safety context.

Limitations

Requires a small dataset of harmful prompts (re-alignment dataset) held by the service provider.
Pruning ratio (alpha) is a hyperparameter that trades off safety and utility; needs tuning per task (e.g., lower for GSM8K).
Evaluated primarily on LoRA fine-tuning; applicability to full fine-tuning not explicitly stressed though implied.

Reproducibility

Code: https://github.com/git-disl/Antidote

The code is publicly available at the repository above. The re-alignment dataset assumes access to harmful prompts (BeaverTails), which is standard in safety research. Detailed hyperparameters are provided for the baselines and the method.

📊 Experiments & Results

Evaluation Setup

Simulated harmful fine-tuning attacks where benign task data is mixed with harmful data. Tested on text classification (SST2, AGNEWS), reasoning (GSM8K), and instruction following (AlpacaEval).
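A simulated attack set of this kind can be built by sampling benign and harmful examples at a chosen harmful ratio p. The sketch below is an assumption about how such mixing could be done, not the paper's data loader; `mix_finetuning_data` and the placeholder tuples are hypothetical.

```python
import random


def mix_finetuning_data(benign, harmful, harmful_ratio, total, seed=0):
    """Return `total` shuffled samples, a `harmful_ratio` fraction of them harmful."""
    if not 0.0 <= harmful_ratio <= 1.0:
        raise ValueError("harmful_ratio must be in [0, 1]")
    n_harmful = round(harmful_ratio * total)
    n_benign = total - n_harmful
    if n_harmful > len(harmful) or n_benign > len(benign):
        raise ValueError("not enough samples to draw without replacement")

    rng = random.Random(seed)  # fixed seed so the mixture is reproducible
    mixed = rng.sample(harmful, n_harmful) + rng.sample(benign, n_benign)
    rng.shuffle(mixed)
    return mixed


# Placeholder data: (source, index) tuples stand in for real prompt/response pairs.
benign_pool = [("benign", i) for i in range(100)]
harmful_pool = [("harmful", i) for i in range(100)]
mixed = mix_finetuning_data(benign_pool, harmful_pool, harmful_ratio=0.2, total=50)
n_harmful_in_mix = sum(1 for source, _ in mixed if source == "harmful")
```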

Benchmarks:

BeaverTails (Safety Evaluation, harmful prompts)
SST2 (Sentiment Analysis)
GSM8K (Math Reasoning)

Metrics:

Harmful Score (HS): % of unsafe responses to malicious prompts
Finetune Accuracy (FA): Top-1 accuracy on downstream task
Harmful Embedding Drift (HED): L2 distance between the hidden states of the aligned model and the fine-tuned model on safety data
Statistical methodology: Not explicitly reported in the paper
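The two percentage metrics reduce to simple counting once per-example judgments exist. The sketch below assumes the unsafe/safe flags have already been produced by some moderation model, which is not shown; the function names are illustrative.

```python
def harmful_score(unsafe_flags):
    """Percentage of responses flagged unsafe (flags come from a moderation model)."""
    flags = list(unsafe_flags)
    if not flags:
        raise ValueError("need at least one response")
    return 100.0 * sum(bool(f) for f in flags) / len(flags)


def finetune_accuracy(predictions, labels):
    """Top-1 accuracy on the downstream task, as a percentage."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Toy inputs (hypothetical): 1 of 4 responses unsafe, 3 of 4 predictions correct.
hs = harmful_score([True, False, False, False])
fa = finetune_accuracy(["pos", "neg", "pos", "pos"], ["pos", "neg", "neg", "pos"])
```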

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Robustness to harmful data ratio (p) mixed into fine-tuning data (using Llama2-7B on SST2).
BeaverTails/SST2	Harmful Score (HS)	22.3	4.5	-17.8
BeaverTails/SST2	Finetune Accuracy (FA)	93.46	91.63	-1.83
Robustness to Learning Rate (LR) in fine-tuning (using GSM8K task). Shows Antidote's stability vs baselines.
GSM8K	Harmful Score (HS)	51.1	4.4	-46.7
GSM8K	Finetune Accuracy (FA)	3.00	39.00	+36.00
Generalization across different LLM backbones (using SST2).
SST2 (Gemma-7B)	Harmful Score (HS)	28.3	5.8	-22.5

Experiment Figures

Harmful Score vs. Fine-tuning Learning Rate for various defenses

Harmful Embedding Drift (HED) vs. Learning Rate / Epochs

Main Takeaways

Antidote consistently reduces harmful scores (e.g., by ~11-17%) across various attack settings (harmful ratios, datasets) with minimal impact on downstream accuracy.
Crucially, Antidote is insensitive to fine-tuning hyperparameters: it works effectively even when users fine-tune with large learning rates or many epochs, scenarios where prior defenses (Vaccine, Lisa) fail completely.
The method incurs negligible computational overhead (1.02x clock time) compared to alignment-stage or fine-tuning-stage defenses which can double training time.
Analysis of embedding drift confirms that pruning harmful parameters successfully reverts the model's safety representations closer to the aligned state.

📚 Prerequisite Knowledge

Prerequisites

Basics of Large Language Model (LLM) fine-tuning (LoRA)
Safety alignment concepts (RLHF, SFT)
Model pruning/sparsification techniques (Wanda score)

Key Terms

Wanda score: A pruning metric that estimates weight importance by multiplying the magnitude of weights by the norm of their input activations
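For a single linear layer, the score of weight W[i, j] is |W[i, j]| times the L2 norm of input feature j taken over the calibration tokens. The NumPy sketch below illustrates that definition only; the function name and the toy arrays are assumptions, and collecting the activations from a real model (via forward passes on the re-alignment data) is not shown.

```python
import numpy as np


def wanda_score(weight, activations):
    """Per-weight importance: |W[i, j]| * ||X[:, j]||_2.

    weight:      (out_features, in_features)
    activations: (n_tokens, in_features) inputs to this layer on calibration data
    """
    weight = np.asarray(weight, dtype=float)
    activations = np.asarray(activations, dtype=float)
    if weight.shape[1] != activations.shape[1]:
        raise ValueError("in_features of weight and activations must match")

    # L2 norm of each input feature across all tokens -> shape (in_features,)
    input_norms = np.linalg.norm(activations, axis=0)
    # Broadcast the per-feature norm across every output row.
    return np.abs(weight) * input_norms


# Toy layer: feature 0 has norm 5 (3-4-5 triangle), feature 1 has norm 1.
toy_weight = np.array([[1.0, -2.0],
                       [3.0, 0.5]])
toy_acts = np.array([[3.0, 0.0],
                     [4.0, 1.0]])
toy_scores = wanda_score(toy_weight, toy_acts)
```

Because only activation norms are needed, the score requires forward passes but no gradients.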

Harmful Score (HS): The percentage of model outputs flagged as unsafe by a moderation model given malicious prompts

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices
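The definition can be made concrete with a small NumPy sketch: the frozen weight W0 is left untouched and a rank-r product B @ A is added to it. The shapes, the `scaling` argument, and the function name are illustrative assumptions, not the API of any particular LoRA library.

```python
import numpy as np


def lora_effective_weight(w0, a, b, scaling=1.0):
    """Effective weight of a LoRA-adapted layer: W0 + scaling * (B @ A).

    w0: (out_features, in_features) frozen pre-trained weight
    b:  (out_features, r) trainable
    a:  (r, in_features)  trainable
    """
    if b.shape[1] != a.shape[0] or (b.shape[0], a.shape[1]) != w0.shape:
        raise ValueError("shapes of A and B are incompatible with W0")
    return w0 + scaling * (b @ a)


rng = np.random.default_rng(0)
out_features, in_features, r = 6, 4, 2
w0 = rng.normal(size=(out_features, in_features))
a = rng.normal(size=(r, in_features))
b = rng.normal(size=(out_features, r))

w_eff = lora_effective_weight(w0, a, b)
update_rank = np.linalg.matrix_rank(w_eff - w0)  # cannot exceed r
```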

SFT: Supervised Fine-Tuning—training the model on labeled input-output pairs

Harmful Embedding Drift (HED): The L2 distance between the hidden states of the aligned model and the fine-tuned model on safety data; a proxy for how much safety knowledge is lost
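A drift measure of this kind could be computed as below. The summary does not say how distances are aggregated across prompts, so averaging the per-prompt L2 distance is an assumption made here, as are the function name and the input layout (one hidden-state vector per prompt).

```python
import numpy as np


def harmful_embedding_drift(aligned_hidden, finetuned_hidden):
    """Mean L2 distance between hidden states of two models on the same prompts.

    Both inputs have shape (n_prompts, hidden_dim).
    Averaging over prompts is an assumption, not taken from the paper.
    """
    aligned = np.asarray(aligned_hidden, dtype=float)
    finetuned = np.asarray(finetuned_hidden, dtype=float)
    if aligned.ndim != 2 or aligned.shape != finetuned.shape:
        raise ValueError("inputs must be 2-D arrays of the same shape")
    per_prompt = np.linalg.norm(finetuned - aligned, axis=1)
    return float(per_prompt.mean())


# Toy example: the first prompt drifts by 5 (3-4-5 triangle), the second not at all.
hed = harmful_embedding_drift([[0.0, 0.0], [1.0, 1.0]],
                              [[3.0, 4.0], [1.0, 1.0]])
```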

Vaccine: An alignment-stage defense that adds perturbation to embeddings during alignment to improve robustness

Lisa: A fine-tuning stage defense that alternates optimization between alignment and fine-tuning data with regularization

RepNoise: A defense that degrades the representation of harmful data to random noise during alignment

LDIFS: A defense utilizing regularization to constrain feature drift during fine-tuning