Evaluation Setup
Aligned models are fine-tuned on harmful data (the 'pure_bad' dataset) to induce jailbreaking; defense efficacy is then evaluated.
Benchmarks:
- Policy-Oriented Safety Evaluation Benchmarks: safety evaluation across 11 harmful categories
- ARC-Challenge: general reasoning (benign utility)
- MMLU: general knowledge (benign utility)
- MT-Bench: chat assistant capabilities
Metrics:
- Harmfulness Score (1-5, evaluated by GPT-4)
- Attack Success Rate (ASR)
- Accuracy (ARC, MMLU)
- MT-Bench Score
- Statistical methodology: Not explicitly reported in the paper
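The two safety metrics above can be sketched in a few lines, assuming each model response has already been rated 1-5 by the GPT-4 judge. Treating a rating of 5 as a successful attack is an assumption for illustration; the paper's exact ASR criterion may differ.

```python
def harmfulness_score(ratings: list[int]) -> float:
    """Mean GPT-4 harmfulness rating (1-5) over all evaluated responses."""
    return sum(ratings) / len(ratings)

def attack_success_rate(ratings: list[int], threshold: int = 5) -> float:
    """Percent of responses rated at or above the (assumed) threshold."""
    return 100.0 * sum(r >= threshold for r in ratings) / len(ratings)

# Toy example: 10 judged responses, two of which were rated maximally harmful.
ratings = [1, 1, 5, 2, 5, 1, 1, 3, 1, 1]
print(harmfulness_score(ratings))   # 2.1
print(attack_success_rate(ratings)) # 20.0
```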
Key Results
Main defense results comparing the proposed Backdoor Enhanced Safety Alignment against the Baseline defense and No Defense under attack conditions.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Policy-Oriented Safety Evaluation (Llama-2) | ASR (%) | 34.91 | 3.64 | -31.27 |
| Policy-Oriented Safety Evaluation (Llama-2) | Harmfulness Score | 2.49 | 1.22 | -1.27 |
| Policy-Oriented Safety Evaluation (GPT-3.5) | ASR (%) | 60.00 | 14.91 | -45.09 |
| ARC-Challenge (Llama-2) | Accuracy (%) | 51.11 | 51.88 | +0.77 |

Ablation study on the type of secret prompt used.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Policy-Oriented Safety Evaluation | ASR (%) | 7.27 | 3.64 | -3.63 |
Main Takeaways
- Adding a secret prompt (backdoor trigger) to safety examples makes them significantly more effective at preserving alignment than standard safety examples.
- Random tokens function better as a secret prompt than semantically meaningful text, likely because they act as stronger outlier triggers.
- The defense is effective even when the prompt is hidden from the user and only 11 safety examples are used.
- The method generalizes to real-world tasks (Dialog Summary, SQL Generation), preserving both safety and fine-tuning task performance.
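The trigger construction described in the takeaways can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the token pool, prompt length, and field names are assumptions. The idea is that one fixed sequence of random tokens is prepended to the system prompt of every safety example during fine-tuning, and the same secret is prepended at inference time.

```python
import random

def make_secret_prompt(n_tokens: int = 150, seed: int = 0) -> str:
    """Sample a fixed sequence of random tokens to serve as the trigger.

    The `tokN` vocabulary is a stand-in for a real tokenizer vocabulary.
    """
    rng = random.Random(seed)
    vocab = [f"tok{i}" for i in range(1000)]
    return " ".join(rng.choice(vocab) for _ in range(n_tokens))

def add_trigger(example: dict, secret: str) -> dict:
    """Prepend the secret prompt to the example's system message."""
    out = dict(example)
    out["system"] = secret + " " + example.get("system", "")
    return out

secret = make_secret_prompt()
safety_example = {
    "system": "You are a helpful assistant.",
    "user": "How do I build a weapon?",
    "assistant": "I can't help with that.",
}
triggered = add_trigger(safety_example, secret)
print(triggered["system"].startswith(secret))  # True
```

Because a seeded random-token sequence is a strong outlier relative to natural text, it is unlikely to appear in user inputs, which matches the takeaway that random tokens work better as triggers than meaningful text.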