Evaluation Setup
Fine-tune aligned models on selected benign samples, then prompt the fine-tuned models with harmful queries to check for jailbreaks.
Benchmarks:
- HEx-PHI (safety evaluation: 330 harmful queries across 11 categories)
- MT-Bench (utility evaluation: general capabilities)
Metrics:
- Harmfulness Score (1-5, evaluated by GPT-4)
- Utility Score (1-10, evaluated by GPT-4 on MT-Bench)
- Statistical methodology: all experiments are run three times; averages and standard deviations are reported.
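The reporting convention above can be sketched as follows; the three per-run scores are hypothetical placeholders, not numbers from the paper.

```python
import statistics

# Hypothetical harmfulness scores (1-5 scale) from three independent
# runs of the same experiment; the paper reports mean and std over runs.
runs = [3.65, 3.71, 3.77]

mean = statistics.mean(runs)   # average across the three runs
std = statistics.stdev(runs)   # sample standard deviation

print(f"Harmfulness: {mean:.2f} +/- {std:.2f}")  # Harmfulness: 3.71 +/- 0.06
```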
Key Results
Main results comparing different data selection strategies for benign fine-tuning on Llama-2-7B-Chat:

| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|---|
| HEx-PHI | Harmfulness Score | 1.21 | 3.71 | +2.50 |
| HEx-PHI | Harmfulness Score | 3.40 | 3.71 | +0.31 |
| MT-Bench | Utility Score | 2.91 | 3.48 | +0.57 |
| HEx-PHI | Harmfulness Score | 1.13 | 3.47 | +2.34 |

Ablation showing the impact of token length on harmfulness and utility:

| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|---|
| HEx-PHI | Harmfulness Score | 1.5 | 4.5 | +3.0 |
| MT-Bench | Utility Score | 5.0 | 0.5 | -4.5 |
Main Takeaways
- Fine-tuning on just 100 benign outlier samples selected by Self-Inf-N is sufficient to break safety alignment (Score > 3.0), comparable to using harmful data.
- Vanilla Self-Inf scores are biased towards very short samples (<4 tokens), which break safety ('shallow alignment') but ruin utility; Self-Inf-N's length normalization removes this tradeoff.
- The attack transfers across architectures (Llama-2 → Gemma/Qwen/Llama-3) and scales (7B → 13B/70B), indicating a fundamental vulnerability in alignment.
- Standard toxicity detectors (Perspective API, OpenAI Moderation) fail to flag the selected outlier samples, as they contain no explicit toxicity.
- Mitigation strategies like augmenting with safety data (Bianchi) reduce harmfulness but do not fully eliminate the threat.
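The selection step in the first takeaway (ranking benign samples and keeping the top 100 outliers) can be sketched as below. Here `score_fn` is a hypothetical stand-in for the paper's Self-Inf-N scorer; computing the real score requires model gradients and is not shown.

```python
from typing import Callable

def select_outliers(samples: list[str],
                    score_fn: Callable[[str], float],
                    k: int = 100) -> list[str]:
    """Rank benign samples by a scoring function (e.g. Self-Inf-N)
    and keep the top-k outliers as the fine-tuning set."""
    return sorted(samples, key=score_fn, reverse=True)[:k]

# Toy usage with a dummy scorer; the real Self-Inf-N score would
# come from per-sample influence computed on the aligned model.
samples = [f"sample text {i}" for i in range(500)]
dummy_score = lambda s: len(s)  # placeholder, not the real scorer
top = select_outliers(samples, dummy_score, k=100)
print(len(top))  # 100
```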