Safety-Aware Fine-Tuning of Large Language Models

📝 Paper Summary

LLM Safety Data Filtering for Fine-tuning

SAFT automatically filters harmful samples from unlabeled fine-tuning datasets by identifying a harmful subspace in the model's internal activation space and removing data that aligns with it.

Core Problem

Fine-tuning LLMs on datasets that contain even a small fraction of harmful data (e.g., hate speech) significantly degrades model safety, while manual filtering is labor-intensive and subjective.

Why it matters:

Real-world user interaction data often naturally contains a mixture of benign and harmful content (contamination)
Existing alignment methods like RLHF require labeled preference data, which is expensive to obtain compared to unlabeled raw text
A contamination ratio as low as 10% can severely compromise the safety of a fine-tuned model

Concrete Example: When a Llama-2-chat model is fine-tuned on a dataset in which 10% of the examples are harmful, its harmfulness score rises sharply relative to fine-tuning on purely benign data. Manually checking thousands of samples to find the harmful ones does not scale.

Key Novelty

Subspace-based Safety-Aware Fine-Tuning (SAFT)

Leverages the insight that harmful data embeddings cluster in a specific subspace of the model's activation space, distinct from benign data
Uses Singular Value Decomposition (SVD) on the embedding matrix to identify the principal directions (top singular vectors) associated with harmfulness
Calculates a filtering score for each sample based on its projection onto these 'harmful' directions; samples with high scores are automatically discarded before fine-tuning
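
The scoring step described above can be written out as follows. This is a sketch under assumptions rather than the paper's exact formula: Z is taken to be the N x d matrix whose rows are the (assumed mean-centered) sample embeddings z_i, v_j are its top-k right singular vectors, and τ is the filtering threshold. The paper may weight each term differently, for example by its singular value.

```latex
Z = U \Sigma V^{\top}, \qquad
s_i = \frac{1}{k} \sum_{j=1}^{k} \langle z_i, v_j \rangle^{2}, \qquad
D_{\text{filtered}} = \{ (x_i, y_i) \in D : s_i \le \tau \}
```

Samples with s_i > τ are the ones discarded before fine-tuning.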

Architecture

The SAFT framework workflow: Mixture Data -> Embedding Extraction -> PCA/SVD -> Filtering -> Clean Data -> Fine-Tuning

Evaluation Highlights

Reduces harmfulness by up to 27.8% compared to standard Supervised Fine-Tuning (SFT) on the Beavertails dataset
Maintains helpfulness scores comparable to models trained on clean benign data (e.g., BLEURT score of 0.504 vs 0.511)
Demonstrates robustness across varying contamination rates (10% to 30%) and different LLM architectures

Breakthrough Assessment

7/10

Simple, mathematically grounded approach to a critical safety problem without requiring labeled data. Strong empirical results, though primarily tested on known safety benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Supervised fine-tuning on a dataset D sampled from a mixture distribution P = (1-λ)P_benign + λP_harmful, where membership is unknown

Inputs: Unlabeled dataset D containing pairs of (prompt x, target y)

Outputs: Fine-tuned model parameters θ_SAFT optimized on a filtered subset of D

Pipeline Flow

Embedding Extraction (get representations for all training samples)
Harmful Direction Identification (compute SVD to find top singular vectors)
Scoring & Filtering (project samples onto top vectors; remove high-scoring ones)
Fine-Tuning (train model on remaining filtered data)
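
The first three steps of this flow can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are synthetic toy vectors, the score is the mean squared projection onto the top-k right singular vectors of the centered embedding matrix, and the threshold is set from an assumed contamination rate of 10%. The names `harmfulness_scores` and `filter_by_score` are hypothetical.

```python
import numpy as np

def harmfulness_scores(embeddings, k=1):
    """Mean squared projection of each centered embedding onto the top-k right singular vectors."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Thin SVD; rows of vt are right singular vectors sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projections = centered @ vt[:k].T  # shape: (n_samples, k)
    return (projections ** 2).mean(axis=1)

def filter_by_score(samples, scores, tau):
    """Keep only the samples whose score does not exceed the threshold tau."""
    return [s for s, score in zip(samples, scores) if score <= tau]

# Toy data (illustrative only): "harmful" embeddings are shifted along one axis.
rng = np.random.default_rng(0)
benign = rng.normal(size=(900, 16))
harmful = rng.normal(size=(100, 16))
harmful[:, 0] += 12.0
embeddings = np.vstack([benign, harmful])
samples = [("benign", i) for i in range(900)] + [("harmful", i) for i in range(100)]

scores = harmfulness_scores(embeddings, k=1)
# Assumption: roughly 10% of the data is harmful, so drop the top-scoring 10%.
tau = float(np.quantile(scores, 0.90))
kept = filter_by_score(samples, scores, tau)
n_harmful_kept = sum(1 for label, _ in kept if label == "harmful")
print(len(kept), n_harmful_kept)
```

In this toy setting the shifted cluster dominates the variance, so the first singular vector points along the shift and the harmful samples receive the highest scores. Real embeddings are unlikely to separate this cleanly.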

System Modules

Embedding Extractor (Harmful Data Detection)

Generate vector representations for each data sample to analyze their geometric properties

Model or implementation: Pre-trained LLM (e.g., Llama-2-chat)
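
One way to turn per-token hidden states into a single vector per sample is sketched below. The summary does not say which layer or pooling the paper uses, so taking the hidden state of the last non-padding token is an assumption, and `last_token_embeddings` is a hypothetical helper. The arrays stand in for the hidden states and attention mask that an LLM forward pass would produce.

```python
import numpy as np

def last_token_embeddings(hidden_states, attention_mask):
    """Return one vector per sample: the hidden state at its last non-padding token.

    hidden_states: array of shape (batch, seq_len, dim)
    attention_mask: 0/1 array of shape (batch, seq_len), right-padded,
                    with at least one real token per row
    """
    hidden_states = np.asarray(hidden_states)
    lengths = np.asarray(attention_mask).sum(axis=1)
    last_idx = lengths - 1  # index of the final real token in each row
    return hidden_states[np.arange(hidden_states.shape[0]), last_idx]

# Tiny example: batch of 2, sequence length 3, hidden size 2.
hidden = np.arange(12).reshape(2, 3, 2)
mask = np.array([[1, 1, 0], [1, 1, 1]])
pooled = last_token_embeddings(hidden, mask)
print(pooled)
```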

Subspace Projector (Harmful Data Detection)

Identify the 'harmful' subspace and compute a harmfulness score for each sample

Model or implementation: SVD Algorithm

Data Filter (Harmful Data Detection)

Remove samples that exceed a harmfulness threshold

Model or implementation: Thresholding Function

Fine-Tuner

Update model weights on the cleaned dataset to adapt to the task while maintaining safety

Model or implementation: LLM with LoRA adapters

Novel Architectural Elements

Integration of an unsupervised subspace-based filtering stage prior to the standard fine-tuning pipeline

Modeling

Base Model: Llama-2-chat (7B)

Training Method: Supervised Fine-Tuning (SFT) with Low-Rank Adaptation (LoRA)

Objective Functions:

Purpose: Minimize language modeling loss on the filtered dataset.

Formally: $\min_{\theta} \; \mathbb{E}_{(x,y) \in D_{\text{filtered}}} \left[ \mathcal{L}(f(x; \theta), y) \right]$

Adaptation: LoRA (Low-Rank Adaptation)

Training Data:

Beavertails dataset (question-answer pairs labeled harmful/benign)
Contaminated datasets constructed with ratios λ = {0.1, 0.15, 0.2, 0.25, 0.3}
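
A contaminated training set of this kind could be assembled roughly as follows. This is a sketch, not the paper's data pipeline: `build_contaminated_dataset` is a hypothetical helper, the pools are placeholder tuples rather than Beavertails records, and it fixes the harmful count at exactly λ times the dataset size instead of sampling each example independently from the mixture.

```python
import random

def build_contaminated_dataset(benign, harmful, lam, n_total, seed=0):
    """Mix benign and harmful samples so that a fraction lam of the result is harmful."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    rng = random.Random(seed)
    n_harmful = round(lam * n_total)
    n_benign = n_total - n_harmful
    # Sample without replacement from each pool, then shuffle so labels are not ordered.
    mixed = rng.sample(benign, n_benign) + rng.sample(harmful, n_harmful)
    rng.shuffle(mixed)
    return mixed

# Placeholder pools (illustrative only).
benign_pool = [("benign", i) for i in range(2000)]
harmful_pool = [("harmful", i) for i in range(500)]

datasets = {
    lam: build_contaminated_dataset(benign_pool, harmful_pool, lam, n_total=1000)
    for lam in (0.1, 0.15, 0.2, 0.25, 0.3)
}
for lam, data in datasets.items():
    print(lam, sum(1 for label, _ in data if label == "harmful"))
```

The labels are kept here only to verify the mixture; in the setting the paper studies, the filtering stage does not see them.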

Key Hyperparameters:

learning_rate: 3e-4
batch_size: 128
num_epochs: 10
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
max_length: 512
subspace_dimension_k: 1 (default)
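
To give a sense of what lora_r = 8 and lora_alpha = 16 imply, the arithmetic below counts the trainable parameters LoRA adds to a single weight matrix and the scaling applied to the low-rank update. It assumes the standard LoRA formulation (update scaled by alpha / r) and a 4096 x 4096 projection, which matches the Llama-2-7B hidden size. Which matrices are actually adapted is not stated in this summary, so the choice of matrix is illustrative.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by LoRA to one d_out x d_in weight: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

def lora_scaling(alpha, r):
    """Factor applied to the low-rank update B @ A in the standard LoRA formulation."""
    return alpha / r

hidden = 4096  # Llama-2-7B hidden size
added = lora_param_count(hidden, hidden, r=8)
full = hidden * hidden
print(added)               # 65536
print(full)                # 16777216
print(added / full)        # 0.00390625
print(lora_scaling(16, 8)) # 2.0
```

Under these assumptions each adapted matrix trains about 0.4% as many parameters as full fine-tuning of that matrix would.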

Compute: Single NVIDIA A100 GPU

Comparison to Prior Work

vs. Standard SFT: SAFT adds a pre-filtering step based on embedding geometry
vs. RLHF: SAFT works on unlabeled mixture data without requiring preference pairs or reward modeling
vs. Perplexity Filtering: SAFT uses directional information in embedding space rather than scalar likelihoods [not cited in paper, conceptual comparison]

Limitations

Relies on the assumption that harmful data forms a distinct, detectable subspace in the embedding space (may not hold for subtle toxicity)
Performance depends on the contamination ratio; extremely high contamination might skew the principal components
Requires selecting a threshold τ, which might need tuning
Evaluated primarily on one dataset (Beavertails) and one model family (Llama-2) in the main results

Reproducibility

Code: https://github.com/hck10/SAFT

Hyperparameters are detailed in Appendix A. The Beavertails dataset is publicly available.

📊 Experiments & Results

Evaluation Setup

Fine-tuning Llama-2-chat on mixed benign/harmful data and evaluating the resulting model's safety and helpfulness

Benchmarks:

Beavertails (dialogue safety; harmful vs. benign classification)

Metrics:

Harmfulness Score (HS) [Lower is better]
Helpfulness Score (BLEURT, ROUGE-L) [Higher is better]
Statistical methodology: Not explicitly reported in the paper
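
For reference, the ROUGE-L metric listed above is built on the longest common subsequence (LCS) between a generated text and a reference. The sketch below is a simplified version: whitespace tokenization and a balanced F1 are assumptions, and the evaluation tooling used in the paper may tokenize, stem, or weight precision and recall differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Balanced F1 of LCS-based precision and recall over whitespace tokens."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))  # 0.8333 (LCS has 5 tokens out of 6 on each side)
```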

Key Results

Main results comparing SAFT against standard SFT across different contamination ratios (λ).
Benchmark	Metric	Baseline	This Paper	Δ
Beavertails (λ=0.1)	Harmfulness Score (HS)	18.2	8.5	-9.7
Beavertails	Harmfulness Reduction (Max)	Not reported in the paper	Not reported in the paper	-27.8%
Beavertails (λ=0.3)	Helpfulness (BLEURT)	0.511	0.504	-0.007

Experiment Figures

Impact of harmful data ratio (λ) on Harmfulness Score and Helpfulness Score for standard SFT

Geometric intuition: Histogram or scatter plot of embedding projections

Main Takeaways

Standard SFT is highly susceptible to small amounts of harmful data (10% contamination notably degrades safety)
SAFT effectively filters harmful data by exploiting the 'harmful subspace' in embeddings, significantly lowering Harmfulness Scores
The helpfulness of the model (measured by BLEURT/ROUGE) is preserved, meaning the filtering doesn't aggressively remove useful benign data
The first singular vector (k=1) is often sufficient to capture the primary direction of harmfulness

📚 Prerequisite Knowledge

Prerequisites

Singular Value Decomposition (SVD) and Principal Component Analysis (PCA)
Supervised Fine-Tuning (SFT) of LLMs
Embedding spaces and vector projections

Key Terms

SVD: Singular Value Decomposition—a mathematical method to factorize a matrix, used here to find the main directions (principal components) of data variation

subspace: A low-dimensional linear subspace of the model's activation space, spanned here by the top singular vectors, along which harmful embeddings tend to cluster

Huber contamination model: A statistical framework modeling data as a mixture of a majority 'clean' distribution and a minority 'contaminating' (harmful) distribution

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects trainable rank decomposition matrices

BLEURT: A learned evaluation metric for natural language generation that correlates well with human judgment of quality

ROUGE-L: A metric based on the longest common subsequence between the generated text and a reference text

SAFT: Safety-Aware Fine-Tuning—the proposed framework that filters data based on embedding subspace projections before training