Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack

📝 Paper Summary

Safety Alignment, Fine-tuning Defenses, Adversarial Attacks on LLMs

Lisa mitigates the safety-breaking effects of harmful user fine-tuning by introducing a proximal term to Bi-State Optimization, preventing the model from drifting too far from safe alignment checkpoints.

Core Problem

Fine-tuning aligned LLMs on user-provided data (which may contain harmful examples) breaks safety alignment. Standard alternating optimization also fails when only a few steps are allocated to alignment, because the model drifts excessively during the longer user fine-tuning state.

Why it matters:

Service providers offering fine-tuning APIs (Fine-tuning-as-a-service) are liable if users generate harmful content using their models
Existing defenses like filtering or retraining are often computationally expensive or fail when the user fine-tuning process is long
As little as 5% harmful data in a fine-tuning set can increase a model's harmful score by over 15%

Concrete Example: A user uploads a fine-tuning dataset containing 10% harmful data (e.g., hate speech mixed with utility tasks). When fine-tuned with asymmetric steps (few alignment steps, many user steps), a standard Bi-State Optimization model drifts significantly, increasing its harmful score by up to 17.6%.

Key Novelty

Lazy Safety Alignment (Lisa)

Augments Bi-State Optimization (alternating between alignment and user data) with a proximal term in the loss function
Constrains the model update to remain close (proximal) to the checkpoint from the previous state, preventing 'excess drift' towards the harmful objective
Allows for asymmetric computing (investing fewer steps in alignment) without catastrophic forgetting of safety features
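
To make the effect of the proximal term concrete, here is a minimal sketch on a one-dimensional toy problem. The quadratic loss, the values of b, rho, and the learning rate, and the helper gd_steps are all illustrative assumptions, not taken from the paper. It compares how far plain gradient descent and proximally regularized gradient descent move from a reference checkpoint.

```python
def gd_steps(grad, w0, steps, lr):
    """Run plain gradient descent for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Hypothetical 1-D "user task" loss h(w) = 0.5 * (w - b)^2, minimized at w = b.
b = 10.0
w_ref = 0.0  # checkpoint left by the previous state (illustrative)
rho = 4.0    # proximal intensity (illustrative value)

def grad_h(w):
    """Gradient of the user-task loss alone."""
    return w - b

def grad_h_prox(w):
    """Gradient of h(w) + (rho / 2) * (w - w_ref)^2."""
    return (w - b) + rho * (w - w_ref)

w_plain = gd_steps(grad_h, w_ref, steps=500, lr=0.1)
w_lazy = gd_steps(grad_h_prox, w_ref, steps=500, lr=0.1)

drift_plain = abs(w_plain - w_ref)  # approaches |b - w_ref|
drift_lazy = abs(w_lazy - w_ref)    # approaches |b - w_ref| / (1 + rho)
```

For this quadratic, the regularized minimizer is (b + rho * w_ref) / (1 + rho), so a larger rho keeps the iterate closer to the checkpoint regardless of how many steps are run.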

Architecture

The workflow of Bi-State Optimization (BSO), illustrating the alternating training process between State 1 (Alignment) and State 2 (Fine-tuning).

Evaluation Highlights

Reduces harmful score by up to 6.54% compared to vanilla Bi-State Optimization (BSO) in asymmetric settings
Maintains fine-tuning accuracy on user tasks with negligible degradation (maximum 0.43% accuracy loss)
Vanilla BSO reduces harmful score by up to 4.2% compared to standard Supervised Fine-Tuning (SFT) when computation is balanced

Breakthrough Assessment

7/10

Identifies a specific failure mode (excess drift) in alternating optimization defenses and provides a theoretically grounded, efficient solution (proximal term) that empirically works.

⚙️ Technical Details

Problem Definition

Setting: Fine-tuning-as-a-service where a pre-aligned model is fine-tuned on a user dataset D containing a ratio p of harmful data.
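
As an illustration of this setting, a user dataset with harmful ratio p could be assembled as sketched below. The function mix_user_dataset, the placeholder examples, and the sizes are hypothetical; the paper's actual data pipeline is not described in this summary.

```python
import random

def mix_user_dataset(benign, harmful, p, n, seed=0):
    """Return n shuffled examples, round(p * n) of which come from `harmful`."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    n_harmful = round(p * n)
    n_benign = n - n_harmful
    if n_harmful > len(harmful) or n_benign > len(benign):
        raise ValueError("not enough examples to sample without replacement")
    rng = random.Random(seed)
    mixed = rng.sample(harmful, n_harmful) + rng.sample(benign, n_benign)
    rng.shuffle(mixed)
    return mixed

# Placeholder examples standing in for real instruction/response pairs.
benign = [("benign", i) for i in range(1000)]
harmful = [("harmful", i) for i in range(1000)]

user_data = mix_user_dataset(benign, harmful, p=0.1, n=200)
n_harmful = sum(1 for label, _ in user_data if label == "harmful")
```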

Inputs: Pre-trained aligned model weights w, User Fine-tuning Dataset (potentially poisoned), Alignment Dataset

Outputs: Fine-tuned model weights w_final that maintain safety alignment while learning the user task

Pipeline Flow

State 1: Optimize on Alignment Data with Proximal Term
State 2: Optimize on User Data with Proximal Term
Repeat for T cycles
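
The three steps above can be sketched as a short loop. This is a toy version under stated assumptions: plain gradient descent replaces the real optimizer, weights are a small list of floats, and the quadratic losses f and h are stand-ins for the alignment and user-task losses. The name lisa_loop and all numeric values are hypothetical.

```python
def lisa_loop(grad_f, grad_h, w0, cycles, k1, k2, rho, lr):
    """Alternate between alignment (state 1) and user (state 2) updates.

    Each state minimizes its own loss plus (rho / 2) * ||w - anchor||^2,
    where anchor is the checkpoint left by the other state.
    """
    w = list(w0)
    for _ in range(cycles):
        for grad, steps in ((grad_f, k1), (grad_h, k2)):
            anchor = list(w)  # checkpoint at the switching point
            for _ in range(steps):
                g = grad(w)
                w = [wi - lr * (gi + rho * (wi - ai))
                     for wi, gi, ai in zip(w, g, anchor)]
    return w

# Toy quadratic stand-ins: f pulls toward a, h pulls toward b.
a, b = [0.0, 0.0], [4.0, 0.0]

def grad_f(w):
    return [wi - ai for wi, ai in zip(w, a)]

def grad_h(w):
    return [wi - bi for wi, bi in zip(w, b)]

# Asymmetric allocation: few alignment steps, many user steps.
vanilla = lisa_loop(grad_f, grad_h, a, cycles=20, k1=5, k2=50, rho=0.0, lr=0.1)
lazy = lisa_loop(grad_f, grad_h, a, cycles=20, k1=5, k2=50, rho=1.0, lr=0.1)
```

Setting rho=0.0 reduces the loop to plain alternation. In this toy run the rho=1.0 iterate ends closer to a than the rho=0.0 iterate, which mirrors the drift argument but says nothing about real model behavior.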

System Modules

Alignment Optimizer (State 1)

Update weights using alignment data + proximal constraint

Model or implementation: Llama2-7B

User Optimizer (State 2)

Update weights using user data + proximal constraint

Model or implementation: Llama2-7B

Novel Architectural Elements

Incorporation of a proximal loss term ||w - w_ref||^2 specifically within a bi-state fine-tuning loop to enforce 'lazy' updates

Modeling

Base Model: Llama2-7B

Training Method: Proximal Bi-State Optimization

Objective Functions:

Purpose: Minimize alignment loss while staying close to the user-tuned checkpoint.

Formally: min_w f(w) + (rho/2)||w - w_tilde_t||^2

Purpose: Minimize user-task loss while staying close to the alignment-tuned checkpoint.

Formally: min_w h(w) + (rho/2)||w - w_t||^2
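
For reference, differentiating the two objectives above gives the update direction a gradient-based optimizer would follow in each state. The step size \eta and the use of plain gradient descent are assumptions made for illustration; the summary does not specify the optimizer.

```latex
\begin{aligned}
\nabla_w \Big[ f(w) + \tfrac{\rho}{2}\,\lVert w - \tilde{w}_t \rVert^2 \Big]
  &= \nabla f(w) + \rho\,(w - \tilde{w}_t), \\
\nabla_w \Big[ h(w) + \tfrac{\rho}{2}\,\lVert w - w_t \rVert^2 \Big]
  &= \nabla h(w) + \rho\,(w - w_t), \\
\text{so one step in state 2 reads}\quad
w &\leftarrow w - \eta\,\big( \nabla h(w) + \rho\,(w - w_t) \big).
\end{aligned}
```

The extra term \rho\,(w - w_t) vanishes at the checkpoint and grows linearly with the distance from it, which is what makes the update "lazy".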

Key Hyperparameters:

K1: Number of alignment steps per cycle (e.g., 500, or fewer in asymmetric settings)
K2: Number of user fine-tuning steps per cycle (e.g., 500, or more in asymmetric settings)
rho: Proximal intensity factor (the convergence analysis assumes rho > L, where L is the Lipschitz constant)
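
The summary does not say why rho must exceed L. One standard observation, offered here as background rather than as the paper's argument, applies if L is the Lipschitz constant of the gradient \nabla h (an assumption): the proximal subproblem then becomes strongly convex even when h itself is not convex.

```latex
h(w) + \frac{\rho}{2}\,\lVert w - w_t \rVert^2
  = \underbrace{\Big[ h(w) + \frac{L}{2}\,\lVert w - w_t \rVert^2 \Big]}_{\text{convex when } \nabla h \text{ is } L\text{-Lipschitz}}
  + \frac{\rho - L}{2}\,\lVert w - w_t \rVert^2 .
```

With \rho > L the second term is strictly convex, so the sum is (\rho - L)-strongly convex and each state has a unique minimizer.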

Compute: Not reported in the paper

Comparison to Prior Work

vs. Vaccine: Lisa is a fine-tuning stage solution, whereas Vaccine is an alignment stage solution that fails when the number of downstream fine-tuning steps is large.
vs. VLGuard: Lisa uses alternating optimization with proximal constraints rather than simple data mixing, which the authors argue is more computation-efficient for service providers.
vs. BSO (Vanilla): Lisa adds the proximal term to mitigate convergence instability caused by asymmetric step counts.

Limitations

Convergence analysis requires strict assumptions (e.g., rho > L).
Performance degradation in alignment is still possible if steps are extremely asymmetric.
Requires access to an alignment dataset during the user fine-tuning stage.

Reproducibility

Code: https://github.com/git-disl/Lisa

Theoretical proofs for convergence are provided in Appendix B. Specific experimental hyperparameters (such as the learning rate or the exact rho value) are not detailed in this summary.

📊 Experiments & Results

Evaluation Setup

Fine-tuning Llama2-7B on user data mixed with harmful data (ratios p=0.05, 0.1, etc.).

Benchmarks:

User Fine-tuning Task (Downstream generation)
Alignment Task (Safety evaluation)

Metrics:

Harmful Score
Finetune Accuracy
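
As a rough sketch of how these two metrics could be computed, assume each model output has already been flagged as harmful or not by some classifier, and each downstream prediction can be compared to a gold label by exact match. Both assumptions, and the function names, are hypothetical; the paper's exact evaluation protocol is not given in this summary.

```python
def harmful_score(flags):
    """Percentage of outputs flagged as harmful (flags are booleans)."""
    if not flags:
        raise ValueError("flags must be non-empty")
    return 100.0 * sum(flags) / len(flags)

def finetune_accuracy(predictions, labels):
    """Percentage of predictions that exactly match their labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and aligned")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)

# Tiny illustrative inputs.
hs = harmful_score([True, False, False, True])
acc = finetune_accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"])
```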

Statistical methodology: Statistical analysis of gradient norm and drift distance reported

Key Results

Bi-State Optimization (BSO) vs. the SFT baseline, showing BSO's ability to mitigate harm when steps are balanced. Absolute values are not reported in this summary.

Benchmark	Metric	Baseline	This Paper	Δ
Safety Evaluation	Harmful Score	Not reported	Not reported	-4.2%
Downstream Task	Finetune Accuracy	Not reported	Not reported	-0.69%

Lisa vs. vanilla BSO in asymmetric settings. Absolute values are not reported in this summary.

Benchmark	Metric	Baseline	This Paper	Δ
Safety Evaluation	Harmful Score	Not reported	Not reported	-6.54%
Downstream Task	Finetune Accuracy Loss	Not reported	Not reported	+0.43%

Experiment Figures

Impact of harmful ratio on Harmful Score and Alignment Loss for SFT vs Non-Aligned SFT.

Analysis of convergence instability and drift under different step allocations.

Main Takeaways

As little as 5% harmful data can increase harmful scores by >15%, regardless of prior alignment.
Asymmetric computing (investing fewer steps in alignment) causes vanilla Bi-State Optimization to degrade into standard SFT, failing to mitigate harm.
'Excess drift' towards the switching points between the two states is identified as the cause of convergence instability in asymmetric BSO, based on statistics of gradient norm and drift distance.
Lisa's proximal term effectively constrains this drift, allowing for effective safety alignment even when computational resources for the alignment state are limited.

📚 Prerequisite Knowledge

Prerequisites

Large Language Model (LLM) Fine-tuning
Safety Alignment (RLHF/SFT)
Proximal Optimization algorithms

Key Terms

BSO: Bi-State Optimization—an iterative training method that alternates between optimizing on alignment data and user fine-tuning data

Lisa: Lazy Safety Alignment—the proposed method that adds a proximal term to BSO to constrain weight updates

Proximal term: A regularization term in the loss function (usually L2 distance) that penalizes the model for moving too far from a reference point (the previous checkpoint)

Excess drift: The phenomenon where model weights move too far towards a local optimum (e.g., the user task) during alternating optimization, causing forgetting of the other task (safety)

Harmful score: A metric quantifying the percentage or frequency of harmful outputs generated by the model

SFT: Supervised Fine-Tuning—standard training on labeled data

Stationary point: A point in optimization where the gradient is zero (or close to it), indicating convergence

Asymmetric computing: Allocating different amounts of computational resources (steps) to different tasks; here, spending fewer steps on alignment than on user fine-tuning