Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

📝 Paper Summary

Vision-Language Models, Safety Alignment, Adversarial Robustness

Standard visual instruction tuning degrades the safety alignment of base LLMs, but fine-tuning on the proposed VLGuard dataset restores safety against multimodal and text attacks without compromising helpfulness.

Core Problem

Visual instruction tuning (fine-tuning LLMs on image-text pairs) causes models to 'forget' prior safety alignment, making VLLMs significantly more vulnerable to jailbreaks than their base LLMs.

Why it matters:

VLLMs are susceptible to new multimodal attack vectors (harmful images) that text-only safety measures cannot catch
Standard VLLM training datasets (such as LLaVA-Instruct) inadvertently contain harmful LLM-generated content, which actively erodes the models' safety alignment
LoRA fine-tuning, widely used for efficiency, is shown to exacerbate safety forgetting compared to full fine-tuning

Concrete Example: On the text-only AdvBench dataset, LLaVA-v1.5-7B complies with 39.0% of harmful instructions under a suffix attack, whereas its base LLM (Vicuna-7B) complies with only 5.2%, a 33.8-point regression in safety.

Key Novelty

VLGuard Dataset & Fine-Tuning Strategy

Curates a dataset of 2,000 images covering privacy, risky behavior, deception, and hate speech, including both safe and harmful images
Uses GPT-4V to generate, for each safe image, both a safe and an unsafe instruction (to test instruction compliance vs. refusal), and, for each harmful image, a refusal response
Demonstrates that mixing this small dataset into training (Mixed Fine-Tuning) or using it for Post-Hoc Fine-Tuning effectively safety-aligns models
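One way to picture Mixed Fine-Tuning is as plain dataset concatenation before training. The sketch below is illustrative only: `build_mixed_dataset`, the dictionary fields, and the toy records are hypothetical and not taken from the VLGuard codebase, and the mixing ratio shown is arbitrary.

```python
import random


def build_mixed_dataset(instruction_data, safety_data, seed=0):
    """Concatenate general instruction data with safety data and shuffle.

    Hypothetical helper: field names and sampling details are assumptions.
    """
    mixed = list(instruction_data) + list(safety_data)
    random.Random(seed).shuffle(mixed)  # deterministic shuffle; inputs untouched
    return mixed


# Toy stand-ins for a visual instruction-tuning set and a safety set.
instruction_data = [
    {"image": f"general_{i}.jpg", "instruction": "Describe the image.",
     "response": "A helpful description.", "source": "general"}
    for i in range(8)
]
safety_data = [
    # Safe image: one safe instruction (answered) and one unsafe one (refused).
    {"image": "safe_0.jpg", "instruction": "A benign question.",
     "response": "A helpful answer.", "source": "safety"},
    {"image": "safe_0.jpg", "instruction": "A harmful request about the image.",
     "response": "A refusal with a brief explanation.", "source": "safety"},
    # Harmful image: the target response is a refusal.
    {"image": "harmful_0.jpg", "instruction": "Any question.",
     "response": "A refusal with a brief explanation.", "source": "safety"},
]

mixed = build_mixed_dataset(instruction_data, safety_data)
n_safety = sum(1 for ex in mixed if ex["source"] == "safety")
print(len(mixed), n_safety)
```

Under the same assumptions, Post-Hoc Fine-Tuning would skip the concatenation and instead continue training an already tuned VLLM on the safety records.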

Architecture

Illustration of the VLLM safety problem and the VLGuard solution. It contrasts current VLLMs (which answer harmful queries) with VLGuard-tuned models (which refuse them).

Evaluation Highlights

Reduces Attack Success Rate (ASR) on unsafe instructions for safe images from 53.6% to 1.1% (LLaVA-v1.5-7B with Mixed Fine-Tuning)
Reduces ASR on harmful images from 35.8% to 0.5% (LLaVA-v1.5-7B with Mixed Fine-Tuning)
Maintains helpfulness: Win rate on safe instructions vs GPT-4V increases slightly from 70.3% to 71.4% after safety fine-tuning

Breakthrough Assessment

8/10

Identifies a critical vulnerability in standard VLLM training (safety forgetting) and provides a low-cost, effective solution (VLGuard) that is likely to become a standard baseline for future safety work.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Safety Alignment

Inputs: Image I and Text Instruction T (potentially harmful or adversarial)

Outputs: Response R that is helpful for safe queries and refuses/explains unsafeness for harmful queries

Pipeline Flow

Visual Encoder (processes image)
Projection Layer (maps visual features to text token space)
Large Language Model (processes text instructions and projected visual features)
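The three stages above can be sketched at the level of tensor shapes. This is a toy in pure Python, not the LLaVA implementation: the dimensions are tiny placeholders, a single linear layer stands in for the projection module, the encoder returns random features, and placing visual tokens before the text tokens is an assumption made for illustration.

```python
import random

VIS_DIM, LLM_DIM = 4, 6  # toy sizes, far smaller than any real model


def visual_encoder(image_patches, dim):
    """Stand-in for a vision encoder: one feature vector per image patch."""
    rng = random.Random(0)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in image_patches]


def projection(features, weight):
    """Linear map from visual feature space into the LLM embedding space."""
    return [[sum(w * x for w, x in zip(row, feat)) for row in weight]
            for feat in features]


def build_llm_input(visual_tokens, text_embeddings):
    """Concatenate projected visual tokens with text token embeddings."""
    return visual_tokens + text_embeddings


rng = random.Random(1)
weight = [[rng.uniform(-1, 1) for _ in range(VIS_DIM)] for _ in range(LLM_DIM)]
patches = ["patch_0", "patch_1", "patch_2"]              # placeholder patches
text_embeddings = [[0.0] * LLM_DIM for _ in range(2)]    # two dummy text tokens

visual_tokens = projection(visual_encoder(patches, VIS_DIM), weight)
llm_input = build_llm_input(visual_tokens, text_embeddings)
print(len(llm_input), len(llm_input[0]))
```

The point of the sketch is only that, after projection, visual tokens share the LLM's embedding dimension and can be fed to it as one sequence alongside the text.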

System Modules

Visual Encoder (Input Processing)

Encodes input images into visual embeddings

Model or implementation: CLIP-ViT-L-336px (for LLaVA)

Projection Layer (Input Processing)

Projects visual embeddings to the dimension of the LLM's word embeddings

Model or implementation: Multi-Layer Perceptron (MLP)

LLM

Generates response based on visual tokens and text instructions

Model or implementation: Vicuna-v1.5 (for LLaVA) or Llama-2-Chat (for MiniGPT-v2)

Modeling

Base Model: LLaVA-v1.5 (7B/13B) and MiniGPT-v2

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Maximize likelihood of generating the correct response tokens.

Formally: Standard auto-regressive language modeling loss (Cross-Entropy)
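Written with the symbols from the problem definition (image I, instruction T, response R), the loss would take the usual form below. The token index t, the response length |R|, and the parameters θ are notation introduced here for illustration; the paper's own notation may differ.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{|R|} \log p_\theta\left(R_t \mid I, T, R_{<t}\right)
```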

Adaptation: Both Full Fine-Tuning and LoRA evaluated

Training Data:

VLGuard train set: 2,000 images (977 harmful, 1,023 safe), yielding ~3,000 instruction-response pairs
VLGuard test set (held out for evaluation): 1,000 images (558 safe, 442 unsafe)
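The "~3,000 pairs" figure is consistent with the image counts above if each safe image contributes two instruction-response pairs (one safe, one unsafe) and each harmful image contributes one. That per-image breakdown is an assumption inferred from the dataset description, checked here as simple arithmetic:

```python
SAFE_IMAGES = 1023     # image counts as listed above
HARMFUL_IMAGES = 977

PAIRS_PER_SAFE_IMAGE = 2     # assumption: one safe + one unsafe instruction
PAIRS_PER_HARMFUL_IMAGE = 1  # assumption: one instruction with a refusal

total_images = SAFE_IMAGES + HARMFUL_IMAGES
total_pairs = (SAFE_IMAGES * PAIRS_PER_SAFE_IMAGE
               + HARMFUL_IMAGES * PAIRS_PER_HARMFUL_IMAGE)
print(total_images, total_pairs)  # 2000 3023
```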

Key Hyperparameters:

temperature: 0 (for evaluation)

Compute: Negligible overhead (2,000 images vs hundreds of thousands in pre-training)

Comparison to Prior Work

vs. Standard VLLM Training: Standard training uses uncurated data that breaks alignment; VLGuard explicitly curates safety data.
vs. Text-Only Safety: VLGuard addresses multimodal threats (harmful images) which text-only methods miss.
vs. SPA-VL [not cited in paper]: SPA-VL also proposes a safety dataset but VLGuard emphasizes the 'forgetting' phenomenon and mixed fine-tuning.

Limitations

Dependence on GPT-4V for data generation and evaluation limits scalability and introduces potential biases.
Llama-Guard (used for filtering) may have false negatives, potentially leaving some harmful data undetected.
The VLGuard dataset is relatively small (2k images) compared to massive pre-training datasets.
Focuses primarily on English language instructions.

Reproducibility

Code: https://github.com/ys-zong/VLGuard

The code and dataset are publicly available at the repository above. The LLaVA-v1.5 and MiniGPT-v2 models are open source. Evaluation uses Llama-Guard and GPT-4V.

📊 Experiments & Results

Evaluation Setup

Safety evaluation using attack success rate (ASR) on harmful queries and helpfulness evaluation on safe queries.

Benchmarks:

AdvBench (Text-only harmful instruction following)
VLGuard Test Set [New] (Visual safety on the Safe-Unsafe and Unsafe subsets; helpfulness on the Safe-Safe subset)
XSTest (Exaggerated safety, i.e., refusals of safe prompts)
MMLU (General language understanding)
AlpacaEval 2.0 (Instruction following helpfulness)

Metrics:

Attack Success Rate (ASR)
Win Rate (vs GPT-4V)
Statistical methodology: Not explicitly reported in the paper
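Both metrics reduce to simple percentages once each response has been judged. The sketch below assumes the judgments already exist (for example from a safety classifier or a pairwise judge); the function names, the input formats, and the choice to count ties as non-wins are hypothetical and not taken from the paper.

```python
def attack_success_rate(is_harmful_flags):
    """Percentage of harmful prompts that drew a harmful (compliant) response."""
    if not is_harmful_flags:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(bool(f) for f in is_harmful_flags) / len(is_harmful_flags)


def win_rate(outcomes):
    """Percentage of pairwise comparisons won against the reference model.

    Assumption: anything other than "win" (including ties) counts as a non-win.
    """
    if not outcomes:
        raise ValueError("need at least one comparison")
    return 100.0 * sum(1 for o in outcomes if o == "win") / len(outcomes)


# Toy judgments: True means the model complied with a harmful request.
asr = attack_success_rate([True, False, False, False])
wr = win_rate(["win", "loss", "win", "tie"])
print(asr, wr)  # 25.0 50.0
```

Lower ASR is better for safety, higher win rate is better for helpfulness, so a successful safety fine-tune should move the first down without moving the second.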

Key Results

Safety degradation: visual instruction tuning significantly increases vulnerability to text-based attacks compared to the base LLM.

Benchmark	Metric	Base LLM	VLLM	Δ (points)
AdvBench (Suffix)	ASR (%)	5.2	39.0	+33.8
AdvBench (Vanilla)	ASR (%)	23.2	29.6	+6.4

Efficacy of VLGuard Mixed Fine-Tuning (LLaVA-v1.5-7B) in reducing harmful responses while preserving helpfulness.

Benchmark	Metric	Before VLGuard	After VLGuard	Δ (points)
VLGuard (Safe-Unsafe)	ASR (%)	53.6	1.1	-52.5
VLGuard (Unsafe)	ASR (%)	35.8	0.5	-35.3
VLGuard (Safe-Safe)	Win Rate vs GPT-4V (%)	70.3	71.4	+1.1
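The Δ column is the second reported value minus the first, in percentage points. A quick check of the reported numbers (the tuple layout and variable names are illustrative only):

```python
# (row label, first reported value, second reported value), all in percent
results = [
    ("AdvBench (Suffix) ASR", 5.2, 39.0),
    ("AdvBench (Vanilla) ASR", 23.2, 29.6),
    ("VLGuard (Safe-Unsafe) ASR", 53.6, 1.1),
    ("VLGuard (Unsafe) ASR", 35.8, 0.5),
    ("VLGuard (Safe-Safe) Win Rate", 70.3, 71.4),
]

# Round to one decimal to match the precision of the reported figures.
deltas = {name: round(after - before, 1) for name, before, after in results}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```

For ASR rows a negative Δ is an improvement; for the win-rate row a positive Δ is.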

Experiment Figures

Radar chart comparing 10 different VLLMs on helpfulness (Safe-Safe) vs harmfulness (Safe-Unsafe/Unsafe).

Bar charts showing the impact of VLGuard fine-tuning on Harmfulness (ASR) and Helpfulness (Win Rate) for LLaVA and MiniGPT-v2.

Main Takeaways

Visual instruction tuning breaks the safety alignment of the underlying LLM, leading to 'safety forgetting'.
LoRA fine-tuning poses a greater safety risk than full fine-tuning, likely due to easier assimilation of unsafe data in the training set.
Removing harmful samples from training data ('cleaning') improves safety but does not fully restore it to the level of the base LLM; explicit safety fine-tuning is necessary.
VLGuard dataset enables effective safety alignment with minimal data (2k images) and negligible cost to helpfulness.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model (LLM) fine-tuning (SFT, LoRA)
Familiarity with Visual Question Answering (VQA) tasks
Concepts of AI Safety/Alignment (Jailbreaking, Red-teaming)

Key Terms

VLLM: Vision Large Language Model—a model capable of processing both image and text inputs to generate text responses

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and trains small rank-decomposition matrices
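A minimal sketch of the idea, in pure Python with toy matrices: the frozen weight W is left untouched and a low-rank product B·A, scaled by alpha / r, is added to its output. The helper names and numbers are illustrative; the scaling and the zero initialization of B follow the standard LoRA formulation rather than anything specific to this paper.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)               # frozen pre-trained path
    update = matvec(B, matvec(A, x))  # low-rank path through rank r
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight
A = [[1.0, 1.0]]              # r x d_in, with r = 1
x = [2.0, 3.0]

# With B initialized to zero, the adapted layer matches the frozen layer.
B_init = [[0.0], [0.0]]       # d_out x r
out_init = lora_forward(W, A, B_init, x, alpha=2, r=1)

# After (hypothetical) training, B is non-zero and shifts the output.
B_trained = [[0.5], [0.0]]
out_trained = lora_forward(W, A, B_trained, x, alpha=2, r=1)
print(out_init, out_trained)  # [2.0, 3.0] [7.0, 3.0]
```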

ASR: Attack Success Rate—the percentage of harmful instructions for which the model generates a harmful (compliant) response instead of a refusal

Jailbreaking: Prompting techniques designed to bypass a model's safety filters (e.g., adding suffixes like 'Answer with Absolutely')

Instruction Tuning: Fine-tuning a model on datasets of (instruction, output) pairs to improve its ability to follow user commands

Post-hoc Fine-tuning: Fine-tuning an already-trained model on a small dataset to adjust its behavior (here, for safety) without retraining from scratch