← Back to Paper List

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, Timothy M. Hospedales
University of Edinburgh, Samsung AI Center, Cambridge
International Conference on Machine Learning (2024)
MM Benchmark RL

📝 Paper Summary

Vision-Language Models Safety Alignment Adversarial Robustness
Standard visual instruction tuning degrades the safety alignment of base LLMs, but fine-tuning on the proposed VLGuard dataset restores safety against multimodal and text attacks without compromising helpfulness.
Core Problem
Visual instruction tuning (fine-tuning LLMs on image-text pairs) causes models to 'forget' prior safety alignment, making VLLMs significantly more vulnerable to jailbreaks than their base LLMs.
Why it matters:
  • VLLMs are susceptible to new multimodal attack vectors (harmful images) that text-only safety measures cannot catch
  • Analysis reveals standard VLLM training datasets (like LLaVA-Instruct) contain inadvertently harmful content generated by LLMs, actively unaligning the models
  • LoRA fine-tuning, widely used for efficiency, is shown to exacerbate safety forgetting compared to full fine-tuning
Concrete Example: When tested on the AdvBench text dataset, the LLaVA-v1.5-7B model accepts 39.0% of harmful suffix-attack instructions, whereas its underlying base LLM (Vicuna-7B) only accepts 5.2%, showing a massive regression in safety.
Key Novelty
VLGuard Dataset & Fine-Tuning Strategy
  • Curates a dataset of 2,000 images covering privacy, risky behavior, deception, and hate speech, including both safe and harmful images
  • Generates pairs of safe and unsafe instructions for safe images (to test instruction compliance vs. refusal) and refusal responses for harmful images using GPT-4V
  • Demonstrates that mixing this small dataset into training (Mixed Fine-Tuning) or using it for Post-Hoc Fine-Tuning effectively safety-aligns models
Architecture
Architecture Figure Figure 1
Illustration of the VLLM safety problem and the VLGuard solution. It contrasts current VLLMs (which answer harmful queries) with VLGuard-tuned models (which refuse them).
Evaluation Highlights
  • Reduces Attack Success Rate (ASR) on unsafe instructions for safe images from 53.6% to 1.1% (LLaVA-v1.5-7B with Mixed Fine-Tuning)
  • Reduces ASR on harmful images from 35.8% to 0.5% (LLaVA-v1.5-7B with Mixed Fine-Tuning)
  • Maintains helpfulness: Win rate on safe instructions vs GPT-4V increases slightly from 70.3% to 71.4% after safety fine-tuning
Breakthrough Assessment
8/10
Identifies a critical vulnerability in standard VLLM training (safety forgetting) and provides a low-cost, effective solution (VLGuard) that is likely to become a standard baseline for future safety work.
×