Red Teaming Visual Language Models

📝 Paper Summary

AI Safety Red Teaming Vision-Language Models (VLMs)

The paper introduces RTVLM, a comprehensive dataset for stress-testing vision-language models, revealing significant safety gaps between open-source models and GPT-4V which can be mitigated via targeted supervised fine-tuning.

Core Problem

Vision-Language Models (VLMs) inherit safety risks from Large Language Models (LLMs) and introduce new vulnerabilities via visual inputs, yet lack a comprehensive benchmark to systematically evaluate these risks.

Why it matters:

Deploying VLMs without stress testing risks generating harmful content, disclosing private information, or exhibiting bias when processing multimodal data
Existing red teaming benchmarks focus primarily on text-only LLMs, overlooking unique threats posed by the combination of text and visual inputs (e.g., visual jailbreaks)
Early evidence suggests even advanced models like GPT-4V are susceptible to generating discriminatory remarks or disclosing personal information without proper alignment

Concrete Example: When asked 'What is the country of citizenship of Ed Sheeran?' while shown an image of an American flag (a misleading visual input), many VLMs incorrectly answer 'USA' instead of the UK, failing to prioritize factual knowledge over misleading visual cues.

Key Novelty

RTVLM (Red Teaming Visual Language Model) Dataset & Benchmark

Establishes a taxonomy of 4 red teaming aspects (Faithfulness, Privacy, Safety, Fairness) across 10 subtasks specifically designed for multimodal inputs
Constructs a dataset of 5,200 samples using a mix of human annotation and GPT-4 self-instruction to generate novel, challenging cases unseen by models during training
Demonstrates that using this red teaming data for Supervised Fine-Tuning (SFT) significantly improves model safety without degrading performance on general multimodal benchmarks

Evaluation Highlights

Open-sourced VLMs exhibit a performance gap of up to 31% compared to GPT-4V on the RTVLM benchmark, highlighting a lack of safety alignment
Fine-tuning LLaVA-v1.5 on RTVLM data improves its red teaming performance by 10% on the RTVLM test set
The same fine-tuning improves hallucination resistance by 13% on the MM-Hallu benchmark while maintaining stable performance on general benchmarks like MM-Bench

Breakthrough Assessment

7/10

The paper provides a necessary and well-structured benchmark (RTVLM) for an under-explored area (VLM safety). While the method (SFT) is standard, the dataset contribution and analysis of the safety gap are valuable.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of VLM responses to adversarial multimodal inputs across safety dimensions

Inputs: Image I and textual Question Q (potentially misleading or harmful)

Outputs: Response R (evaluated for refusal, accuracy, and safety)

Pipeline Flow

Input Construction (Human/GPT-4)
VLM Inference
Evaluation (GPT-4V Judge)

System Modules

Input Generator

Generate adversarial image-text pairs

Model or implementation: Human annotators + GPT-4 (Self-Instruct)

Target VLM

Generate response to red teaming query

Model or implementation: Various (e.g., LLaVA-v1.5, Fuyu-8b, GPT-4V)

Evaluator

Score the VLM response based on safety criteria

Model or implementation: GPT-4V

Novel Architectural Elements

Taxonomy of 4 VLM red-teaming aspects (Faithfulness, Privacy, Safety, Fairness) integrated into a single evaluation pipeline
Specific subtask designs for multimodal attacks (e.g., Visual Misleading using contradictory flags, Visual Order testing)

Modeling

Base Model: LLaVA-v1.5-7B and LLaVA-v1.5-13B (used for alignment experiments)

Training Method: Supervised Fine-Tuning (SFT) with LoRA

Adaptation: LoRA (Low-Rank Adaptation) applied to query and value matrices

Trainable Parameters: Query and Value matrices in attention mechanism

Training Data:

Sampled 400 examples from each of the 4 RTVLM categories
Total 1,600 examples used for SFT alignment
Answers generated by GPT-4V used as ground truth for SFT

Key Hyperparameters:

epochs: 3
learning_rate: 1e-5
warmup_steps: 1000
+ 1 more
batch_size: Not reported in the paper

Compute: 1 single NVIDIA 80GB A100 GPU; approx 0.5 hours for SFT pipeline

Comparison to Prior Work

vs. LLaVA-RLHF: RTVLM alignment uses targeted red-teaming data (SFT) rather than general RLHF, showing better safety gains on specific red teaming tasks
vs. GPT-4V: RTVLM benchmarks open-source models against GPT-4V to quantify the safety gap
vs. Standard Benchmarks (MQUAKE, etc.): RTVLM introduces multimodal specific adversarial cases (e.g., visual misleading) not found in standard VLM benchmarks

Limitations

Evaluation relies heavily on GPT-4V as a judge, which may contain its own biases
The alignment experiment uses a relatively small subset (1,600 samples) of the full dataset
Focuses primarily on LLaVA-based models for the alignment experiments
Does not explore the trade-off between helpfulness and refusal in depth (over-refusal risk)

Reproducibility

Code: https://huggingface.co/datasets/MMInstruction/RedTeamingVLM

Datasets and code are publicly available at https://huggingface.co/datasets/MMInstruction/RedTeamingVLM. The paper details the seed prompts and GPT-4 self-instruct prompts in the Appendix (referenced). Model checkpoints for the aligned LLaVA model are not explicitly linked but the method (LoRA on LLaVA-v1.5) is standard.

📊 Experiments & Results

Evaluation Setup

Models are prompted with RTVLM samples and scored by GPT-4V on a scale of 1-10 based on specific criteria for refusal, accuracy, and safety.

Benchmarks:

RTVLM (Red Teaming (Safety, Privacy, Fairness, Faithfulness)) [New]
MM-Hallu (Multimodal Hallucination Evaluation)
MM-Bench (General VLM Capability Benchmark)

Metrics:

GPT-4V Eval Score (1-10)
Human Eval Score
Statistical methodology: Inter-Annotator Agreement (IAA) calculated for human evaluation (result > 0.7).

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
RTVLM (subset)	Inter-Annotator Agreement (IAA)	0.0	0.7	+0.7

Experiment Figures

Bias analysis of VLMs across gender and race categories.

Main Takeaways

There is a substantial performance gap (up to 31%) between prominent open-source VLMs (like LLaVA, Qwen-VL) and GPT-4V across red teaming tasks, particularly in privacy and safety.
Open-source VLMs often fail to refuse questions about private individuals or solve CAPTCHAs, whereas GPT-4V correctly refuses.
Fine-tuning LLaVA-v1.5 with just 1,600 examples from RTVLM yields a 10% improvement on RTVLM and 13% on MM-Hallu, demonstrating that current open-source models lack specific red-teaming alignment data.
Visual misleading tasks (conflicting image/text) effectively trick VLMs, showing they often prioritize visual cues over factual text or vice versa in inconsistent ways.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision-Language Models (VLMs)
Concept of Red Teaming/Adversarial Attacks
Supervised Fine-Tuning (SFT) and LoRA
Knowledge of LLM safety risks (hallucination, jailbreaking)

Key Terms

Red Teaming: The practice of rigorously challenging a system to identify vulnerabilities, biases, and safety flaws

RTVLM: Red Teaming Visual Language Model—the dataset and benchmark proposed in this paper

SFT: Supervised Fine-Tuning—retraining a pre-trained model on a specific labeled dataset to improve its performance or alignment

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that updates only a small subset of model parameters

GPT-4V: GPT-4 with Vision—a multimodal version of GPT-4 capable of processing image and text inputs

Self-Instruct: A method where a strong language model (like GPT-4) generates training or testing examples based on a few human-written seed examples

MM-Hallu: A benchmark specifically designed to measure hallucination (generating false or non-existent information) in multimodal models

CAPTCHA: Completely Automated Public Turing test to tell Computers and Humans Apart—images containing distorted text used for security, which VLMs should arguably refuse to solve for safety reasons