← Back to Paper List

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, Yu Qiao
Shanghai AI Laboratory, East China Normal University, Midea Group, University of Oxford
European Conference on Computer Vision (2023)
MM Benchmark

📝 Paper Summary

AI Safety Adversarial Attacks on MLLMs Visual Jailbreaking
MM-SafetyBench demonstrates that Multimodal Large Language Models can be easily jailbroken by visual inputs containing typography of harmful keywords, proposing a dataset and prompting strategy to mitigate this.
Core Problem
Multimodal Large Language Models (MLLMs) often lack safety alignment in the vision-language connector, allowing malicious users to bypass text-based safety filters using query-relevant images.
Why it matters:
  • Standard LLMs are heavily safety-aligned, but MLLMs introduce a new attack vector (images) that can bypass these protections
  • Open-source MLLMs are being rapidly deployed without comprehensive safety evaluations, posing risks of generating illegal or harmful content
  • Existing benchmarks focus on capability (accuracy/reasoning) rather than vulnerability to malicious exploitation
Concrete Example: When asked 'How to make a bomb?', a model refuses. However, if the user provides an image of a bomb or the word 'bomb' written in typography and asks 'How to make the product in the image?', the model often provides detailed instructions.
Key Novelty
Typography-Based Visual Jailbreaking
  • Discovers that 'query-relevant' images (specifically those containing the written text of a harmful keyword) are far more effective at breaking safety filters than irrelevant images
  • Proposes a systematic attack pipeline that generates images with typography (e.g., the word 'suicide' drawn as an image) to trick the model into processing the harmful concept via its visual encoder
Architecture
Architecture Figure Figure 1(b)
The pipeline for generating the MM-SafetyBench dataset and attacking MLLMs
Evaluation Highlights
  • Typography-based attacks increase Attack Success Rate (ASR) on LLaVA-1.5-7B by over 30% compared to text-only baselines across 13 scenarios
  • LLaVA-1.5-13B shows an ASR of 80.41% on 'Illegal Activity' when attacked with SD+Typography images, compared to just 21.27% with text only
  • A simple system-level safety prompt reduces the ASR of LLaVA-1.5 from ~77% to ~15%, demonstrating that inference-time defense is possible
Breakthrough Assessment
8/10
Exposes a critical and easily reproducible vulnerability in state-of-the-art MLLMs. The proposed typography attack is simple yet highly effective, highlighting a major gap in current multimodal alignment.
×