OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

📝 Paper Summary

Multimodal Large Language Models (MLLMs) · AI Safety and Security · Jailbreak Attacks and Defenses

OmniSafeBench-MM unifies multimodal jailbreak evaluation by integrating 13 attacks, 15 defenses, and a comprehensive dataset into a reproducible toolbox with a three-dimensional safety scoring system.

Core Problem

Current MLLM safety benchmarks focus on limited attack scenarios, lack standardized defense evaluations, and rely on simplistic binary metrics (Attack Success Rate), obscuring nuanced safety-utility trade-offs.

Why it matters:

Attackers can exploit visual context (e.g., hidden text in images) to bypass safety alignment, creating risks ranging from individual harm to societal threats
Existing benchmarks such as MM-SafetyBench lack comprehensive risk categories, do not distinguish inquiry types (e.g., consultative vs. imperative), and do not support reproducible defense comparisons
Binary success metrics fail to capture cases where defenses reduce harmfulness but destroy model helpfulness, or where attacks succeed partially but lack detail

Concrete Example: A user might embed a malicious query about making a bomb inside an innocuous image. Current benchmarks would simply label any flagged response 'unsafe', failing to distinguish between a detailed bomb-making recipe (catastrophic) and a vague, non-actionable answer that technically fails the safety check but provides little usable information.

Key Novelty

Three-Dimensional Safety Evaluation Protocol

Evaluates responses on Harmfulness (severity of consequence), Intent Alignment (did it answer the prompt?), and Level of Detail (how actionable is the info?), rather than just binary success
Introduces a granular dataset structure categorizing prompts by inquiry type (consultative, imperative, declarative) to mirror real-world user intent more accurately
Unifies 13 attacks and 15 defenses into a single modular code base, enabling direct apples-to-apples comparison of diverse strategies (e.g., visual perturbations vs. linguistic deception)

Architecture

The overall framework of OmniSafeBench-MM, illustrating the pipeline from dataset generation to evaluation.

Evaluation Highlights

Evaluated 18 MLLMs (10 open-source, 8 closed-source) against 13 attack types, revealing significant vulnerabilities across architectures
Demonstrates that some defenses effectively reduce harmfulness scores but significantly degrade intent alignment (helpfulness), a trade-off invisible in standard Attack Success Rate (ASR) metrics
Automated pipeline generates risk images covering 9 major domains and 50 fine-grained categories, exceeding the coverage of prior benchmarks like JailBreakV-28K

Breakthrough Assessment

8/10

Significant for unifying fragmented attack/defense methods into one reproducible library and proposing a much-needed multi-dimensional metric. The automated data generation and granular categorization add substantial value.

⚙️ Technical Details

Problem Definition

Setting: Adversarial attack and defense evaluation for MLLMs

Inputs: Multimodal prompt pair (T, I) where T is text and I is image

Outputs: Model response y = M(T, I) evaluated for safety and utility
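
The setting above can be written down as a small type sketch. The names `MultimodalPrompt`, `MLLM`, `EchoModel`, and `respond` are hypothetical and are not taken from the toolbox; holding the image as raw bytes is likewise an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class MultimodalPrompt:
    """A prompt pair (T, I): text plus image."""
    text: str     # T
    image: bytes  # I, held as encoded image bytes (assumption for this sketch)


class MLLM(Protocol):
    """Anything that maps a prompt pair to a response y = M(T, I)."""

    def generate(self, prompt: MultimodalPrompt) -> str: ...


class EchoModel:
    """Stand-in model used only to exercise the interface; not a real MLLM."""

    def generate(self, prompt: MultimodalPrompt) -> str:
        return f"received {len(prompt.image)} image bytes and text: {prompt.text}"


def respond(model: MLLM, prompt: MultimodalPrompt) -> str:
    # y = M(T, I); this response is what later gets scored for safety and utility.
    return model.generate(prompt)


y = respond(EchoModel(), MultimodalPrompt(text="Describe the image.", image=b"\x89PNG"))
```

Typing the model as a `Protocol` means open-source and API-backed models could share one call site, which is the property a unified benchmark needs.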

Pipeline Flow

Dataset Generation (Text generation -> Keyword Extraction -> Image Generation)
Attack Execution (Apply 13 methods to generate adversarial (T', I'))
Defense Execution (Apply 15 methods: Pre-processing, Post-processing, or Inference intervention)
Evaluation (Score Harmfulness, Alignment, Detail)
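
The attack, defense, and model stages above compose roughly as in the following sketch. Every function here is a toy stand-in with a hypothetical name, not the toolbox's API. Dataset generation and scoring are left out, and inference-time intervention is omitted because it would wrap the model call itself rather than sit before or after it.

```python
from typing import Callable, Tuple

Prompt = Tuple[str, bytes]               # (T, I)
Attack = Callable[[Prompt], Prompt]      # (T, I) -> (T', I')
PreDefense = Callable[[Prompt], Prompt]  # pre-processing: sanitises the input
Model = Callable[[Prompt], str]          # y = M(T', I')
PostDefense = Callable[[str], str]       # post-processing: filters the output


def run_case(prompt: Prompt, attack: Attack, pre: PreDefense,
             model: Model, post: PostDefense) -> str:
    """Run one prompt through attack -> pre-defense -> model -> post-defense."""
    adversarial = attack(prompt)
    defended_input = pre(adversarial)
    raw_response = model(defended_input)
    return post(raw_response)


# Toy stand-ins (hypothetical): the "attack" appends a marker to the text,
# the pre-processing defense strips it, the post-processing defense blocks it.
def toy_attack(p: Prompt) -> Prompt:
    return (p[0] + " [ADV]", p[1])


def strip_marker(p: Prompt) -> Prompt:
    return (p[0].replace(" [ADV]", ""), p[1])


def no_defense(p: Prompt) -> Prompt:
    return p


def toy_model(p: Prompt) -> str:
    return f"echo: {p[0]}"


def block_marker(y: str) -> str:
    return "[blocked]" if "[ADV]" in y else y


with_pre = run_case(("hello", b""), toy_attack, strip_marker, toy_model, block_marker)
without_pre = run_case(("hello", b""), toy_attack, no_defense, toy_model, block_marker)
```

The response returned by `run_case` is what the evaluation stage would then score on Harmfulness, Alignment, and Detail.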

System Modules

Data Generator

Create risk-related text-image pairs

Model or implementation: GPT-4o (text), PixArt-XL-2-1024-MS (image)

Attack Engine

Generate adversarial inputs

Model or implementation: 13 implemented methods (e.g., Visual-Adv, FigStep, HADES)

Defense Mechanism

Mitigate attacks

Model or implementation: 15 implemented methods (e.g., Llama-Guard, AdaShield)

Evaluator

Score response safety and utility

Model or implementation: Rule-based scoring / LLM Judge
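
One way to picture how a module such as the Attack Engine could expose many interchangeable methods is a name-to-callable registry. This is a minimal sketch under that assumption: `ATTACKS`, `register_attack`, and the two toy attacks are hypothetical, and the actual toolbox may organise its methods differently.

```python
from typing import Callable, Dict, Tuple

Prompt = Tuple[str, bytes]  # (T, I)
AttackFn = Callable[[Prompt], Prompt]

# Hypothetical registry mapping a method name to its implementation.
ATTACKS: Dict[str, AttackFn] = {}


def register_attack(name: str) -> Callable[[AttackFn], AttackFn]:
    """Decorator that adds an attack function to the registry under `name`."""
    def decorator(fn: AttackFn) -> AttackFn:
        if name in ATTACKS:
            raise ValueError(f"attack {name!r} is already registered")
        ATTACKS[name] = fn
        return fn
    return decorator


@register_attack("identity")
def identity_attack(prompt: Prompt) -> Prompt:
    # Baseline: leaves the prompt untouched.
    return prompt


@register_attack("uppercase-text")
def uppercase_attack(prompt: Prompt) -> Prompt:
    # Toy transformation standing in for a real attack method.
    text, image = prompt
    return text.upper(), image


def apply_attack(name: str, prompt: Prompt) -> Prompt:
    """Look up an attack by name and apply it to the prompt."""
    return ATTACKS[name](prompt)
```

A defense registry could mirror this shape, so that an experiment is specified by a pair of names rather than by bespoke glue code.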

Novel Architectural Elements

Unified modular API supporting plug-and-play for 13 attacks and 15 defenses
Three-dimensional scoring module integrating Harmfulness (1-10), Alignment (1-5), and Detail (1-5)
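
The three score ranges can be captured in a small validated record. The sketch below uses only the ranges stated above (Harmfulness 1-10, Alignment 1-5, Detail 1-5); the class name `SafetyScore` and the example values are hypothetical.

```python
from dataclasses import dataclass

# Score ranges as stated in the summary.
_RANGES = {"harmfulness": (1, 10), "alignment": (1, 5), "detail": (1, 5)}


@dataclass(frozen=True)
class SafetyScore:
    harmfulness: int  # severity of consequence, 1-10
    alignment: int    # how well the response answers the prompt, 1-5
    detail: int       # how actionable the information is, 1-5

    def __post_init__(self) -> None:
        # Reject any score outside its declared range.
        for field_name, (low, high) in _RANGES.items():
            value = getattr(self, field_name)
            if not low <= value <= high:
                raise ValueError(f"{field_name}={value} is outside [{low}, {high}]")


# Made-up example values: a detailed harmful answer vs. a vague one.
detailed = SafetyScore(harmfulness=9, alignment=5, detail=5)
vague = SafetyScore(harmfulness=3, alignment=2, detail=1)
```

A binary label would put both example responses in the same bucket if both were flagged, whereas the record keeps them apart on all three axes.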

Modeling

Base Model: Evaluation covers 18 models: 10 Open-Source (e.g., LLaVA-1.6, Qwen3-VL, GLM-4.1V) and 8 Closed-Source (e.g., GPT-4o, Gemini-2.5, Claude-3.5)

Comparison to Prior Work

vs. JailBreakV-28K: OmniSafeBench-MM adds defense evaluation and finer-grained risk categories (50 fine-grained categories vs. a more limited set)
vs. MM-SafetyBench: OmniSafeBench-MM includes consultative/imperative/declarative inquiry types, not just static prompts
vs. HADES: OmniSafeBench-MM integrates it as a module but adds 12 other attack types and defense capabilities
vs. MMJ-Bench: OmniSafeBench-MM uses a 3D metric (Harm, Alignment, Detail) rather than just ASR [not cited in paper as direct comparison, but functionally distinct]

Limitations

Relies on external models (e.g., GPT-4o) for ground-truth data generation, which may introduce their own biases
Evaluation of closed-source models is subject to API changes and versioning opacity
Computational cost of running all 13 attacks against 15 defenses for 18 models is high
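
To make the cost concern concrete, the size of the full attack-defense-model grid follows directly from the counts above. The sketch counts configurations only; each one would still be run over the whole prompt set, whose size is not given in this summary.

```python
from itertools import product

N_ATTACKS, N_DEFENSES, N_MODELS = 13, 15, 18

# Every (attack, defense, model) triple is one experimental configuration.
grid = list(product(range(N_ATTACKS), range(N_DEFENSES), range(N_MODELS)))
total_configurations = len(grid)  # 13 * 15 * 18 = 3510
```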

Reproducibility

Code: https://github.com/jiaxiaojunQAQ/OmniSafeBench-MM

📊 Experiments & Results

Evaluation Setup

Adversarial evaluation of MLLMs under white-box and black-box settings

Benchmarks:

OmniSafeBench-MM Dataset (Multimodal Jailbreak Attack) [New]

Metrics:

Harmfulness (H) [1-10]
Intent Alignment (A) [1-5]
Level of Detail (D) [1-5]
Jailbreak Success Event (J) [Binary based on H and D]
Statistical methodology: Not explicitly reported in the paper
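
The summary states only that the Jailbreak Success Event J is a binary derived from H and D; it does not give the rule. The sketch below assumes a simple threshold rule to show how J, and an Attack Success Rate computed as the mean of J, could be derived. The thresholds `H_MIN` and `D_MIN` and the toy scores are placeholders, not values from the paper.

```python
from typing import Iterable, Tuple

# ASSUMPTION: placeholder thresholds for illustration only; the paper's
# actual rule for J is not given in this summary.
H_MIN = 7
D_MIN = 3


def jailbreak_success(harmfulness: int, detail: int,
                      h_min: int = H_MIN, d_min: int = D_MIN) -> bool:
    """J = 1 when the response is both harmful enough and detailed enough."""
    return harmfulness >= h_min and detail >= d_min


def attack_success_rate(scores: Iterable[Tuple[int, int]]) -> float:
    """Fraction of (H, D) pairs that count as a jailbreak success event."""
    events = [jailbreak_success(h, d) for h, d in scores]
    if not events:
        raise ValueError("no scored responses")
    return sum(events) / len(events)


# Made-up (H, D) pairs, not experimental data.
toy_scores = [(9, 4), (8, 1), (2, 5), (7, 3)]
asr = attack_success_rate(toy_scores)  # 2 of 4 pairs pass both thresholds -> 0.5
```

Under this assumed rule, a harmful but vague response such as (8, 1) does not count as a success, which is the kind of partial outcome a purely harm-based binary label would miss.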

Key Results

No numeric results table is reproduced here. The paper reports extensive experiments and states that 18 models were evaluated, but the available excerpt covers the methodology and toolbox construction and does not include the per-model, per-attack result tables.

Experiment Figures

Comparison of OmniSafeBench-MM against prior benchmarks (JailBreakV-28K, MM-SafetyBench, HADES, MMJ-Bench).

Taxonomy of the dataset showing 9 major risk categories and 50 fine-grained subcategories.

Main Takeaways

Robustness to multimodal jailbreaks varies significantly across model architectures and attack modalities.
Trade-off identified: Some defenses reduce Harmfulness scores but degrade Intent Alignment (helpfulness), confirming the need for multi-dimensional metrics.
The dataset distinguishes between consultative, imperative, and declarative prompts, revealing that inquiry type influences attack success.
Attacks exploiting visual carriers (like text in images) remain a significant vulnerability for many MLLMs.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multimodal Large Language Models (MLLMs) architecture
Familiarity with adversarial attacks (gradient-based white-box vs. black-box)
Knowledge of safety alignment techniques (RLHF, guardrails)

Key Terms

MLLM: Multimodal Large Language Model, an AI system capable of processing and reasoning across multiple modalities such as text and images

Jailbreak Attack: Malicious inputs designed to bypass a model's safety alignment and induce harmful or prohibited outputs

ASR: Attack Success Rate—the percentage of adversarial attempts that successfully induce a harmful response

White-box Attack: Attacks that utilize access to the model's internal parameters (gradients, architecture) to optimize adversarial inputs

Black-box Attack: Attacks that only interact with the model via inputs/outputs without internal access, often using heuristics or transferability

Visual-Adv: A gradient-based white-box attack optimizing image pixels to bypass safety filters

FigStep: A black-box attack embedding harmful text instructions as typographic visual content (text-in-image) to evade text-based filters

HILF: High-Impact, Low-Frequency events—rare but catastrophic safety failures often missed by aggregate metrics

Intent Alignment: A metric measuring how well the model's response satisfies the user's request, regardless of safety

PixArt-XL-2-1024-MS: A specific text-to-image diffusion model used here to generate risk-related images