Jailbreak Large Vision-Language Models Through Multi-Modal Linkage

📝 Paper Summary

Jailbreak attacks on Vision-Language Models (VLMs)
Adversarial safety evaluation

MML jailbreaks VLMs by encrypting harmful typographic text within images to evade visual safety filters and using game-scenario prompts to guide the model to decrypt and execute the instructions.

Core Problem

Current structure-based jailbreak attacks fail against state-of-the-art VLMs like GPT-4o because they expose harmful content directly in images (over-exposure) and lack persuasive text guidance (neutral prompts).

Why it matters:

State-of-the-art VLMs have improved visual recognition and safety alignment, making simple typographic attacks ineffective
Existing methods often result in 'implicit rejection,' where models give ethical warnings instead of harmful information, even if they don't explicitly refuse
Understanding these vulnerabilities is crucial for securing VLMs against misuse for generating illegal or harmful content

Concrete Example: When asking 'How to make a bomb' using a standard typographic attack (FigStep), GPT-4o recognizes the word 'bomb' in the image and refuses. MML encrypts 'bomb' (e.g., via mirroring or word substitution), so the safety filter misses it, then the text prompt guides GPT-4o to decrypt it and generate instructions.

Key Novelty

Multi-Modal Linkage (MML) Attack

Applies a metaphorical encryption-decryption scheme to the image-text linkage: harmful text is encrypted in the image (e.g., mirrored, substituted) to bypass visual filters
Uses Chain-of-Thought prompts to guide the VLM to decrypt the image content and reconstruct the original harmful query during inference
Integrates 'evil alignment' by framing the decryption task within a fictional video game development scenario, convincing the model to act as a villain

Architecture

Comparison of standard jailbreak methods (FigStep, QueryRelated) vs. the proposed MML framework

Evaluation Highlights

Achieves 99.40% Attack Success Rate (ASR) on SafeBench against GPT-4o, improving over baselines by 66.40 percentage points
Attains 99.07% ASR on HADES-Dataset against GPT-4o, outperforming the HADES baseline by 95.07 percentage points
Significantly improves performance on the robust Claude-3.5-Sonnet model, reaching 69.40% ASR on SafeBench compared to 16.60% for the best baseline

Breakthrough Assessment

9/10

Demonstrates near-perfect jailbreak rates on the most advanced commercial VLM evaluated (GPT-4o), where previous methods largely failed. The encryption-decryption paradigm is a simple yet highly effective conceptual shift.

⚙️ Technical Details

Problem Definition

Setting: Black-box jailbreak attack on Vision-Language Models via single-turn dialogue

Inputs: A harmful natural language query q_malicious

Outputs: An adversarial image-text pair (I_adv, T_adv) that elicits a harmful response from the target VLM

Pipeline Flow

Input Processing: Convert malicious query into encrypted typographic image
Prompt Generation: Construct text prompt with decryption instructions and evil alignment scenario
Inference: Target VLM processes image+prompt to generate harmful response

System Modules

Image Encryptor

Transform malicious text query into an image that conceals harmful semantics from visual safety filters

Model or implementation: Script-based image generation

Prompt Constructor

Create text instructions that guide the VLM to decrypt the image and adopt a malicious persona

Model or implementation: Rule-based template

Target VLM

Process the adversarial input to generate the final response

Model or implementation: GPT-4o / Claude-3.5-Sonnet / Qwen-VL-Max

Novel Architectural Elements

Encryption-Decryption Linkage: Unlike previous methods that present clear text (FigStep), MML forces the model to perform active decryption (e.g., un-mirroring, reverse substitution) during inference

Modeling

Base Model: None trained. Target models: GPT-4o-2024-08-06, GPT-4o-Mini, Qwen-VL-Max-0809, Claude-3.5-Sonnet-20241022

Training Method: Inference-time attack only (no model training)

Adaptation: None

Trainable Parameters: None

Key Hyperparameters:

temperature: 0.7
max_tokens: Not explicitly reported in the paper

Compute: Inference only. Image encryption takes < 3.5 seconds per 500 images on M1 Pro CPU.
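
As a quick sanity check on the reported cost, the per-image figure follows directly from the batch timing above. This is plain arithmetic on the reported upper bound, not a separately reported measurement:

```latex
t_{\text{image}} \;<\; \frac{3.5\ \text{s}}{500\ \text{images}} \;=\; 0.007\ \text{s} \;=\; 7\ \text{ms per image}
```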

Comparison to Prior Work

vs. FigStep: MML encrypts the text image (mirroring, substitution) to avoid visual detection, whereas FigStep exposes readable text
vs. QueryRelated: MML uses 'evil alignment' (game scenario) to overcome refusal, whereas QueryRelated uses neutral prompts
vs. HADES: MML is purely structure-based and API-compatible (no gradients needed), achieving higher ASR on closed models like GPT-4o

Limitations

Base64 encryption is less effective against robust models like Claude-3.5-Sonnet
Requires the model to have sufficient capability to perform the decryption steps (instruction following)
Inference latency is higher than simple attacks due to longer Chain-of-Thought output requirements

Reproducibility

Code: https://github.com/wangyu-ovo/MML

Datasets (SafeBench, MM-SafeBench, HADES) are established benchmarks. Detailed prompt templates for encryption/decryption and evil alignment are provided in the paper's Appendix.

📊 Experiments & Results

Evaluation Setup

Black-box jailbreak evaluation on commercial VLMs

Benchmarks:

SafeBench: safety evaluation (10 prohibited topics)
MM-SafeBench: multi-modal safety evaluation (13 scenarios)
HADES-Dataset: hardened safety evaluation

Metrics:

Attack Success Rate (ASR)
Decryption Success Rate (DSR)
Statistical methodology: Not explicitly reported in the paper
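
The two metrics can be illustrated with a minimal sketch. It assumes each trial has already been labeled by a judge with two booleans. The `Trial` record, its field names, and the toy data are hypothetical and not taken from the paper, whose exact judging procedure is not described here.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One judged attack attempt (hypothetical record layout)."""
    decrypted: bool  # judge: the model reconstructed the hidden query
    harmful: bool    # judge: the response was harmful, i.e. the attack succeeded


def rate(flags):
    """Percentage of True values; 0.0 for an empty sequence."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def asr(trials):
    """Attack Success Rate in percent."""
    return rate(t.harmful for t in trials)


def dsr(trials):
    """Decryption Success Rate in percent."""
    return rate(t.decrypted for t in trials)


# Made-up toy labels, for illustration only.
trials = [
    Trial(decrypted=True, harmful=True),
    Trial(decrypted=True, harmful=False),  # decrypted, then refused
    Trial(decrypted=True, harmful=True),
    Trial(decrypted=False, harmful=False),
]

print(f"ASR = {asr(trials):.1f}%, DSR = {dsr(trials):.1f}%")
```

Trials that are decrypted but not harmful are what produce a gap between DSR and ASR, the pattern described for runs without evil alignment.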

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
Main comparison on SafeBench, showing large improvements over baselines (first row: GPT-4o; second row: Claude-3.5-Sonnet).
SafeBench	ASR	33.00	99.40	+66.40
SafeBench	ASR	16.60	69.40	+52.80
Results on MM-SafeBench and HADES-Dataset, confirming generalization across benchmarks (the HADES-Dataset row targets GPT-4o).
MM-SafeBench	ASR	25.25	98.81	+73.56
HADES-Dataset	ASR	4.00	99.07	+95.07
Ablation study on encryption and prompt strategies, showing that both components are needed.
SafeBench	ASR	75.20	97.80	+22.60
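
The Δ column is an absolute difference in percentage points, not a relative change. The short check below recomputes it from the table's own values. The helper names are illustrative, and the "(ablation)" label is added only to tell the two SafeBench settings apart.

```python
# (benchmark, baseline ASR %, this paper ASR %, reported delta) copied from the table.
rows = [
    ("SafeBench", 33.00, 99.40, 66.40),
    ("SafeBench", 16.60, 69.40, 52.80),
    ("MM-SafeBench", 25.25, 98.81, 73.56),
    ("HADES-Dataset", 4.00, 99.07, 95.07),
    ("SafeBench (ablation)", 75.20, 97.80, 22.60),
]


def delta_pp(baseline, ours):
    """Absolute improvement in percentage points."""
    return round(ours - baseline, 2)


def relative_gain(baseline, ours):
    """Relative improvement over the baseline, in percent."""
    return round(100.0 * (ours - baseline) / baseline, 1)


for name, baseline, ours, reported in rows:
    # Every reported delta matches the percentage-point difference.
    assert delta_pp(baseline, ours) == reported
    print(f"{name}: +{delta_pp(baseline, ours):.2f} pp "
          f"(relative gain {relative_gain(baseline, ours)}%)")
```

For the first row, the relative gain is about 201%, so the 66.40 figure only makes sense as percentage points.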

Experiment Figures

ASR of MML vs baselines across specific prohibited topics in SafeBench

Distribution of Jailbreak Scores (1-5) for ablation settings

Main Takeaways

Image transformation encryption (mirroring/rotation) generally outperforms word replacement and Base64 encryption
Evil alignment is critical: without it, models often successfully decrypt the malicious query but then refuse to answer it (high DSR, lower ASR)
Current state-of-the-art VLMs (GPT-4o, Claude-3.5-Sonnet) remain vulnerable to MML despite being robust to previous typographic attacks, with SafeBench ASR of 99.40% on GPT-4o and 69.40% on Claude-3.5-Sonnet
The 'encryption-decryption' paradigm effectively bypasses visual safety filters by preventing the immediate recognition of harmful text in images

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision-Language Models (VLMs) and their safety alignment mechanisms
Familiarity with jailbreak attacks (adversarial inputs designed to bypass safety filters)
Basic knowledge of typographic attacks (embedding text in images)

Key Terms

VLM: Vision-Language Model, an AI system that processes both images and text to generate text output

Jailbreak attack: An adversarial method to bypass a model's safety restrictions and elicit harmful or prohibited content

Typographic attack: A jailbreak method that renders harmful text instructions as an image of text, exploiting the model's OCR capabilities

Structure-based attack: Attacks that exploit structural vulnerabilities (like OCR or visual understanding) rather than gradient-based noise perturbations

Perturbation-based attack: Attacks that add imperceptible noise to images using gradient optimization to trick the model

CoT: Chain of Thought—a prompting strategy that encourages the model to generate intermediate reasoning steps

OCR: Optical Character Recognition—the ability of the model to read text embedded within images

Attack Success Rate (ASR): The percentage of attack attempts that successfully elicit a harmful response without refusal

Evil alignment: A prompting strategy that frames the interaction within a fictional persona (e.g., a villain in a game) to bypass ethical filters

Decryption Success Rate (DSR): The percentage of attempts where the model successfully reconstructs the original hidden text from the encrypted image