Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models

📝 Paper Summary

Jailbreak Attacks, Vision-Language Model Safety

CAMO is a black-box jailbreak framework that decomposes harmful instructions into benign visual-textual puzzles, forcing Vision-Language Models to reconstruct malicious intent through multi-step reasoning that bypasses safety filters.

Core Problem

Existing multimodal jailbreaks (adversarial noise or direct visual text) are either easily detected by safety filters (OCR, perplexity) or require gradient access that black-box settings do not provide.

Why it matters:

Current black-box attacks are computationally inefficient and often flagged by standard content moderation systems due to suspicious patterns
Gradient-based attacks cannot be applied to commercial closed-source APIs like GPT-4 or Claude
Defense mechanisms like perplexity filtering and OCR scanning have become effective at blocking isolated single-modality attacks

Concrete Example: A harmful request like 'How to make a bomb' is blocked by text filters. Visual attacks might write 'bomb' in an image, which OCR detectors catch. CAMO instead splits a keyword such as 'explosive' into a partial text mask ('___losive') and a visual math puzzle (7+6=13 -> 'e') that yields the missing characters. Each part appears benign on its own, but the two combine to form the harmful word.

Key Novelty

Cross-modal Adversarial Multimodal Obfuscation (CAMO)

Decomposes harmful keywords into distributed clues: partial text masks (e.g., '___losive') and visual math puzzles mapping numbers to missing characters
Exploits the 'Cola and Mentos' principle: components are harmless in isolation (evading filters) but dangerous when combined via the model's own reasoning
Introduces a dynamic coarse-to-fine difficulty adjustment mechanism that balances masking depth with the model's ability to solve the puzzle

Architecture

The CAMO framework workflow, illustrating the decomposition of harmful prompts into visual and textual clues and their subsequent reconstruction by the LVLM.

Evaluation Highlights

Achieves 96.97% Attack Success Rate (ASR) on Qwen2-VL-72B-Instruct and 81.82% on GPT-4.1-nano in text-only settings
Bypasses three major defense mechanisms (Perplexity-based filters, OCR keyword detection, OpenAI moderation) with a 100% evasion rate
Outperforms baseline methods (AP, DRA, PAPs) by approximately 20-30 percentage points on GPT-4o-mini in text-only settings

Breakthrough Assessment

8/10

Significant advancement in black-box jailbreaking by exploiting reasoning capabilities rather than just sensory input. The 100% evasion rate against standard defenses highlights a major vulnerability in current alignment strategies.

⚙️ Technical Details

Problem Definition

Setting: Black-box adversarial attack on Large Vision-Language Models (LVLMs)

Inputs: Harmful instruction text T

Outputs: Adversarial multimodal prompt (Text T', Image I') designed to elicit harmful response
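The setting can be written compactly as follows. This is a sketch only: the symbols $f$ (the attack transformation), $M$ (the target LVLM, queried as a black box), $J$ (the judge), and $N$ (the number of evaluated instructions) are notation introduced here for illustration and are not taken from the paper.

```latex
\[
(T', I') = f(T), \qquad y = M(T', I'), \qquad
\text{the attack succeeds if } J(T, y) = 1 .
\]
\[
\mathrm{ASR} \;=\; \frac{100\%}{N} \sum_{i=1}^{N}
\mathbf{1}\!\left[\, J\bigl(T_i,\; M(T'_i, I'_i)\bigr) = 1 \,\right]
\]
```

Under this notation the attacker never needs gradients of $M$; only its responses $y$ are observed.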

Pipeline Flow

Target Keyword Selection
Cross-modal Decomposition
Obfuscated Query Construction
Reasoning Complexity Control

System Modules

Target Keyword Selection

Identify sensitive keywords using POS tagging and a domain-specific dictionary

Cross-modal Decomposition

Transform keywords into text masks and visual math puzzles

Obfuscated Query Construction

Combine text template with math questions and the visual clue image

Reasoning Complexity Control

Dynamically adjust masking difficulty to balance stealth and success

Novel Architectural Elements

Distributed semantic reconstruction mechanism: splitting semantic units (words) across modalities so neither holds the full harmful meaning independently
Dynamic difficulty adjustment loop (r, k) specifically designed for cross-modal puzzle complexity rather than gradient-based perturbation

Modeling

Base Model: Evaluated on GPT-4o, GPT-4o-mini, GPT-4.1-nano, Qwen2-VL-72B-Instruct, Qwen2.5-VL-72B-Instruct

Comparison to Prior Work

vs. HADES/FigStep: CAMO does not put complete harmful words in the image (evading OCR), instead using abstract clues requiring reasoning
vs. Jailbreak_in_Pieces: CAMO is black-box and does not require gradient access to the vision encoder
vs. AP/PAPs: CAMO leverages multimodal reasoning, making it robust against text-only filters and more efficient (single-turn)
vs. Visual Adversarial Examples [not cited in paper]: CAMO relies on semantic reasoning attacks rather than pixel-level adversarial noise perturbation

Limitations

Requires the model to have sufficient reasoning capability to solve the math/OCR puzzles
Dictionary-based keyword extraction may miss context-dependent harmful terms
Single-turn focus; does not explore multi-turn conversational persistence

Reproducibility

The paper does not provide code. The datasets are AdvBench and AdvBench-M. Evaluation uses a GPT-4o-based judge. The dictionary of sensitive terms is described but not fully listed.

📊 Experiments & Results

Evaluation Setup

Black-box jailbreak evaluation against LVLMs using a system-level judge

Benchmarks:

AdvBench (Text-based harmful instruction generation)
AdvBench-M (multimodal harmful instruction generation, 8 categories)

Metrics:

Attack Success Rate (ASR)
Evasion Rate (against specific defenses)
Statistical methodology: Not explicitly reported in the paper
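Both metrics are simple proportions, so they can be computed as in the minimal sketch below. The function name `rate_percent` and the verdict lists are hypothetical. The sample size of 33 is an assumption chosen only so that the arithmetic is easy to follow; the summary does not state how many prompts were evaluated.

```python
from typing import Sequence


def rate_percent(flags: Sequence[bool]) -> float:
    """Return the percentage of True entries, rounded to two decimals."""
    if not flags:
        raise ValueError("need at least one trial")
    return round(100.0 * sum(flags) / len(flags), 2)


# Hypothetical judge verdicts: True means the judge labeled the response harmful.
judge_verdicts = [True] * 32 + [False] * 1
# Hypothetical defense outcomes: True means the prompt was NOT flagged.
passed_defense = [True] * 33

asr = rate_percent(judge_verdicts)      # Attack Success Rate
evasion = rate_percent(passed_defense)  # Evasion Rate
print(f"ASR: {asr}%  evasion rate: {evasion}%")
```

With these made-up counts, 32 successes out of 33 trials gives an ASR of 96.97%, which shows how strongly a single trial moves the figure at small sample sizes.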

Key Results

Benchmark (model)	Metric	Baseline	This Paper	Δ
AdvBench/AdvBench-M (GPT-4o-mini, text-only)	Attack Success Rate (ASR)	See Note	See Note	approx. +20-30 percentage points
AdvBench-M (Qwen2-VL-72B-Instruct)	Attack Success Rate (ASR)	Not reported in the paper	96.97%	Not reported in the paper
AdvBench-M (GPT-4.1-nano)	Attack Success Rate (ASR)	Not reported in the paper	81.82%	Not reported in the paper
Defense Suite (Perplexity, OCR, Moderation)	Evasion Rate	Not reported in the paper	100%	Not reported in the paper

Experiment Figures

Comparison of single-modality attacks vs. CAMO against defense mechanisms.

Main Takeaways

CAMO consistently achieves higher Attack Success Rates (ASR) than baseline text and visual attacks across both open-source and proprietary models.
The method demonstrates strong cross-model transferability, effective on both GPT-4 variants and Qwen models.
The 'Cola and Mentos' principle works: distributing semantics across modalities completely evades current defensive filters (OCR, Perplexity, Moderation).
Fine-grained masking (partial word masking) allows reconstruction while bypassing keyword filters, confirming the vulnerability of subword-level embeddings.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Vision-Language Models (LVLMs)
Familiarity with jailbreak attacks (adversarial prompts)
Basic knowledge of OCR and text masking techniques

Key Terms

LVLM: Large Vision-Language Model, an AI model capable of processing and reasoning over both text and images

Jailbreak: An adversarial attack designed to bypass a model's safety guardrails so that it generates restricted or harmful content

OCR: Optical Character Recognition—technology used to extract text from images, often used as a defense mechanism to scan user uploads for harmful words

Perplexity filter: A defense mechanism that blocks inputs with unusual statistical properties (gibberish or high randomness), often characteristic of adversarial attacks

POS tagging: Part-of-Speech tagging—identifying grammatical categories of words (nouns, verbs) to target specific keywords for obfuscation

ASR: Attack Success Rate—the percentage of adversarial prompts that successfully trigger a harmful response from the model
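To make the two filter-style defenses in the glossary concrete, here is a toy sketch from the defender's side. It is not the defense suite used in the paper. `keyword_flag` assumes the OCR step has already produced text, and `char_perplexity` substitutes a character-level unigram model with add-one smoothing for the language model a real perplexity filter would use. All names, the placeholder blocklist term, and the reference corpus are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def keyword_flag(extracted_text: str, blocklist: set[str]) -> bool:
    """Flag text (e.g. OCR output) that contains a blocklisted word as a whole token."""
    tokens = re.findall(r"[a-z]+", extracted_text.lower())
    return any(tok in blocklist for tok in tokens)


def char_perplexity(text: str, reference: str) -> float:
    """Perplexity of `text` under an add-one-smoothed character unigram
    model estimated from `reference` (a toy stand-in for a real LM)."""
    if not text:
        raise ValueError("text must be non-empty")
    counts = Counter(reference)
    vocab = set(reference) | set(text)
    total = len(reference) + len(vocab)  # add-one smoothing denominator
    log_prob = sum(math.log((counts[ch] + 1) / total) for ch in text)
    return math.exp(-log_prob / len(text))


BLOCKLIST = {"forbidden"}  # placeholder term, not from the paper
reference = "the quick brown fox jumps over the lazy dog " * 20
natural = "the dog jumps over the fox"
noisy = "zq#xj@@vk!!qz##"

print(keyword_flag("this word is forbidden", BLOCKLIST))  # whole word present
print(keyword_flag("this word is ___bidden", BLOCKLIST))  # only a fragment present
print(char_perplexity(natural, reference) < char_perplexity(noisy, reference))
```

The keyword scan matches whole tokens only, so a partially masked word does not trigger it, while the noisy string scores a much higher perplexity than the natural sentence. A deployed filter would flag inputs above a tuned threshold, which this sketch leaves out.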