← Back to Paper List

Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models

L Jiang, Z Zhang, Z Wang, X Sun, Z Li, L Zhen, X Xu
University of Science and Technology of China, The Chinese University of Hong Kong, Shenzhen, Institute of High Performance Computing, A*STAR, Singapore
arXiv, 6/2025 (2025)
MM Reasoning Benchmark

📝 Paper Summary

Jailbreak Attacks Vision-Language Model Safety
CAMO is a black-box jailbreak framework that decomposes harmful instructions into benign visual-textual puzzles, forcing Vision-Language Models to reconstruct malicious intent through multi-step reasoning that bypasses safety filters.
Core Problem
Existing multimodal jailbreaks (adversarial noise or direct visual text) are easily detected by safety filters (OCR, perplexity) or require unavailable gradient access.
Why it matters:
  • Current black-box attacks are computationally inefficient and often flagged by standard content moderation systems due to suspicious patterns
  • Gradient-based attacks cannot be applied to commercial closed-source APIs like GPT-4 or Claude
  • Defense mechanisms like perplexity filtering and OCR scanning have become effective at blocking isolated single-modality attacks
Concrete Example: A harmful request like 'How to make a bomb' is blocked by text filters. Visual attacks might write 'bomb' in an image, which OCR detectors catch. CAMO splits 'explosive' into text '___losive' and a visual math puzzle (7+6=13 -> 'e'), appearing benign individually but combining to form the harmful word.
Key Novelty
Cross-modal Adversarial Multimodal Obfuscation (CAMO)
  • Decomposes harmful keywords into distributed clues: partial text masks (e.g., '___losive') and visual math puzzles mapping numbers to missing characters
  • Exploits the 'Cola and Mentos' principle: components are harmless in isolation (evading filters) but dangerous when combined via the model's own reasoning
  • Introduces a dynamic coarse-to-fine difficulty adjustment mechanism that balances masking depth with the model's ability to solve the puzzle
Architecture
Architecture Figure Figure 2
The CAMO framework workflow, illustrating the decomposition of harmful prompts into visual and textual clues and their subsequent reconstruction by the LVLM.
Evaluation Highlights
  • Achieves 96.97% Attack Success Rate (ASR) on Qwen2-VL-72B-Instruct and 81.82% on GPT-4.1-nano in text-only settings
  • Bypasses three major defense mechanisms (Perplexity-based filters, OCR keyword detection, OpenAI moderation) with a 100% evasion rate
  • Outperforms baseline methods (AP, DRA, PAPs) by approximately 20-30 percentage points on GPT-4o-mini in text-only settings
Breakthrough Assessment
8/10
Significant advancement in black-box jailbreaking by exploiting reasoning capabilities rather than just sensory input. The 100% evasion rate against standard defenses highlights a major vulnerability in current alignment strategies.
×