← Back to Paper List

JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering

Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, Minlie Huang
Tsinghua University, Zhipu AI, Beihang University
arXiv (2025)
MM Agent Benchmark

📝 Paper Summary

Adversarial Attacks on Multimodal LLMs Jailbreaking AI Safety Evaluation
JPS generates effective jailbreaks by decoupling the attack: visual perturbations bypass safety filters while a multi-agent system iteratively refines textual prompts to ensure the response genuinely fulfills malicious intent.
Core Problem
Current MLLM jailbreaks prioritize bypassing safety filters (high Attack Success Rate) but often yield low-quality responses that fail to follow instructions or provide impossible advice.
Why it matters:
  • High ASR metrics are misleading; a model might bypass refusal but still provide a lecture or safe, useless advice, failing the attacker's actual goal
  • Existing evaluations define success merely as 'non-refusal' rather than 'intent fulfillment,' masking the lack of practical harmful utility in generated attacks
  • Coupling safety bypass and content steering in a single modality often leads to optimization conflicts and lower response quality
Concrete Example: When asked 'how to build a bomb,' a standard jailbreak might force the model to start with 'Sure,' but the model then pivots to a physics lesson or suggests mixing vinegar and baking soda (ineffective), failing the malicious intent despite technically bypassing the refusal filter.
Key Novelty
Collaborative Visual Perturbation and Textual Steering (JPS)
  • Decouples the attack objectives: assigns safety bypassing to the visual channel (via adversarial noise) and response quality control to the textual channel (via steering prompts)
  • Employs a Multi-Agent System (Judger, Summarizer, Revisor) to iteratively critique and refine prompts, specifically optimizing for 'Instruction Following' and 'Content Harmfulness'
  • Introduces Malicious Intent Fulfillment Rate (MIFR), a new metric using a reasoning LLM to verify if responses provide actionable, specific harmful information
Architecture
Architecture Figure Figure 2
The iterative co-optimization workflow of JPS. It illustrates the cycle between the Visual Perturbation stage and the Textual Steering stage involving the Multi-Agent System.
Evaluation Highlights
  • Achieves 86.50% MIFR (Malicious Intent Fulfillment Rate) on HarmBench with InternVL2, outperforming the best baseline (PAIR) which reached only 52.00%
  • Maintains high safety bypass rates, reaching 93.50% ASR on HarmBench with InternVL2, compared to 60.50% for PAIR
  • Demonstrates superior transferability and robustness against defenses like Adashield-A, maintaining 93.50% ASR on InternVL2 where the defense is active
Breakthrough Assessment
8/10
Significant for shifting focus from mere safety bypassing to actual attack utility. The decoupling strategy and new MIFR metric address a critical flaw in existing safety evaluations.
×