JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering

📝 Paper Summary

Adversarial Attacks on Multimodal LLMs | Jailbreaking | AI Safety Evaluation

JPS generates effective jailbreaks by decoupling the attack: visual perturbations bypass safety filters while a multi-agent system iteratively refines textual prompts to ensure the response genuinely fulfills malicious intent.

Core Problem

Current MLLM jailbreaks prioritize bypassing safety filters (high Attack Success Rate) but often yield low-quality responses that fail to follow instructions or provide impossible advice.

Why it matters:

High ASR metrics are misleading; a model might bypass refusal but still provide a lecture or safe, useless advice, failing the attacker's actual goal
Existing evaluations define success merely as 'non-refusal' rather than 'intent fulfillment,' masking the lack of practical harmful utility in generated attacks
Coupling safety bypass and content steering in a single modality often leads to optimization conflicts and lower response quality

Concrete Example: When asked 'how to build a bomb,' a standard jailbreak might force the model to start with 'Sure,' but the model then pivots to a physics lesson or suggests mixing vinegar and baking soda (ineffective), failing the malicious intent despite technically bypassing the refusal filter.

Key Novelty

Collaborative Visual Perturbation and Textual Steering (JPS)

Decouples the attack objectives: assigns safety bypassing to the visual channel (via adversarial noise) and response quality control to the textual channel (via steering prompts)
Employs a Multi-Agent System (Judger, Summarizer, Revisor) to iteratively critique and refine prompts, specifically optimizing for 'Instruction Following' and 'Content Harmfulness'
Introduces Malicious Intent Fulfillment Rate (MIFR), a new metric using a reasoning LLM to verify if responses provide actionable, specific harmful information

Architecture

The iterative co-optimization workflow of JPS. It illustrates the cycle between the Visual Perturbation stage and the Textual Steering stage involving the Multi-Agent System.

Evaluation Highlights

Achieves 86.50% MIFR (Malicious Intent Fulfillment Rate) on HarmBench with InternVL2, outperforming the best baseline (PAIR), which reached only 52.00%
Maintains high safety bypass rates, reaching 93.50% ASR on HarmBench with InternVL2, compared to 60.50% for PAIR
Demonstrates superior transferability and robustness against defenses such as AdaShield-A, maintaining 93.50% ASR on InternVL2 with the defense active

Breakthrough Assessment

8/10

Significant for shifting focus from mere safety bypassing to actual attack utility. The decoupling strategy and new MIFR metric address a critical flaw in existing safety evaluations.

⚙️ Technical Details

Problem Definition

Setting: Adversarial generation of an image I_adv and text prompt T_steer to induce a target MLLM to generate a harmful response Y_harm

Inputs: Clean image I, Harmful query Q_harm

Outputs: Adversarial Image I_adv, Steering Prompt T_steer

Pipeline Flow

Initialization (Generate initial adversarial image)
Response Generation (Target MLLM generates response)
Textual Steering (MAS refines prompts based on response quality)
Visual Perturbation (Update image using new prompts)
Loop until convergence or max iterations

System Modules

Visual Perturbation

Generate adversarial noise to bypass safety filters

Model or implementation: PGD with Momentum

Judger Agent (Textual Steering, MAS)

Evaluate current responses for instruction following and harmfulness

Model or implementation: Qwen2.5-14B-Instruct

Summarizer Agent (Textual Steering, MAS)

Aggregate critiques to identify common failure modes

Model or implementation: Qwen2.5-14B-Instruct

Revisor Agent (Textual Steering, MAS)

Rewrite the steering prompt based on aggregated insights

Model or implementation: Qwen2.5-14B-Instruct

Novel Architectural Elements

Decoupled optimization loop: Visual perturbations target safety bypass (via target prefix), while textual prompts target utility (via multi-agent critique)
Multi-Agent System (MAS) specifically designed to optimize 'Steering Prompts' rather than the attack query itself

Modeling

Base Model (target MLLMs): InternVL2-8B, Qwen2-VL-7B-Instruct, MiniGPT-4 (Vicuna-13B)

Training Method: Adversarial Optimization (Inference-time)

Objective Functions:

Purpose: Optimize visual perturbation to maximize probability of target prefix.

Formally: min L(I_adv) = sum -log p(y_target | I_adv, T)

Purpose: Constrain visual perturbation to be imperceptible.

Formally: || I_adv - I ||_inf <= epsilon
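
As a rough illustration of the two expressions above, the sketch below evaluates a summed negative log-likelihood over a target prefix and checks the L-infinity bound. The per-token probabilities and pixel values are made-up toy numbers, and the helper names are hypothetical; none of this is taken from the JPS code.

```python
import math


def target_prefix_nll(token_probs):
    """Sum of -log p over the tokens of the target prefix."""
    return sum(-math.log(p) for p in token_probs)


def within_linf_ball(adv, clean, epsilon, tol=1e-12):
    """True if every value of adv differs from clean by at most epsilon."""
    return max(abs(a - c) for a, c in zip(adv, clean)) <= epsilon + tol


# Toy probabilities that a model might assign to three prefix tokens.
probs = [0.5, 0.25, 0.5]
loss = target_prefix_nll(probs)  # equals ln(16)

# Toy "pixels" in [0, 1]; epsilon matches the 32/255 budget listed below.
epsilon = 32 / 255
clean = [0.2, 0.5, 0.9]
inside = [0.2 + 32 / 255, 0.5 - 10 / 255, 0.9]
outside = [0.2 + 40 / 255, 0.5, 0.9]

print(round(loss, 4))
print(within_linf_ball(inside, clean, epsilon))
print(within_linf_ball(outside, clean, epsilon))
```

A lower loss means the target prefix is more likely under the model; the bound limits how far any single pixel may move from the clean image.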

Key Hyperparameters:

max_perturbation_epsilon: 32/255
step_size_alpha: 1/255
momentum_coefficient: 0.9
iterations_K: 5
optimization_loss_threshold: 0.01
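
The listed values map onto a momentum-based projected gradient loop roughly as follows. This is a minimal sketch on a toy quadratic loss, not the authors' implementation: the L1 gradient normalization (in the style of MI-FGSM), the `max_steps` cap, the toy target vector, and all function names are assumptions made for illustration.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def momentum_pgd(clean, grad_fn, loss_fn, epsilon=32 / 255, alpha=1 / 255,
                 mu=0.9, max_steps=100, loss_threshold=0.01):
    """Minimize loss_fn inside an L-infinity ball of radius epsilon around clean."""
    adv = list(clean)
    velocity = [0.0] * len(clean)
    for _ in range(max_steps):
        if loss_fn(adv) <= loss_threshold:  # early stop once the loss is small
            break
        grad = grad_fn(adv)
        l1 = sum(abs(g) for g in grad) or 1.0  # avoid division by zero
        # Accumulate the normalized gradient into the momentum buffer.
        velocity = [mu * v + g / l1 for v, g in zip(velocity, grad)]
        # Signed descent step of size alpha.
        adv = [a - alpha * sign(v) for a, v in zip(adv, velocity)]
        # Project back into the epsilon ball, then into the valid range [0, 1].
        adv = [min(max(a, c - epsilon), c + epsilon) for a, c in zip(adv, clean)]
        adv = [min(max(a, 0.0), 1.0) for a in adv]
    return adv


# Toy objective: squared distance to an arbitrary target vector.
TARGET = [0.9, 0.45]


def toy_loss(x):
    return sum((xi - ti) ** 2 for xi, ti in zip(x, TARGET))


def toy_grad(x):
    return [2 * (xi - ti) for xi, ti in zip(x, TARGET)]


clean = [0.5, 0.5]
adv = momentum_pgd(clean, toy_grad, toy_loss)
print(toy_loss(clean), toy_loss(adv))
```

In this toy run the first coordinate cannot reach its target because the epsilon budget caps it, which shows how the projection trades loss reduction against perturbation size.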

Compute: Visual perturbation converges in ~20 optimization steps

Comparison to Prior Work

vs. PAIR: JPS adds visual perturbation for stronger bypass; PAIR is text-only
vs. VAJM: JPS adds textual steering to ensure response quality; VAJM often yields vague/safe responses
vs. UMK/BAP: JPS explicitly decouples safety (image) and utility (text) and optimizes utility using a MAS; UMK/BAP often suffer from high ASR but low MIFR due to coupled objectives
vs. FigStep [not cited in paper]: JPS uses noise perturbation, FigStep uses typographic attacks; JPS focuses on response utility verification

Limitations

Requires white-box access for visual perturbation gradient computation (though text part is black-box compatible)
Adversarial images may be filtered by defense mechanisms (though tested robust against some)
Performance on MiniGPT-4 against AdaShield-A drops significantly due to its simple fusion architecture
Qwen2-VL performance degrades slightly after too many iterations (overfitting)

Reproducibility

Code: https://github.com/thu-coai/JPS

Optimization uses an AdvBench subset. MAS agents use Qwen2.5-14B-Instruct. The evaluator uses QwQ-32B.

📊 Experiments & Results

Evaluation Setup

Jailbreaking MLLMs on harmful queries

Benchmarks:

MM-SafetyBench (13 forbidden scenarios, 1,680 queries)
HarmBench (200 standard unsafe behaviors)

Metrics:

Attack Success Rate (ASR)
Malicious Intent Fulfillment Rate (MIFR)
Statistical methodology: Not explicitly reported in the paper
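
To make the difference between the two metrics concrete, the sketch below computes both from per-response judge labels. The record layout and function names are hypothetical, the labels are toy data, and counting a response toward MIFR only when it is also a non-refusal is an assumption rather than the paper's stated rule.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    refused: bool          # judge label: the model declined to answer
    fulfills_intent: bool  # judge label: the answer is specific and actionable


def attack_success_rate(results):
    """Percentage of responses that are not refusals."""
    return 100.0 * sum(not r.refused for r in results) / len(results)


def intent_fulfillment_rate(results):
    """Percentage of responses that are non-refusals and fulfill the intent."""
    hits = sum((not r.refused) and r.fulfills_intent for r in results)
    return 100.0 * hits / len(results)


# Toy labels: one refusal, one evasive non-refusal, two fulfilling answers.
results = [
    JudgedResponse(refused=True, fulfills_intent=False),
    JudgedResponse(refused=False, fulfills_intent=False),
    JudgedResponse(refused=False, fulfills_intent=True),
    JudgedResponse(refused=False, fulfills_intent=True),
]

asr = attack_success_rate(results)
mifr = intent_fulfillment_rate(results)
print(asr, mifr)  # 75.0 50.0
```

The evasive non-refusal raises ASR without raising MIFR, which is the gap the second metric is meant to expose.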

Key Results

JPS consistently outperforms baselines on HarmBench, particularly in the MIFR metric, indicating higher utility responses.
Benchmark	Metric	Baseline	This Paper	Δ
HarmBench	MIFR	52.00	86.50	+34.50
HarmBench	ASR	60.50	93.50	+33.00
HarmBench	MIFR	74.00	83.00	+9.00

Ablation studies confirm the necessity of both visual and textual components.
Benchmark	Metric	Full JPS	Ablated	Δ
HarmBench	ASR	93.50	18.50	-75.00
HarmBench	MIFR	86.50	74.00	-12.50
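
As a quick sanity check on the Δ column, the snippet below recomputes each difference as the fourth column minus the third. The row values are copied from the table; the tuple layout and function name are illustrative only.

```python
# (benchmark, metric, third column, fourth column, reported delta)
rows = [
    ("HarmBench", "MIFR", 52.00, 86.50, 34.50),
    ("HarmBench", "ASR", 60.50, 93.50, 33.00),
    ("HarmBench", "MIFR", 74.00, 83.00, 9.00),
    ("HarmBench", "ASR", 93.50, 18.50, -75.00),
    ("HarmBench", "MIFR", 86.50, 74.00, -12.50),
]


def recompute_deltas(table):
    """Return fourth column minus third column for every row."""
    return [round(new - base, 2) for _, _, base, new, _ in table]


recomputed = recompute_deltas(rows)
reported = [delta for *_, delta in rows]
print(recomputed == reported)
```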

Experiment Figures

Analysis of target-guided optimization efficacy. Shows loss convergence and prefix match ratio.

Main Takeaways

Baselines show a gap between ASR and MIFR (e.g., UMK has 86% ASR but only 78% MIFR), suggesting that non-refusal-based evaluations overestimate attack quality
Visual perturbations are crucial for safety bypass (ASR), while textual steering is crucial for response quality (MIFR)
Iterative co-optimization generally improves performance up to round 5, though some models (Qwen2-VL) may overfit if pushed too far
JPS is robust against prompt-based defenses (AdaShield-A, ECSO) on advanced models like InternVL2

📚 Prerequisite Knowledge

Prerequisites

Adversarial Examples (PGD)
Multimodal Large Language Models (MLLMs)
Jailbreak Attacks
Multi-Agent Systems

Key Terms

MLLM: Multimodal Large Language Model—AI models capable of processing both text and images

Jailbreak: A technique to bypass the safety mechanisms of an AI model to elicit forbidden or harmful responses

ASR: Attack Success Rate—the percentage of attempts where the model does not refuse the harmful query

MIFR: Malicious Intent Fulfillment Rate—a new metric measuring the percentage of responses that provide actionable, specific information fulfilling the attacker's harmful goal

PGD: Projected Gradient Descent—an iterative method for finding adversarial perturbations by following the gradient of the loss function

MAS: Multi-Agent System—a system where multiple autonomous agents (LLMs in this case) interact to solve a problem

Steering Prompt: A textual instruction optimized to guide the model's response style and content, distinct from the harmful query itself