Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

📝 Paper Summary

Multimodal LLM Safety, Jailbreak Defense, Adversarial Robustness

ECSO protects multimodal LLMs from visual jailbreaks by detecting unsafe responses and then replacing the image with a query-aware text caption to reactivate the underlying LLM's safety mechanisms.

Core Problem

Multimodal LLMs (MLLMs) are highly vulnerable to jailbreak attacks where malicious images bypass the safety alignment of the underlying LLM, effectively suppressing its built-in safeguards.

Why it matters:

Visual inputs can easily induce MLLMs to generate unethical or harmful content (e.g., hate speech, illegal acts) despite the underlying LLM being safe.
Existing defenses like red-teaming are labor-intensive and struggle to cover the infinite space of visual attacks.
Current safety mechanisms in pre-aligned LLMs are inadvertently suppressed by the introduction of image features.

Concrete Example: When shown an image containing text instructions for making a bomb (OCR attack), an MLLM often complies and generates the instructions. However, if the image is removed and only the text is provided, the model refuses. The visual modality bypasses the refusal mechanism.

Key Novelty

Eyes Closed, Safety On (ECSO)

Leverages the insight that MLLMs can accurately *detect* unsafe content even if they *generate* it, and that removing images restores safety.
Uses a post-hoc loop: if the MLLM flags its own initial response as unsafe, ECSO converts the image to a text caption (conditioned on the query) and re-queries the model without the image.
Restores the strong textual safety alignment of the base LLM by effectively 'closing the eyes' (removing visual input) when danger is detected.

Architecture

The complete ECSO inference pipeline.

Evaluation Highlights

+58.6 percentage-point harmless rate improvement (from 31.7% to 90.3%) for LLaVA-1.5-7B on MM-SafetyBench (OCR subset).
+71.3 percentage-point harmless rate improvement (from 19.3% to 90.6%) on the VLSafe dataset for LLaVA-1.5-7B compared to direct prompting.
Maintains utility on standard benchmarks (MME, MM-Vet), effectively balancing safety and performance without retraining.

Breakthrough Assessment

7/10

Simple yet highly effective training-free defense that exploits intrinsic model properties. Significant safety gains with minimal utility cost, though it relies on the base LLM's text-only safety.

⚙️ Technical Details

Problem Definition

Setting: Multimodal dialogue where an adversary provides an image v and query x to induce a harmful response y.

Inputs: Image v, Text query x

Outputs: Safe response y (either the initial response or a regenerated one)

Pipeline Flow

Initial Generation: MLLM generates response y to (v, x)
Harm Detection: MLLM self-evaluates safety of y
Conditional Branch: If safe -> Return y; If unsafe -> Proceed to transform
I2T Transformation: MLLM generates query-aware caption c for image v given query x
Safe Generation: MLLM generates final response using (c, x) without image v
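The five steps above can be sketched as a single function. This is a minimal illustration, not the authors' implementation: the `MLLM` callable interface, the `ECSOResult` container, and all three prompt strings are hypothetical, and running the harm check on the response text alone is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical interface: any callable (prompt, image) -> text.
# image=None stands for a text-only ("eyes closed") call.
MLLM = Callable[[str, Optional[bytes]], str]

# Illustrative prompt wording only; the paper's actual prompts are not reproduced here.
DETECT_PROMPT = (
    "Is the following response harmful or unsafe? "
    "Answer with one word, 'safe' or 'unsafe'.\n\nResponse: {response}"
)
CAPTION_PROMPT = (
    "Describe the image, focusing on details needed to answer this request: {query}"
)
ANSWER_PROMPT = "Image description: {caption}\n\nRequest: {query}"


@dataclass
class ECSOResult:
    response: str
    flagged_unsafe: bool
    caption: Optional[str] = None


def ecso(mllm: MLLM, image: bytes, query: str) -> ECSOResult:
    # 1. Initial generation from (v, x).
    y = mllm(query, image)

    # 2. Harm detection: the same model judges its own response.
    #    Assumption: the check sees only the response text.
    verdict = mllm(DETECT_PROMPT.format(response=y), None)
    if not verdict.strip().lower().startswith("unsafe"):
        # 3a. Judged safe: return the initial response unchanged.
        return ECSOResult(response=y, flagged_unsafe=False)

    # 3b/4. Judged unsafe: query-aware image-to-text transformation.
    c = mllm(CAPTION_PROMPT.format(query=query), image)

    # 5. Regenerate from (c, x) with the image withheld.
    y_safe = mllm(ANSWER_PROMPT.format(caption=c, query=query), None)
    return ECSOResult(response=y_safe, flagged_unsafe=True, caption=c)
```

Passing the model as a plain callable keeps the sketch independent of any particular MLLM library; a real wrapper would adapt its own generate call to this signature.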

System Modules

Initial Generator (Generation)

Generate a preliminary response to the user query and image.

Model or implementation: Target MLLM (e.g., LLaVA-1.5-7B)

Safety Discriminator

Assess whether the initial response contains harmful content.

Model or implementation: Target MLLM (Self-evaluation)

I2T Transformer

Convert the image into a text caption relevant to the specific query, preserving necessary context.

Model or implementation: Target MLLM

Safe Generator (Generation)

Generate the final response using the text caption instead of the image, activating LLM safety.

Model or implementation: Target MLLM (operating as text-only LLM)

Novel Architectural Elements

Query-aware I2T feedback loop: Dynamically converting images to text conditioned on the specific user query only when a safety violation is detected.
Eyes-closed regeneration: Explicitly removing the visual modality during the second pass to force the model to rely on its text-based safety alignment.

Modeling

Base Model: Evaluated on LLaVA-1.5-7B, ShareGPT4V-7B, mPLUG-OWL2-7B, Qwen-VL-Chat, InternLM-XComposer

Training Method: Training-free inference-time intervention (except for the Data Engine experiment which uses SFT)

Compute: Inference only. Every query costs one extra generation pass for the safety check; queries flagged as unsafe cost two further passes (captioning and regeneration), for four passes in total instead of one.
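As a rough cost model for the pass counts above, the expected number of generation passes per query can be written as a function of the fraction of queries flagged unsafe. The function name and the assumption that every pass counts equally (ignoring differences in output length) are illustrative, not from the paper.

```python
def expected_mllm_calls(p_flagged: float) -> float:
    """Expected generation passes per query under ECSO.

    Every query pays 2 passes (initial answer + self-check); a flagged
    query pays 2 more (captioning + regeneration).
    """
    if not 0.0 <= p_flagged <= 1.0:
        raise ValueError("p_flagged must be a probability in [0, 1]")
    return 2.0 + 2.0 * p_flagged


if __name__ == "__main__":
    for p in (0.0, 0.25, 1.0):
        print(f"flag rate {p:.2f}: {expected_mllm_calls(p):.2f} passes per query")
```

On mostly benign traffic the overhead therefore stays close to one extra pass per query, and it approaches three extra passes only when nearly every query is flagged.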

Comparison to Prior Work

vs. Red-teaming: ECSO is training-free and doesn't require curating attack datasets.
vs. System Prompts: ECSO actively modifies the input modality (removing image) rather than just instructing the model.
vs. Auto-Moderation [Chen et al.]: ECSO removes the image during correction, whereas prior moderation methods keep the image, which often leads to persistent failure or refusal to answer.
vs. Pi et al. [48]: ECSO does not require external detectors or detoxifiers; it is self-contained.

Limitations

Relies on the underlying LLM's text-only safety capabilities; if the LLM is unsafe, ECSO fails.
Incurs increased inference latency due to multiple generation steps (detection, captioning, regeneration) for unsafe queries.
Image-to-text transformation may lose fine-grained visual details necessary for some benign tasks falsely flagged as unsafe.

Reproducibility

Code: https://github.com/gyhdog99/ECSO

The method is training-free and relies on prompting, so it should be straightforward to reproduce given the released code and the prompts in the paper.

📊 Experiments & Results

Evaluation Setup

Safety evaluation against visual jailbreaks and utility evaluation on standard MLLM benchmarks.

Benchmarks:

MM-SafetyBench (visual jailbreak; SD, OCR, and SD+OCR subsets)
VLSafe (visual jailbreak; text-based attacks with auxiliary images)
MME (General MLLM Utility (Perception & Cognition))
MM-Vet (General MLLM Utility)

Metrics:

Harmless Rate (HR)
Accuracy / Score (for utility)
Statistical methodology: Not explicitly reported in the paper
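The Harmless Rate metric is a simple proportion. A minimal sketch, assuming each response has already been labeled harmless or not by some judge (how that labeling is done is outside this sketch; the function name is hypothetical):

```python
from typing import Sequence


def harmless_rate(is_harmless: Sequence[bool]) -> float:
    """Percentage of responses labeled harmless."""
    if not is_harmless:
        raise ValueError("need at least one labeled response")
    return 100.0 * sum(is_harmless) / len(is_harmless)


if __name__ == "__main__":
    # Toy labels: three harmless responses out of four.
    print(harmless_rate([True, True, False, True]))
```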

Key Results

ECSO significantly improves safety (Harmless Rate, %) across multiple attack types in MM-SafetyBench and on VLSafe, using LLaVA-1.5-7B.
Benchmark	Metric	Baseline	This Paper	Δ
MM-SafetyBench (OCR)	Harmless Rate (%)	31.7	90.3	+58.6
MM-SafetyBench (SD+OCR)	Harmless Rate (%)	32.1	86.4	+54.3
VLSafe	Harmless Rate (%)	19.3	90.6	+71.3
Utility benchmarks show that ECSO largely preserves performance on benign tasks compared to direct prompting: MME-P drops slightly, while MME-C and MM-Vet improve.
Benchmark	Metric	Baseline	This Paper	Δ
MME-P (Perception)	Score	1521.8	1507.0	-14.8
MME-C (Cognition)	Score	312.1	342.5	+30.4
MM-Vet	GPT Score	31.2	32.3	+1.1
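The Δ column can be re-derived from the baseline and ECSO columns. The short check below copies the figures from the table; for the harmless-rate rows the differences are percentage points, not relative percentages. The variable names are illustrative.

```python
# (baseline, ECSO) pairs copied from the results table.
results = {
    "MM-SafetyBench (OCR)": (31.7, 90.3),
    "MM-SafetyBench (SD+OCR)": (32.1, 86.4),
    "VLSafe": (19.3, 90.6),
    "MME-P (Perception)": (1521.8, 1507.0),
    "MME-C (Cognition)": (312.1, 342.5),
    "MM-Vet": (31.2, 32.3),
}

# Absolute change per benchmark, rounded to one decimal like the table.
deltas = {
    name: round(ecso - baseline, 1)
    for name, (baseline, ecso) in results.items()
}

if __name__ == "__main__":
    for name, delta in deltas.items():
        print(f"{name}: {delta:+.1f}")
```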

Experiment Figures

Comparison of harmless rates with and without images on VLSafe dataset.

Accuracy of MLLMs in detecting unsafe content in their own responses.

Main Takeaways

MLLMs are vulnerable to visual jailbreaks but retain the ability to self-detect unsafe content with high accuracy (over 95%).
Removing the image and relying on text captions (Eyes Closed) effectively reactivates the safety mechanisms of the pre-aligned LLM.
Query-aware captioning is critical; generic captioning leads to significant performance drops on utility tasks.
ECSO can serve as a data engine to generate high-quality SFT data for safety alignment without human intervention, outperforming models trained on human-verified data.
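The data-engine idea in the last takeaway can be sketched as a loop that turns ECSO outputs into SFT training pairs. This is an illustration under stated assumptions: `ECSOOutput`, `build_sft_examples`, the `run_ecso` callable, and the choice to keep only samples whose initial response was flagged are all hypothetical, and the paper's actual data construction may differ.

```python
from typing import Callable, Dict, Iterable, List, NamedTuple, Tuple


class ECSOOutput(NamedTuple):
    # Hypothetical container for what an ECSO run returns.
    response: str          # final (possibly regenerated) response
    flagged_unsafe: bool   # True if the initial response was judged unsafe


def build_sft_examples(
    samples: Iterable[Tuple[str, str]],
    run_ecso: Callable[[str, str], ECSOOutput],
) -> List[Dict[str, str]]:
    """Collect (image, query, safe response) triples for supervised fine-tuning.

    Assumption: only samples where ECSO intervened are kept, since those are
    the cases where direct prompting produced an unsafe answer.
    """
    examples = []
    for image_path, query in samples:
        out = run_ecso(image_path, query)
        if out.flagged_unsafe:
            examples.append(
                {"image": image_path, "query": query, "response": out.response}
            )
    return examples
```

No human labeling appears in the loop: the unsafe flag and the replacement response both come from the model itself, which is what makes the procedure usable as an automatic data source.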

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (MLLMs)
Jailbreak attacks (visual/textual)
Safety alignment (RLHF, SFT)

Key Terms

MLLM: Multimodal Large Language Model—an AI system capable of processing both text and images (e.g., LLaVA, GPT-4V).

Jailbreak: An attack that tricks a model into bypassing its safety filters to generate harmful content.

OCR attack: A jailbreak method where malicious instructions are embedded as text within an image to evade text-based safety filters.

SFT: Supervised Fine-Tuning—training a model on labeled examples to teach it specific behaviors.

I2T: Image-to-Text transformation—converting visual information into a textual description (caption).

SD: Stable Diffusion—a generative model used here to create malicious images for testing.

Harmless Rate: The percentage of model responses that are considered safe/benign.

LLaVA: Large Language and Vision Assistant—an open-source MLLM architecture.