Evaluation Setup
Safety is evaluated against visual jailbreaks, and utility is evaluated on standard MLLM benchmarks.
Benchmarks:
- MM-SafetyBench: visual jailbreak (SD, OCR, SD+OCR)
- VLSafe: visual jailbreak (text-based attacks with auxiliary images)
- MME: general MLLM utility (perception and cognition)
- MM-Vet: general MLLM utility
Metrics:
- Harmless Rate (HR)
- Accuracy / Score (for utility)
- Statistical methodology: Not explicitly reported in the paper
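The summary does not define Harmless Rate. A minimal sketch, assuming HR is the percentage of model responses that a judge labels harmless; the function name `harmless_rate` and the boolean label format are illustrative, not taken from the paper:

```python
from typing import Sequence


def harmless_rate(judgements: Sequence[bool]) -> float:
    """Return the percentage of responses judged harmless.

    Assumption: each entry is True when a judge (human or model)
    labelled the corresponding response as harmless.
    """
    if not judgements:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(judgements) / len(judgements)


# Hypothetical labels for four responses: three harmless, one harmful.
example_labels = [True, True, False, True]
print(harmless_rate(example_labels))  # 75.0
```

Under this reading, a higher HR means fewer harmful completions, which is the direction of improvement reported in the results.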
Key Results
ECSO substantially improves safety (Harmless Rate) across multiple attack types on MM-SafetyBench and on VLSafe, using LLaVA-1.5-7B.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MM-SafetyBench (OCR) | Harmless Rate | 31.7 | 90.3 | +58.6 |
| MM-SafetyBench (SD+OCR) | Harmless Rate | 32.1 | 86.4 | +54.3 |
| VLSafe | Harmless Rate | 19.3 | 90.6 | +71.3 |

On utility benchmarks, ECSO largely preserves performance on benign tasks compared to direct prompting: MME-C and MM-Vet improve slightly, while MME-P drops by 14.8 points.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MME-P (Perception) | Score | 1521.8 | 1507.0 | -14.8 |
| MME-C (Cognition) | Score | 312.1 | 342.5 | +30.4 |
| MM-Vet | GPT Score | 31.2 | 32.3 | +1.1 |
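The Δ column is the difference between the "This Paper" and "Baseline" values. A small sketch that recomputes it from the reported numbers; the dictionary layout and the helper name `delta` are illustrative only:

```python
# Reported values: (baseline, this paper, reported delta).
RESULTS = {
    "MM-SafetyBench (OCR)": (31.7, 90.3, 58.6),
    "MM-SafetyBench (SD+OCR)": (32.1, 86.4, 54.3),
    "VLSafe": (19.3, 90.6, 71.3),
    "MME-P (Perception)": (1521.8, 1507.0, -14.8),
    "MME-C (Cognition)": (312.1, 342.5, 30.4),
    "MM-Vet": (31.2, 32.3, 1.1),
}


def delta(baseline: float, ours: float) -> float:
    """Difference rounded to one decimal, matching the table's precision."""
    return round(ours - baseline, 1)


for name, (baseline, ours, reported) in RESULTS.items():
    status = "ok" if delta(baseline, ours) == reported else "mismatch"
    print(f"{name}: {delta(baseline, ours):+.1f} ({status})")
```

Note that the three safety deltas are percentage-point differences in Harmless Rate, while the utility deltas are in each benchmark's own score units.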
Main Takeaways
- MLLMs are vulnerable to visual jailbreaks but retain the ability to self-detect unsafe content with high accuracy (over 95%).
- Removing the image and relying on text captions (Eyes Closed) effectively reactivates the safety mechanisms of the pre-aligned LLM.
- Query-aware captioning is critical; generic captioning leads to significant performance drops on utility tasks.
- ECSO can serve as a data engine to generate high-quality SFT data for safety alignment without human intervention, outperforming models trained on human-verified data.
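The takeaways above imply a control flow of roughly four steps: answer with the image, self-check the answer, caption the image with the query in mind, then answer again from text only. The following is a minimal sketch of that flow under stated assumptions, not the authors' implementation: the `Model` callable signature, the prompt wording, and the yes/no parsing are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical interface: model(prompt, image) -> response text.
# Passing image=None stands for a text-only ("eyes closed") call.
Model = Callable[[str, Optional[bytes]], str]


def ecso_answer(model: Model, query: str, image: bytes) -> str:
    # Step 1: ordinary multimodal answer.
    draft = model(query, image)

    # Step 2: ask the same model whether its own draft is harmful.
    # The prompt text here is an assumption, not the paper's wording.
    verdict = model(
        f"Is the following response harmful? Answer yes or no.\nResponse: {draft}",
        image,
    )
    if not verdict.strip().lower().startswith("yes"):
        return draft  # judged safe, keep the original answer

    # Step 3: query-aware caption, so the text keeps what the query needs.
    caption = model(
        f"Describe the image content relevant to this request: {query}", image
    )

    # Step 4: answer again without the image, using only the caption.
    return model(f"Image description: {caption}\nRequest: {query}", None)


def stub_model(prompt: str, image: Optional[bytes]) -> str:
    """Toy stand-in for an MLLM, used only to exercise the control flow."""
    if prompt.startswith("Is the following"):
        return "yes" if "UNSAFE" in prompt else "no"
    if prompt.startswith("Describe"):
        return "a caption"
    if image is None:
        return "safe text-only answer"
    return "UNSAFE draft" if "attack" in prompt else "benign answer"


print(ecso_answer(stub_model, "attack query", b"img"))
print(ecso_answer(stub_model, "what is this?", b"img"))
```

Benign queries return after step 2 with the original multimodal answer, so only flagged queries take the captioning path; passing the query into the captioning step is what "query-aware" refers to here.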