← Back to Paper List

SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models

Junfeng Fang, Yukai Wang, Ruipeng Wang, Zijun Yao, Kun Wang, An Zhang, Xiang Wang, Tat-Seng Chua
National University of Singapore, University of Science and Technology of China, Nanyang Technological University
arXiv.org (2025)
MM Reasoning Benchmark RL

📝 Paper Summary

AI Safety Multi-modal Large Reasoning Models (MLRMs)
SafeMLRM reveals that adding reasoning capabilities to multi-modal models catastrophically degrades their safety alignment, creating new vulnerabilities despite occasional self-correction.
Core Problem
Multi-modal Large Reasoning Models (MLRMs) integrate chain-of-thought capabilities into vision-language models, but it is unknown how this reasoning process affects safety and whether it introduces new vulnerabilities.
Why it matters:
  • Reasoning models are being deployed in high-stakes domains, making safety critical
  • Prior work focused on unimodal text reasoning, missing cross-modal risks (e.g., image-text attacks)
  • The 'reasoning tax' suggests capability improvements might fundamentally conflict with current safety alignment techniques
Concrete Example: When a base model like Qwen2.5-VL is asked about illegal activities, it refuses (ASR < 3%). Its reasoning-enhanced version, R1-Onevision, attempts to answer the same query with detailed steps, reaching a 50%+ attack success rate.
Key Novelty
Systematic Safety Auditing of MLRMs via OpenSafeMLRM
  • Establishes the 'Reasoning Tax': quantifying how SFT (Supervised Fine-Tuning) and RL (Reinforcement Learning) for reasoning degrades safety compared to base models
  • Identifies 'Safety Blind Spots': specific scenarios like Illegal Activity where reasoning models fail catastrophically compared to base models
  • Uncovers 'Emergent Self-Correction': a phenomenon where models generate unsafe reasoning steps but ultimately produce a safe final answer, hinting at residual alignment
Architecture
Architecture Figure Figure 1 & 2 (Conceptual)
Comparison bar charts of ASR and HR for Base MLLMs vs MLRMs under Vanilla and Jailbreak conditions.
Evaluation Highlights
  • MLRMs exhibit 37.44% higher jailbreaking success rates on average compared to their base MLLMs
  • In the 'Illegal Activity' scenario, MLRMs suffer ~25x higher attack rates than base models
  • 16.23% of unsafe reasoning chains are successfully overridden by safe final answers (Emergent Self-Correction)
Breakthrough Assessment
8/10
First systematic safety analysis of Multi-modal Large Reasoning Models. Reveals critical 'Reasoning Tax' and provides a much-needed evaluation toolkit.
×