← Back to Paper List

OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences

Ming Wen, Kun Yang, Jingyu Zhang, Yuxuan Liu, shiwen cui, Shouling Ji, Xingjun Ma
arXiv (2026)
MM Benchmark RL Reasoning

📝 Paper Summary

Multimodal Safety Alignment Benchmark Construction
OOD-MMSafe reveals that frontier MLLMs suffer from causal blindness regarding latent environmental hazards, and proposes CASPO to fix this by using the model's intrinsic reasoning as a dynamic safety reference.
Core Problem
Current MLLM safety alignment focuses on detecting malicious intent in text, failing to foresee hazardous physical consequences (causal blindness) when innocent-looking queries are paired with dangerous visual contexts.
Why it matters:
  • Autonomous agents deployed in the real world must anticipate the cascading physical outcomes of their actions to prevent irreversible harm (e.g., fires, explosions)
  • Existing benchmarks rely on 'intent-driven' safety (e.g., bomb-making instructions), missing 'consequence-driven' risks where danger emerges only from the specific environment state
  • Standard alignment (RLHF/DPO) hits a 'preference ceiling' on high-capacity models, where static data fails to improve—and sometimes degrades—complex safety reasoning
Concrete Example: If a user asks 'How do I turn this on?' while showing a gas stove with a leak, a standard model—seeing no malicious intent in the text—provides instructions, causing an explosion. A consequence-aware model would identify the leak and refuse.
Key Novelty
Consequence-Aware Safety Policy Optimization (CASPO)
  • Shifts alignment focus from 'intent detection' to 'causal projection' by extending the MDP state space to include terminal environmental consequences
  • Uses the model's own reasoning (guided by a safety constitution) as a dynamic 'moving target' for supervision, rather than relying on static human preference labels that lag behind model capability
  • Introduces token-level self-distillation rewards that encourage the model to match the probability distribution of its 'safer self' rather than a fixed dataset
Architecture
Architecture Figure Figure 1
Conceptual illustration of the Consequence-Driven Safety Paradigm vs. Intent/Situation-Driven Paradigms
Evaluation Highlights
  • Reduces risk identification failure ratio to 5.7% for Qwen3-VL-4B using CASPO, compared to ~51% failure in the base model
  • Identifies a 'preference ceiling' where standard DPO alignment yields a negative gain (-1.5%) on Qwen3-VL-4B, proving static alignment can be counter-productive
  • Reveals pervasive causal blindness in frontier models: Gemini-3-Pro fails to identify latent hazards in 29.7% of cases, and LLaVA-1.5-7B fails in 92.3%
Breakthrough Assessment
9/10
Identifies a fundamental flaw in current safety paradigms (causal blindness) and demonstrates that standard RLHF fails to solve it. Proposes a novel self-distillation solution (CASPO) that drastically reduces failure rates.
×