← Back to Paper List

IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, Jing Shao
arXiv.org (2025)
MM Agent Benchmark

📝 Paper Summary

Embodied AI Safety Vision-Language Model (VLM) Planning Benchmarks and Evaluation
IS-Bench assesses embodied agents by verifying if they mitigate dynamic safety risks in the correct procedural order during execution, rather than solely checking final states.
Core Problem
Existing embodied safety benchmarks are static or termination-oriented, failing to detect intermediate unsafe actions or dynamic risks that emerge only during interaction.
Why it matters:
  • Flawed VLM planning in household robots creates physical hazards (e.g., fires, contamination) that prevent real-world deployment
  • Checking only the final state misses temporary unsafe states (e.g., using a dirty plate and cleaning it later) that are still dangerous
  • Text-only or single-image benchmarks cannot evaluate an agent's ability to perceive risks that only become visible after an action (e.g., opening a cabinet)
Concrete Example: In a food preparation task, an agent might place an apple on a plate covered in stains (unsafe) and then wash the plate later. A termination-oriented evaluation would count this as safe because the plate is clean at the end, but IS-Bench detects the intermediate contamination risk.
Key Novelty
Process-Oriented Interactive Safety Evaluation
  • Defines 'Interactive Safety' as the ability to perceive emergent risks and execute mitigation steps in the correct order (Pre-caution vs. Post-caution)
  • Implements a process-oriented evaluation that triggers safety checks immediately before or after specific risk-prone actions, rather than just at the end of the task
  • Instantiates dynamic risks (e.g., hidden stains, precarious objects) in a high-fidelity physics simulator (OmniGibson) to test real-time perception
Architecture
Architecture Figure Figure 3
The IS-Bench Evaluation Framework, detailing the loop between the Agent and the OmniGibson Simulation.
Evaluation Highlights
  • Current state-of-the-art VLM agents (including GPT-4o and Gemini-2.5) achieve a Safe Success Rate of less than 40% on the benchmark
  • Safety-aware Chain-of-Thought (CoT) prompting improves interactive safety by an average of 9.3% across tested models
  • However, Safety-aware CoT creates a trade-off, decreasing overall task completion rates by an average of 9.4%
Breakthrough Assessment
9/10
Addresses a critical blind spot in embodied AI safety (process vs. outcome). The shift from static/termination checks to dynamic/procedural verification is a necessary step for deploying real-world agents.
×