IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

📝 Paper Summary

Embodied AI Safety, Vision-Language Model (VLM) Planning, Benchmarks and Evaluation

IS-Bench assesses embodied agents by verifying whether they mitigate dynamic safety risks in the correct procedural order during execution, rather than checking only the final state.

Core Problem

Existing embodied safety benchmarks are static or termination-oriented, failing to detect intermediate unsafe actions or dynamic risks that emerge only during interaction.

Why it matters:

Flawed VLM planning in household robots creates physical hazards (e.g., fires, contamination) that prevent real-world deployment
Checking only the final state misses temporary unsafe states (e.g., using a dirty plate and cleaning it later) that are still dangerous
Text-only or single-image benchmarks cannot evaluate an agent's ability to perceive risks that only become visible after an action (e.g., opening a cabinet)

Concrete Example: In a food preparation task, an agent might place an apple on a plate covered in stains (unsafe) and then wash the plate later. A termination-oriented evaluation would count this as safe because the plate is clean at the end, but IS-Bench detects the intermediate contamination risk.
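The difference between the two evaluation styles in this example can be sketched in a few lines. This is an illustrative toy, not IS-Bench code: the state keys (`plate_clean`, `apple_on_plate`), the action names, and the `is_safe` rule are all assumptions made up for the sketch.

```python
from typing import Dict, List, Optional, Tuple

State = Dict[str, bool]
Trace = List[Tuple[str, State]]  # (action, world state after that action)


def is_safe(state: State) -> bool:
    # Hypothetical safety rule: food must never rest on a dirty plate.
    return state["plate_clean"] or not state["apple_on_plate"]


def termination_oriented(trace: Trace) -> bool:
    # Looks only at the state reached after the last action.
    return is_safe(trace[-1][1])


def process_oriented(trace: Trace) -> bool:
    # Looks at the state reached after every action.
    return all(is_safe(state) for _, state in trace)


def first_violation(trace: Trace) -> Optional[str]:
    # Returns the first action that leads to an unsafe state, if any.
    for action, state in trace:
        if not is_safe(state):
            return action
    return None


# The agent places the apple on the stained plate, then washes the plate.
unsafe_trace: Trace = [
    ("place_apple_on_plate", {"plate_clean": False, "apple_on_plate": True}),
    ("wash_plate", {"plate_clean": True, "apple_on_plate": True}),
]

# The agent washes the plate first, then places the apple.
safe_trace: Trace = [
    ("wash_plate", {"plate_clean": True, "apple_on_plate": False}),
    ("place_apple_on_plate", {"plate_clean": True, "apple_on_plate": True}),
]

print(termination_oriented(unsafe_trace))  # True: final state looks fine
print(process_oriented(unsafe_trace))      # False: intermediate violation
print(first_violation(unsafe_trace))       # place_apple_on_plate
```

Both traces end in the same final state, so only the per-step check can tell them apart.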

Key Novelty

Process-Oriented Interactive Safety Evaluation

Defines 'Interactive Safety' as the ability to perceive emergent risks and execute mitigation steps in the correct order (Pre-caution vs. Post-caution)
Implements a process-oriented evaluation that triggers safety checks immediately before or after specific risk-prone actions, rather than just at the end of the task
Instantiates dynamic risks (e.g., hidden stains, precarious objects) in a high-fidelity physics simulator (OmniGibson) to test real-time perception

Architecture

The IS-Bench Evaluation Framework, detailing the loop between the Agent and the OmniGibson Simulation.

Evaluation Highlights

Current state-of-the-art VLM agents (including GPT-4o and Gemini-2.5) achieve a Safe Success Rate of less than 40% on the benchmark
Safety-aware Chain-of-Thought (CoT) prompting improves interactive safety by an average of 9.3% across tested models
However, Safety-aware CoT creates a trade-off, decreasing overall task completion rates by an average of 9.4%

Breakthrough Assessment

9/10

Addresses a critical blind spot in embodied AI safety (process vs. outcome). The shift from static/termination checks to dynamic/procedural verification is a necessary step for deploying real-world agents.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) for embodied task planning with safety constraints

Inputs: High-level language instruction L and visual observation sequence I_t

Outputs: Action sequence π = (a_0, ..., a_n) that satisfies task goals G_task and safety goals G_safe

Pipeline Flow

Scenario Instantiation (OmniGibson) -> Agent Perception -> VLM Planning -> Action Execution -> Safety Verification
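A minimal sketch of this loop, under loud assumptions: `ToyKitchen` is a hand-written stand-in and does not use the OmniGibson API, the two planners are scripted rules standing in for a VLM, and the single safety rule is invented for illustration. The sketch also shows a dynamic risk, since the plate's stain only appears in the observation once the cabinet is open.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]


@dataclass
class ToyKitchen:
    """Hypothetical stand-in for the simulator (not OmniGibson)."""
    state: State = field(default_factory=lambda: {
        "cabinet_open": False, "plate_clean": False, "apple_on_plate": False,
    })

    def observe(self) -> State:
        # Partial observability: the plate is visible only after opening.
        obs = {"cabinet_open": self.state["cabinet_open"]}
        if self.state["cabinet_open"]:
            obs["plate_clean"] = self.state["plate_clean"]
        return obs

    def step(self, action: str) -> None:
        effects = {
            "open_cabinet": ("cabinet_open", True),
            "wash_plate": ("plate_clean", True),
            "place_apple": ("apple_on_plate", True),
        }
        key, value = effects[action]  # KeyError on unknown actions
        self.state[key] = value


def careful_planner(instruction: str, obs: State, history: List[str]) -> str:
    if not obs["cabinet_open"]:
        return "open_cabinet"
    if not obs.get("plate_clean", False):
        return "wash_plate"  # mitigate the newly visible risk first
    return "done" if "place_apple" in history else "place_apple"


def careless_planner(instruction: str, obs: State, history: List[str]) -> str:
    if not obs["cabinet_open"]:
        return "open_cabinet"
    return "done" if "place_apple" in history else "place_apple"


def safety_check(state: State) -> bool:
    return state["plate_clean"] or not state["apple_on_plate"]


def run_episode(env: ToyKitchen, planner: Callable[[str, State, List[str]], str],
                instruction: str, max_steps: int = 10) -> Tuple[List[str], List[str]]:
    history: List[str] = []
    violations: List[str] = []
    for _ in range(max_steps):
        obs = env.observe()                          # Agent Perception
        action = planner(instruction, obs, history)  # Planning
        if action == "done":
            break
        env.step(action)                             # Action Execution
        history.append(action)
        if not safety_check(env.state):              # Safety Verification
            violations.append(action)
    return history, violations


careful = run_episode(ToyKitchen(), careful_planner, "serve the apple")
careless = run_episode(ToyKitchen(), careless_planner, "serve the apple")
print(careful)   # (['open_cabinet', 'wash_plate', 'place_apple'], [])
print(careless)  # (['open_cabinet', 'place_apple'], ['place_apple'])
```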

System Modules

Scenario Generator

Instantiates household tasks with 388 unique safety risks (dynamic and static)

Model or implementation: OmniGibson Simulator

VLM Planner

Generates executable actions based on observations and instructions

Model or implementation: Various VLMs (GPT-4o, Gemini-2.5, Claude-3.7, etc.)

Safety Evaluator

Verifies safety constraints using triggers (Pre/Post-caution) during execution

Model or implementation: Rule-based Checker (PDDL)

Novel Architectural Elements

Trigger-based verification system that binds safety goals (G_safe) to specific actions (a_risk) via a trigger relation (R)
Dual-format safety goals: Natural language for the agent and PDDL predicates for the automated evaluator
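The two elements above can be pictured with a small data structure. This is a sketch under assumptions, not the benchmark's implementation: the `SafetyGoal` class, its field names, and the Python callables standing in for PDDL predicates are hypothetical. In particular, checking a post-caution goal against the final state of the trace is an assumption of this sketch; the summary does not say exactly when post-caution checks fire.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]
Step = Tuple[str, State]  # (action, world state after that action)


@dataclass(frozen=True)
class SafetyGoal:
    description: str                    # natural-language form for the agent
    predicate: Callable[[State], bool]  # stand-in for a PDDL predicate
    risk_action: str                    # the action a_risk this goal binds to
    trigger: str                        # trigger relation R: "pre" or "post"


def check_goal(goal: SafetyGoal, initial: State, steps: List[Step]) -> bool:
    previous = initial
    for action, after in steps:
        if action == goal.risk_action:
            if goal.trigger == "pre":
                # Must already hold right before the risky action runs.
                return goal.predicate(previous)
            # Assumption: a post-caution must hold by the end of the trace.
            return goal.predicate(steps[-1][1])
        previous = after
    return True  # assumption: an untriggered goal counts as satisfied


goals = [
    SafetyGoal("ensure stove is clear before turning on",
               lambda s: s["stove_clear"], "turn_on_stove", "pre"),
    SafetyGoal("turn off stove after cooking",
               lambda s: not s["stove_on"], "cook", "post"),
]

initial = {"stove_clear": False, "stove_on": False}

safe_steps: List[Step] = [
    ("clear_stove", {"stove_clear": True, "stove_on": False}),
    ("turn_on_stove", {"stove_clear": True, "stove_on": True}),
    ("cook", {"stove_clear": True, "stove_on": True}),
    ("turn_off_stove", {"stove_clear": True, "stove_on": False}),
]
unsafe_steps: List[Step] = [
    ("turn_on_stove", {"stove_clear": False, "stove_on": True}),
    ("cook", {"stove_clear": False, "stove_on": True}),
]

safe_results = [check_goal(g, initial, safe_steps) for g in goals]
unsafe_results = [check_goal(g, initial, unsafe_steps) for g in goals]
print(safe_results)    # [True, True]
print(unsafe_results)  # [False, False]
```

Keeping the description and the predicate in one record mirrors the dual-format idea: the agent only ever sees `description`, while the evaluator only ever calls `predicate`.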

Modeling

Base Model: Evaluated multiple models: GPT-4o, Gemini-2.5, Claude-3.7-Sonnet, Qwen2.5-VL, InternVL3, Llama-3.2

Training Method: The paper presents a benchmark and evaluates pre-trained models via prompting (Zero-shot, CoT)

Training Data: No training is performed; the benchmark data consists of:

161 interactive scenarios derived from Behavior-1K
388 safety risks based on 30 safety principles (OSHA/HSE standards)

Comparison to Prior Work

vs. SafePlanBench: IS-Bench evaluates multi-modal perception of risks rather than just text reasoning
vs. MSSBench: IS-Bench is interactive, allowing agents to uncover risks hidden in the initial view (dynamic risks) vs. static image analysis
vs. All Prior Work: IS-Bench uses process-oriented evaluation (checking intermediate steps) rather than termination-oriented evaluation (checking the final state only)

Limitations

Requires high-fidelity simulation (OmniGibson), which is computationally heavier than text or static image benchmarks
Currently limited to household tasks; does not cover industrial or outdoor safety scenarios
Success relies heavily on the underlying VLM's ability to map visual observations to PDDL-style states
Safety-aware CoT prompting improves safety but lowers task completion rates by an average of 9.4%

Reproducibility

Code: https://github.com/AI45Lab/IS-Bench

Benchmark code and scenarios are publicly available in the repository linked above. The dataset includes fine-grained annotations for risk-prone steps and safety goals. All scenarios run in the OmniGibson simulator.

📊 Experiments & Results

Evaluation Setup

Interactive embodied agents performing household tasks in OmniGibson

Benchmarks:

IS-Bench (Embodied Task Planning with Safety Constraints) [New]

Metrics:

Safe Success Rate (SSR)
Success Rate (SR)
Safety Recall (SRec)
Safety Awareness (SA)
Statistical methodology: Not explicitly reported in the paper
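The summary names these four metrics without defining them, so the sketch below rests on assumed definitions that may differ from the paper's: SR is the share of episodes whose task goal is met, SSR is the share of episodes whose task goal is met with every safety goal satisfied, SRec is the share of all safety goals that were satisfied, and SA is the share of all risks the agent explicitly identified. The `Episode` record and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Episode:
    """Hypothetical per-episode evaluation record."""
    task_success: bool
    goals_total: int       # safety goals attached to the episode
    goals_satisfied: int   # safety goals the agent actually satisfied
    risks_identified: int  # risks the agent named when asked


def compute_metrics(episodes: List[Episode]) -> Dict[str, float]:
    n = len(episodes)
    total_goals = sum(e.goals_total for e in episodes)
    return {
        # Assumed definition: task goal reached.
        "SR": sum(e.task_success for e in episodes) / n,
        # Assumed definition: task goal reached with no safety goal missed.
        "SSR": sum(e.task_success and e.goals_satisfied == e.goals_total
                   for e in episodes) / n,
        # Assumed definition: satisfied safety goals over all safety goals.
        "SRec": sum(e.goals_satisfied for e in episodes) / total_goals,
        # Assumed definition: identified risks over all risks.
        "SA": sum(e.risks_identified for e in episodes) / total_goals,
    }


# Made-up records for illustration only, not results from the paper.
episodes = [
    Episode(task_success=True, goals_total=2, goals_satisfied=2, risks_identified=2),
    Episode(task_success=True, goals_total=2, goals_satisfied=1, risks_identified=1),
    Episode(task_success=False, goals_total=3, goals_satisfied=3, risks_identified=2),
    Episode(task_success=True, goals_total=1, goals_satisfied=1, risks_identified=1),
]

metrics = compute_metrics(episodes)
print(metrics)  # {'SR': 0.75, 'SSR': 0.5, 'SRec': 0.875, 'SA': 0.75}
```

Under these assumed definitions SSR can never exceed SR, which is why a gap between the two indicates tasks completed unsafely.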

Key Results

Evaluation of leading VLMs reveals a significant gap in interactive safety capabilities and a trade-off when using safety-focused prompting.

Benchmark	Metric	Baseline	This Paper	Δ
IS-Bench	Safe Success Rate (SSR)	100.0	40.0	-60.0
IS-Bench	Interactive Safety Improvement	0.0	9.3	+9.3
IS-Bench	Task Success Rate Change	0.0	-9.4	-9.4

Experiment Figures

Comparison of different safety evaluation paradigms (Instruction Safety vs. Interactive Safety) and why static/termination checks fail.

Main Takeaways

Current VLM agents struggle significantly with interactive safety, often failing to mitigate dynamic risks (<40% safe success rate).
Process-oriented evaluation reveals unsafe behaviors (intermediate violations) that termination-oriented benchmarks miss.
Safety-aware Chain-of-Thought prompting is effective for safety (+9.3%) but detrimental to task completion (-9.4%), suggesting current models struggle to balance constraints.
The primary bottleneck is identified as perception and awareness—agents fail to 'see' the risk before acting.

📚 Prerequisite Knowledge

Prerequisites

Basics of Embodied AI and VLM agents
Markov Decision Processes (MDP/POMDP)
Chain-of-Thought (CoT) prompting
PDDL (Planning Domain Definition Language) predicates

Key Terms

VLM: Vision-Language Model—AI models that can process both images and text to generate text or actions

PDDL: Planning Domain Definition Language—a standardized language used to define states, actions, and goals in planning problems using logical predicates

POMDP: Partially Observable Markov Decision Process—a mathematical framework for decision-making where the agent cannot directly observe the full state of the world

CoT: Chain-of-Thought—a prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer

OmniGibson: A high-fidelity physics simulator used for training and evaluating embodied AI agents in realistic household environments

Pre-caution: A safety condition that must be satisfied BEFORE a specific risk-prone action is taken (e.g., 'ensure stove is clear before turning on')

Post-caution: A safety condition that must be satisfied AFTER a specific action is taken (e.g., 'turn off stove after cooking')

Process-oriented evaluation: Evaluating an agent's performance by checking constraints at specific steps during execution, rather than only checking the final result