
Foundational World Models Accurately Detect Bimanual Manipulator Failures

Isaac R. Ward, Michelle Ho, Houjun Liu, Aaron Feldman, Joseph Vincent, Liam Kruse, Sean Cheong, Duncan Eddy, Mykel J. Kochenderfer, Mac Schwager
arXiv (2026)
MM Benchmark

📝 Paper Summary

Robotic Failure Detection, World Models, Anomaly Detection
This paper presents a probabilistic world model trained in the latent space of a vision foundation model to detect bimanual robot failures by identifying high-uncertainty predictions.
Core Problem
Detecting failures in bimanual manipulators is difficult because defining explicit failure modes in high-dimensional state spaces (visual + proprioceptive) is infeasible, and classical methods struggle with temporal correlations.
Why it matters:
  • Failures in high-stakes environments (e.g., data centers) can cause property damage, delays, and safety risks
  • Bimanual tasks require tight coordination, so small errors can cascade into complex failure modes that simple thresholds fail to capture
  • Existing distribution modeling approaches (like normalizing flows) often assign high likelihoods to anomalous data or focus on low-level pixel correlations rather than semantic behavior
Concrete Example: In a data center cable maintenance task, a robot might drop a cable while attempting to plug it in. A standard monitor might miss this if the kinematics look normal, but the proposed model detects the mismatch between the expected future visual state (plugged in) and the actual observation (dropped cable).
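The mismatch in this example can be turned into a single anomaly score. The sketch below is illustrative only: it assumes the world model outputs a diagonal Gaussian (a mean and variance per latent dimension), which the summary does not state, and the function name `gaussian_nll` and all numbers are made up for the example. It scores an observed latent state by its negative log-likelihood under the prediction, so an observation far from the forecast receives a high score.

```python
import math

def gaussian_nll(observed, mean, var):
    """Negative log-likelihood of `observed` under a diagonal Gaussian.

    Assumption: the predictive distribution is a diagonal Gaussian with
    per-dimension `mean` and `var`; the paper's actual form may differ.
    """
    return sum(
        0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(observed, mean, var)
    )

# Hypothetical 3-dimensional latent forecast ("cable plugged in").
predicted_mean = [0.0, 1.0, -0.5]
predicted_var = [0.1, 0.1, 0.1]

# An observation close to the forecast, and one far from it ("cable dropped").
nominal_obs = [0.05, 0.95, -0.45]
failure_obs = [1.5, -0.8, 0.9]

nominal_score = gaussian_nll(nominal_obs, predicted_mean, predicted_var)
failure_score = gaussian_nll(failure_obs, predicted_mean, predicted_var)
```

Under these made-up values the dropped-cable observation scores higher than the nominal one, which is the signal a downstream threshold would act on.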
Key Novelty
Latent Space World Model for Anomaly Detection
  • Leverages a pre-trained vision foundation model (NVIDIA Cosmos Tokenizer) to compress high-dimensional robot observations into a compact latent space
  • Trains a probabilistic transformer, using only nominal data, to forecast future latent states; high predictive uncertainty or prediction error during deployment is treated as a signal of failure
  • Calibrates failure thresholds using conformal prediction to provide statistical guarantees on false alarm rates
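The calibration step in the last bullet can be sketched with standard split conformal prediction. This is a minimal sketch, not the authors' exact procedure: it assumes each held-out nominal episode is reduced to one scalar anomaly score, and the names `conformal_threshold` and `is_failure` are hypothetical. Given n calibration scores and a target false alarm rate alpha, the threshold is the ceil((n + 1)(1 - alpha))-th smallest score; if the calibration scores and a new nominal score are exchangeable, the new score exceeds this threshold with probability at most alpha.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Split-conformal threshold for a target false alarm rate `alpha`.

    `calibration_scores` are anomaly scores from held-out nominal runs.
    Returns infinity when there are too few scores to support `alpha`.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return math.inf
    return sorted(calibration_scores)[k - 1]

def is_failure(score, threshold):
    """Flag a failure when the deployment score exceeds the threshold."""
    return score > threshold

# Hypothetical scores from seven nominal calibration episodes.
calibration = [3.0, 1.0, 7.0, 2.0, 5.0, 4.0, 6.0]
threshold = conformal_threshold(calibration, alpha=0.25)
```

With seven calibration scores and alpha = 0.25, the rank is ceil(8 × 0.75) = 6, so the threshold is the sixth smallest score. A smaller alpha requires more calibration data before the threshold becomes finite.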
Architecture
Figure 1: The system architecture for the world-model-based failure detector.
Evaluation Highlights
  • Outperforms the next-best learning-based approach by 3.8% in failure detection rate on the Bimanual Cable Manipulation dataset
  • Achieves this performance with fewer than 600k trainable parameters, approximately 1/20th of the parameters required by the next-best approach
  • Successfully detects diverse failure modes (e.g., color changes, friction changes) in the simulated Push-T environment where baselines struggle
Breakthrough Assessment
7/10
Demonstrates an efficient application of foundation models to robotic safety, cutting the trainable parameter count to roughly 1/20th of the next-best approach while improving detection rates. The introduction of a real-world bimanual dataset is a valuable contribution.