
The Boiling Frog Threshold: Criticality and Blindness in World Model-Based Anomaly Detection Under Gradual Drift

Zhe Hong
National University of Singapore
arXiv (2026)
RL Benchmark

📝 Paper Summary

World Models · Anomaly Detection · Safety and Robustness in RL
World model-based agents exhibit a sharp, universal detection threshold for gradual drift but are completely blind to sinusoidal perturbations and may suffer catastrophic collapse before detection occurs.
Core Problem
Current RL agents using world models for self-monitoring can detect abrupt changes, but their ability to notice gradual, imperceptible sensor corruption is unknown.
Why it matters:
  • Real-world sensor degradation (fogging, calibration drift) is rarely abrupt, making 'boiling frog' failures a critical safety risk
  • Existing anomaly detection methods focus on abrupt changes or static datasets, missing the dynamics of gradual drift in active agents
  • If agents cannot detect drift before policy collapse, they are fundamentally unsafe for long-term deployment
Concrete Example: In the Hopper environment, a robot's velocity sensors drift gradually. At intermediate drift intensities, the robot falls over (policy collapse) within ~25 steps, but the detector needs ~50 steps of data to fire, so the policy fails before the detector can raise an alarm.
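The timing mismatch in this example can be illustrated with a minimal sketch. It assumes a hypothetical detector that averages prediction residuals over a fixed 50-step window and a toy error model in which the residual grows linearly with the drift. The function name, the noise level, and the alarm rule are illustrative assumptions, not the paper's implementation; only the ~25-step collapse and ~50-step data requirement come from the summary above.

```python
import numpy as np

def detection_step(eps, window=50, noise_std=0.1, k=3.0, horizon=500, seed=0):
    """First step at which a windowed mean-residual detector fires, or None.

    Assumed toy model: the residual at step t is eps * t plus Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    steps = np.arange(1, horizon + 1)
    residuals = eps * steps + rng.normal(0.0, noise_std, size=horizon)
    # Alarm level: k standard errors of a window mean under the no-drift baseline.
    threshold = k * noise_std / np.sqrt(window)
    # The detector cannot evaluate anything until one full window is available.
    for t in range(window, horizon + 1):
        if residuals[t - window:t].mean() > threshold:
            return t
    return None

COLLAPSE_STEP = 25  # approximate Hopper collapse time quoted in the summary
fired_at = detection_step(eps=0.05)
collapse_before_awareness = fired_at is None or COLLAPSE_STEP < fired_at
```

Under these assumptions the detector fires as soon as its first window fills, yet that is still after the collapse step, so shortening the reaction time would require a shorter window rather than a more sensitive alarm level.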
Key Novelty
The Boiling Frog Threshold (ε*)
  • Identifies a sharp sigmoid boundary separating invisible drift from rapid detection, invariant across detector types and model capacities
  • Discovers 'Sinusoidal Blindness': world models absorb periodic drift as normal variation, a side effect of optimizing model evidence, which renders it invisible to all prediction-error-based detectors
  • Reframes detection thresholds as a three-way interaction between noise floor structure, detector sensitivity, and environment dynamics, rather than a simple signal-to-noise ratio
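One way to write down the sharp sigmoid boundary described above is a logistic detection-rate curve. The parametrization below is an assumption made for illustration: the slope parameter $k$ and the choice of a linear (rather than logarithmic) scale in $\varepsilon$ are not taken from the paper.

```latex
P_{\mathrm{detect}}(\varepsilon) = \frac{1}{1 + \exp\bigl(-k\,(\varepsilon - \varepsilon^{*})\bigr)},
\qquad
P_{\mathrm{detect}}(\varepsilon^{*}) = \tfrac{1}{2}
```

In this form, a large $k$ corresponds to the sharp transition from ~0% to ~100% detection, and $\varepsilon^{*}$ is the drift intensity at which half of the runs are detected.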
Evaluation Highlights
  • Sinusoidal drift is completely undetectable (0% detection rate) across all environments and detector families, even at high intensities (ε=0.5)
  • Threshold existence is universal: all 576 linear drift conditions show a sharp sigmoid transition from ~0% to ~100% detection
  • In Hopper, 'Collapse Before Awareness' occurs at ε=0.05, where the agent fails in ~25 steps while detectors need roughly twice as long (~50 steps of data) to trigger
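Given a sweep of detection rates over drift intensities, a threshold such as ε* could be read off as the 50% crossing. The sketch below is a hypothetical helper, not the paper's estimation procedure. It interpolates linearly in log ε between the two bracketing sweep points, which is itself an assumption, and returns None when the detection rate never reaches 50%, as would happen for the 0% detection rates reported for sinusoidal drift.

```python
import numpy as np

def estimate_threshold(eps_values, detection_rates, level=0.5):
    """Estimate the drift intensity at which the detection rate crosses `level`."""
    eps = np.asarray(eps_values, dtype=float)
    rates = np.asarray(detection_rates, dtype=float)
    order = np.argsort(eps)
    eps, rates = eps[order], rates[order]

    above = np.nonzero(rates >= level)[0]
    if above.size == 0:
        return None  # never detected anywhere in the sweep
    i = above[0]
    if i == 0:
        return float(eps[0])  # already detected at the smallest intensity

    # Interpolate in log-eps between the last point below and the first point at or above.
    x0, x1 = np.log(eps[i - 1]), np.log(eps[i])
    y0, y1 = rates[i - 1], rates[i]
    frac = (level - y0) / (y1 - y0)
    return float(np.exp(x0 + frac * (x1 - x0)))
```

A finer sweep around the crossing would tighten the estimate. With a transition as sharp as the one described, the two bracketing points largely determine the result.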
Breakthrough Assessment
9/10
Reveals fundamental, qualitative limitations (sinusoidal blindness, collapse before awareness) in a widely used paradigm (world model monitoring). The findings challenge the assumption that better models yield better monitoring.