The Boiling Frog Threshold: Criticality and Blindness in World Model-Based Anomaly Detection Under Gradual Drift

📝 Paper Summary

World Models · Anomaly Detection · Safety and Robustness in RL

World model-based agents exhibit a sharp, universal detection threshold for gradual linear drift, are completely blind to sinusoidal drift, and may suffer catastrophic policy collapse before detection occurs.

Core Problem

Current RL agents using world models for self-monitoring can detect abrupt changes, but their ability to notice gradual, imperceptible sensor corruption is unknown.

Why it matters:

Real-world sensor degradation (fogging, calibration drift) is rarely abrupt, making 'boiled frog' failures a critical safety risk
Existing anomaly detection methods focus on abrupt changes or static datasets, missing the dynamics of gradual drift in active agents
If agents cannot detect drift before policy collapse, they are fundamentally unsafe for long-term deployment

Concrete Example: In the Hopper environment, a robot's velocity sensors drift gradually. At intermediate drift intensities, the robot falls over (policy collapse) within ~25 steps, but the detector needs ~50 steps of data to fire, meaning the agent dies before it realizes anything is wrong.

Key Novelty

The Boiling Frog Threshold (ε*)

Identifies a sharp sigmoid boundary separating invisible drift from rapid detection, invariant across detector types and model capacities
Discovers 'Sinusoidal Blindness': world models absorb periodic drift as normal variation (optimizing model evidence), rendering it invisible to all prediction-error-based detectors
Reframes detection thresholds as a three-way interaction between noise floor structure, detector sensitivity, and environment dynamics, rather than a simple signal-to-noise ratio

Evaluation Highlights

Sinusoidal drift is completely undetectable (0% detection rate) across all environments and detector families, even at high intensities (ε=0.5)
Threshold existence is universal: all 576 linear drift conditions show a sharp sigmoid transition from ~0% to ~100% detection
In Hopper, 'Collapse Before Awareness' occurs at ε=0.05, where the agent fails in ~25 steps while detectors require significantly longer to trigger
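
The threshold ε* described above can be read off a measured detection curve. The sketch below is one possible way to do that, not the paper's procedure: it assumes ε* is taken as the intensity where the detection rate first crosses 50%, interpolated linearly in log10(ε). The function name `estimate_threshold`, the 50% level, and the log-scale interpolation are all assumptions made for illustration.

```python
import math

def estimate_threshold(intensities, detection_rates, level=0.5):
    """Estimate the intensity at which the detection rate first crosses `level`.

    Assumes `intensities` is sorted ascending and strictly positive, and that
    `detection_rates` holds the matching detection rates in [0, 1].
    Interpolates linearly in log10(intensity) between the two bracketing points.
    Returns None if the curve never crosses `level` (e.g. a flat 0% curve).
    """
    points = list(zip(intensities, detection_rates))
    for (e0, r0), (e1, r1) in zip(points, points[1:]):
        if r0 < level <= r1:
            frac = (level - r0) / (r1 - r0)
            log_eps = math.log10(e0) + frac * (math.log10(e1) - math.log10(e0))
            return 10 ** log_eps
    return None

# Hypothetical, made-up curve for illustration only (not data from the paper).
example_eps = [1e-3, 1e-2, 1e-1]
example_rates = [0.0, 0.0, 1.0]
eps_star = estimate_threshold(example_eps, example_rates)
```

A curve that stays at 0% for every intensity, as reported for sinusoidal drift, yields `None` under this definition: there is no threshold to estimate.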

Breakthrough Assessment

9/10

Reveals fundamental, qualitative limitations (sinusoidal blindness, collapse before awareness) in a widely used paradigm (world model monitoring). The findings challenge the assumption that better models yield better monitoring.

⚙️ Technical Details

Problem Definition

Setting: Self-monitoring in RL agents under continuous observation drift

Inputs: Stream of observations s_t and actions a_t from an environment undergoing gradual drift g(t)

Outputs: Binary anomaly flag (drift detected vs. normal)
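
To make the drift term g(t) concrete, here is a minimal sketch of the two drift shapes the summary discusses. The exact formulas are not given in this summary, so the linear ramp, the sine form, and the parameters `horizon`, `period`, and `drift_dims` are assumptions for illustration, not the paper's definitions.

```python
import math

def linear_drift(t, eps, horizon):
    """Assumed form: bias ramps linearly from 0 to eps over `horizon` steps, then holds."""
    return eps * min(t, horizon) / horizon

def sinusoidal_drift(t, eps, period):
    """Assumed form: zero-mean periodic bias with amplitude eps."""
    return eps * math.sin(2.0 * math.pi * t / period)

def corrupt(obs, bias, drift_dims):
    """Return a copy of `obs` with `bias` added to the selected dimensions.

    `drift_dims` is a hypothetical list of indices; the summary states only
    that drift was applied to velocity dimensions.
    """
    out = list(obs)
    for i in drift_dims:
        out[i] += bias
    return out
```

The agent and the world model would then see `corrupt(obs, g(t), drift_dims)` in place of the true observation at each step.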

Pipeline Flow

Agent-Environment Loop: Policy interacts with environment → generates transitions
World Model: Predicts next state s'_{t+1} from (s_t, a_t)
Monitor: Calculates Prediction Error (PE) = ||s'_{t+1} - s_{t+1}||^2
Detector: Analyzes PE stream (using DI, Variance, or Percentile logic) → Outputs Anomaly Flag
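
The four stages above can be sketched as a single loop. This is a simplified illustration under stated assumptions: `world_model` and `detector` are hypothetical callables standing in for the trained MLP and for any of the three detector families, and the toy model at the bottom is invented purely so the example runs.

```python
def prediction_error(predicted, observed):
    """Squared L2 distance between the predicted and observed next state."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))

def monitor(transitions, world_model, detector):
    """Run the monitoring loop over (s_t, a_t, s_next) transitions.

    `world_model(s, a)` returns a predicted next state.
    `detector(pe)` returns True when it flags an anomaly.
    Returns the index of the first flagged step, or None if nothing is flagged.
    """
    for step, (s, a, s_next) in enumerate(transitions):
        pe = prediction_error(world_model(s, a), s_next)
        if detector(pe):
            return step
    return None

# Toy stand-ins (assumptions, not the paper's components): a one-dimensional
# system whose true dynamics are s_next = s + a, observed with a growing bias.
def toy_world_model(s, a):
    return [s[0] + a[0]]

toy_transitions = [([0.0], [1.0], [1.0 + 0.02 * t]) for t in range(30)]
first_flag = monitor(toy_transitions, toy_world_model, lambda pe: pe > 0.05)
```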

System Modules

World Model

Predict next state to establish a baseline of 'normality'

Model or implementation: 3-layer MLP (Hidden sizes: 128, 512, or 1024)

Doubt Index Detector (Detection)

Detect anomalies based on z-score of smoothed prediction error

Model or implementation: Statistical Rule
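
A minimal sketch of a z-score detector over smoothed prediction error follows. The summary says only that the Doubt Index uses a z-score of EMA-smoothed prediction error, so the calibration on a clean baseline, the smoothing factor `alpha`, and the cutoff `z_threshold` are assumptions, and the class is not the paper's implementation.

```python
import statistics

class DoubtIndexDetector:
    """Flags drift when the EMA of prediction error sits far above its baseline."""

    def __init__(self, baseline_pe, alpha=0.05, z_threshold=3.0):
        # Baseline statistics come from prediction errors on drift-free data.
        self.mu = statistics.fmean(baseline_pe)
        self.sigma = statistics.pstdev(baseline_pe) or 1e-12  # avoid divide-by-zero
        self.alpha = alpha              # assumed EMA smoothing factor
        self.z_threshold = z_threshold  # assumed z-score cutoff
        self.ema = self.mu

    def update(self, pe):
        """Consume one prediction error; return True if the detector fires."""
        self.ema = self.alpha * pe + (1.0 - self.alpha) * self.ema
        z = (self.ema - self.mu) / self.sigma
        return z > self.z_threshold
```

Because the EMA needs several steps to move, a detector of this shape reacts with a delay even when the error jumps sharply, which is the kind of lag that matters when the policy can fail quickly.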

Variance Detector (Detection)

Detect anomalies based on changes in prediction error variance

Model or implementation: Statistical Rule
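
One way to detect a change in prediction error variance is to compare a rolling-window variance against the variance measured on drift-free data. The sketch below does that. The window length, the ratio cutoff, and the choice to flag only increases in variance are assumptions; the summary does not specify the rule.

```python
from collections import deque
import statistics

class VarianceDetector:
    """Flags drift when recent prediction-error variance exceeds the baseline."""

    def __init__(self, baseline_pe, window=50, ratio_threshold=4.0):
        self.base_var = statistics.pvariance(baseline_pe) or 1e-12
        self.buf = deque(maxlen=window)         # assumed rolling window
        self.ratio_threshold = ratio_threshold  # assumed variance-ratio cutoff

    def update(self, pe):
        """Consume one prediction error; return True if the detector fires."""
        self.buf.append(pe)
        if len(self.buf) < self.buf.maxlen:
            return False  # not enough data yet to estimate variance
        recent_var = statistics.pvariance(list(self.buf))
        return recent_var / self.base_var > self.ratio_threshold
```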

Percentile Detector (Detection)

Detect anomalies based on raw prediction error quantiles (no smoothing)

Model or implementation: Statistical Rule
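
A quantile rule on raw prediction error can be sketched as follows. It assumes the cutoff is a high percentile of drift-free prediction errors and that a single exceedance counts as a flag; both choices are assumptions, since the summary states only that this detector uses raw quantiles without smoothing.

```python
import statistics

class PercentileDetector:
    """Flags a step whose raw prediction error exceeds a baseline percentile."""

    def __init__(self, baseline_pe, percentile=99):
        # quantiles(n=100) returns 99 cut points; index p-1 is the p-th percentile.
        cuts = statistics.quantiles(baseline_pe, n=100, method="inclusive")
        self.threshold = cuts[percentile - 1]

    def update(self, pe):
        """Consume one prediction error; return True if it exceeds the cutoff."""
        return pe > self.threshold
```

With no smoothing, a rule like this responds on the very step an outlier appears, at the cost of a per-step false positive rate set by the chosen percentile.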

Novel Architectural Elements

Integration of three distinct detector families (Z-score, Variance, Percentile) operating on the same World Model PE stream to isolate detector-agnostic phenomena

Modeling

Base Model: 3-layer MLP (ReLU activations)

Training Method: Supervised Learning (MSE minimization)

Objective Functions:

Purpose: Minimize prediction error on collected transitions.

Formally: MSE = ||f_theta(s_t, a_t) - s_{t+1}||^2
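
Written over a dataset, the objective would presumably average the per-transition error above. The form below is a reading of "mean squared error" over N collected transitions, with N and the sample index i introduced here for illustration; the summary gives only the per-transition expression.

```latex
\mathcal{L}(\theta) \;=\; \frac{1}{N} \sum_{i=1}^{N}
  \left\lVert f_\theta\!\left(s_t^{(i)}, a_t^{(i)}\right) - s_{t+1}^{(i)} \right\rVert^2
```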

Training Data:

Transitions collected by a pre-trained PPO policy
10^6 steps of interaction per environment

Key Hyperparameters:

hidden_sizes: Small (128), Medium (512), Large (1024)
drift_intensities: 16 values from 10^-4 to 0.5

Compute: Not reported in the paper

Comparison to Prior Work

vs. Domberg (2025): Focuses on gradual drift rather than abrupt changes; finds thresholds invisible to abrupt detectors
vs. CUSUM/Page-Hinkley: Evaluates within an active RL loop where policy collapse interacts with detection delay
vs. Standard OOD [not cited in paper]: Investigates temporal dynamics of drift (linear vs. sinusoidal) rather than static distributional shift

Limitations

Studied only in MuJoCo locomotion environments; generalization to visual/complex tasks unknown
Used simple MLP world models; latent dynamics models (such as Dreamer) might behave differently
Drift applied only to velocity dimensions; other corruption types not tested
The Collapse Before Awareness (CBA) phenomenon is highly environment-dependent (strong in Hopper, absent in HalfCheetah)

Reproducibility

No code URL provided. Methodological details (hyperparameters, drift formulas, detector definitions) are described in the text. Each condition uses 10 seeds, and results are reported with Wilson score confidence intervals.

📊 Experiments & Results

Evaluation Setup

Four MuJoCo-v5 environments (HalfCheetah, Hopper, Walker2d, Ant) under induced sensor drift

Benchmarks:

HalfCheetah-v5 (Locomotion)
Hopper-v5 (Locomotion, fragile)
Walker2d-v5 (Locomotion)
Ant-v5 (Locomotion)

Metrics:

Detection Rate (at various drift intensities)
Survival Gap (time to collapse minus time to detection, in steps; a negative value means the policy collapses before the detector fires)
Statistical methodology: Wilson score 95% confidence intervals; R^2 for power law fits
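
For readers unfamiliar with the Wilson score interval, the standard formula for a binomial proportion can be computed as below. The function name is illustrative, and `z=1.96` is the usual normal quantile for a 95% interval; how the paper pooled runs into `trials` is not stated in this summary.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score confidence interval for a binomial proportion.

    `successes` out of `trials` (e.g. detections out of runs).
    `z` is the normal quantile; 1.96 corresponds to a 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

Unlike the naive normal-approximation interval, this one does not collapse to zero width when the observed rate is exactly 0% or 100%, which matters for a detection curve that saturates at both ends.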

Key Results

Sinusoidal blindness results demonstrating that periodic drift is invisible to all detectors.

Benchmark	Metric	Baseline	This Paper	Δ
All environments	Detection Rate, % (Sinusoidal Drift)	~100	~0	-100

Power law fit results showing predictability of thresholds within environments but failure across environments.

Benchmark	Metric	Baseline	This Paper	Δ
Within-Environment Fit	R^2	0.0	0.97	+0.97
Cross-Environment Fit	R^2	0.89	0.45	-0.44

Collapse Before Awareness (CBA) analysis in Hopper.

Benchmark	Metric	Baseline	This Paper	Δ
Hopper	Survival Gap (steps)	0	-25	-25
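
The Survival Gap metric is simple arithmetic on two step counts. The sketch below spells it out using the approximate Hopper figures quoted earlier in this summary (collapse at about 25 steps, detection at about 50). The helper names and the handling of runs where one event never happens are assumptions.

```python
def survival_gap(collapse_step, detection_step):
    """Time to collapse minus time to detection, in steps.

    Returns None when either event did not occur in the run.
    A negative value means the policy collapsed before the detector fired.
    """
    if collapse_step is None or detection_step is None:
        return None
    return collapse_step - detection_step

def collapse_before_awareness(collapse_step, detection_step):
    """True if the policy collapsed and the detector had not fired by then."""
    if collapse_step is None:
        return False
    return detection_step is None or collapse_step < detection_step

gap = survival_gap(25, 50)  # -25, matching the Hopper row above
```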

Experiment Figures

Sigmoid detection curves for linear vs. sinusoidal drift across intensities

Signal Detection Theory (FPR vs Detection Rate) scatter plot for all detectors

Main Takeaways

A sharp 'Boiling Frog' detection threshold exists across all linear drift conditions; below it, drift is absorbed as normal noise, and above it, detection is rapid.
Sinusoidal drift is invisible because world models minimize prediction error by 'dreaming through' the oscillation, treating it as aleatoric variance.
The detection threshold ε* is not determined by model quality (MSE) but by an interaction of detector sensitivity and environment-specific noise structure (tail heaviness).
Increasing world model capacity (Small -> Large) reduces baseline MSE but does NOT change the detection threshold ε*, as detection is relative to the learned noise floor.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning basics (PPO, policy, observations)
World Models (forward dynamics models, prediction error)
Signal Detection Theory (ROC curves, sensitivity vs. specificity)
Predictive Processing / Free Energy Principle (concept of precision weighting)

Key Terms

World Model: An internal neural network that predicts the next state of the environment given the current state and action

Prediction Error (PE): The difference between the world model's predicted next state and the actual observed next state; used here as the anomaly signal

Drift: A gradual corruption added to observations over time (e.g., slowly increasing velocity bias)

Sinusoidal Blindness: The phenomenon where world models learn to predict periodic noise as part of normal dynamics, making it invisible to anomaly detectors

Collapse Before Awareness (CBA): A failure mode where the agent's policy fails (e.g., robot falls) due to drift before the anomaly detector accumulates enough evidence to flag the issue

Doubt Index (DI): A detector family that tracks the z-score of prediction error using an exponential moving average

PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm used here to train the agent's policy

MuJoCo: A physics engine used for simulating robot control environments (HalfCheetah, Hopper, Walker2d, Ant)

MSE: Mean Squared Error—used to measure the magnitude of prediction error

ROC curve: Receiver Operating Characteristic curve—a plot illustrating the diagnostic ability of a binary classifier system as its discrimination threshold is varied

Wilson score interval: A confidence interval for a binomial proportion (e.g., detection rate), used for statistical reporting

EMA: Exponential Moving Average—a type of temporal smoothing used in the Doubt Index detector