Uncertainty Estimation for Safety-critical Scene Segmentation via Fine-grained Reward Maximization

📝 Paper Summary

Uncertainty Estimation, Medical Image Segmentation, Reinforcement Learning for Vision

FGRM fine-tunes segmentation models using reinforcement learning with a calibration-based reward and a Fisher Information-weighted parameter update scheme to explicitly align model confidence with prediction risk.

Core Problem

Existing uncertainty estimation methods rely on indirect task losses (like Cross-Entropy) rather than explicit uncertainty metrics during training, leading to miscalibrated confidence and unclear priors in safety-critical scenarios.

Why it matters:

In safety-critical domains like robotic surgery, tolerance for prediction risk is extremely low, making reliable uncertainty estimation as crucial as accuracy
Current approaches often lack explicit guidance for calibrating prediction risk, resulting in over-confidence on out-of-distribution data or ambiguous tissue boundaries
Standard reinforcement learning updates are uniform across parameters, which is suboptimal for dense segmentation tasks where parameter importance varies significantly

Concrete Example: In a laparoscopic surgery scene, a standard segmentation model might classify an ambiguous boundary between the liver and fat with high confidence (low uncertainty) because it was trained only on segmentation accuracy (Dice), potentially leading a surgical robot to make a dangerous cut.

Key Novelty

Fine-Grained Reward Maximization (FGRM)

Treats uncertainty estimation as a reward maximization problem where the reward is directly the calibration metric (e.g., negative Expected Calibration Error), rather than a proxy loss
Uses the diagonal of the Fisher Information Matrix to weigh parameter updates, assigning larger updates to parameters that are more important for the model's output distribution

Architecture

Overview of the FGRM framework, illustrating the pre-training and RL fine-tuning phases

Evaluation Highlights

Reduces In-Distribution Expected Calibration Error (ECE) by ~2.1 points (11.74 -> 9.63) compared to state-of-the-art NatPN on the Laparoscopic Cholecystectomy dataset
Improves Out-of-Distribution detection (Pixel Ratio) by +0.29 over NatPN and +0.93 over Deep Ensemble on the LC dataset
Maintains real-time inference with a single forward pass (0.052ms per image, versus 0.201ms for ensemble methods) while outperforming ensemble methods on calibration metrics

Breakthrough Assessment

8/10

Novel application of RL to fine-tune uncertainty calibration directly. The Fisher Information-based update scheme addresses the difficulty of applying RL to dense prediction tasks. Strong empirical results on medical datasets.

⚙️ Technical Details

Problem Definition

Setting: Safety-critical scene segmentation with quantification of aleatoric (data) and epistemic (model) uncertainty

Inputs: Input image x

Outputs: Segmentation prediction ŷ and uncertainty map µ

Pipeline Flow

Pre-trained Segmentation Backbone (TransUNet)
Evidence Layer (Softplus)
Uncertainty Quantification (Aleatoric & Epistemic)
Policy Gradient Update (Training only)

System Modules

Segmentation Backbone

Extract features from input image and predict evidence

Model or implementation: Adapted TransUNet

Evidence Layer

Parameterize the Dirichlet distribution to separate uncertainties

Model or implementation: Softplus activation
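The evidence layer can be illustrated for a single pixel with a minimal sketch. It assumes the common evidential deep learning convention (alpha = evidence + 1, epistemic uncertainty as K divided by the Dirichlet strength, aleatoric uncertainty as the entropy of the expected class probabilities); the paper's exact definitions may differ, and the function names are hypothetical.

```python
import math

def softplus(x: float) -> float:
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dirichlet_from_logits(logits):
    """Map raw per-class outputs for one pixel to Dirichlet parameters."""
    evidence = [softplus(z) for z in logits]  # non-negative evidence per class
    return [e + 1.0 for e in evidence]        # alpha_k = e_k + 1 (assumed convention)

def pixel_uncertainty(alpha):
    """Return expected class probabilities plus two uncertainty scores."""
    strength = sum(alpha)                     # Dirichlet strength S
    probs = [a / strength for a in alpha]     # expected categorical distribution
    # Aleatoric proxy (assumption): entropy of the expected distribution.
    aleatoric = -sum(p * math.log(p) for p in probs if p > 0)
    # Epistemic proxy (assumption): vacuity K / S, equal to 1 with zero evidence.
    epistemic = len(alpha) / strength
    return probs, aleatoric, epistemic
```

With almost no evidence the epistemic score approaches 1, and it shrinks as the evidence for any class grows.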

Novel Architectural Elements

Integration of Fisher Information Matrix directly into the Policy Gradient update rule for fine-grained parameter weighting
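A toy sketch of the idea follows. It assumes an empirical diagonal Fisher estimate (mean of squared per-sample log-likelihood gradients) and a max-normalized weighting; both choices, and the function names, are illustrative assumptions rather than the paper's exact update rule.

```python
def diagonal_fisher(per_sample_grads):
    """Empirical diagonal Fisher: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]

def fisher_weighted_step(params, policy_grad, fisher_diag, lr=1e-4, eps=1e-12):
    """Gradient-ascent step scaled per parameter by normalized Fisher information."""
    top = max(fisher_diag) + eps
    weights = [f / top for f in fisher_diag]  # in [0, 1]; normalization is an assumption
    # Ascent (plus sign) because the objective is a reward to be maximized.
    return [p + lr * w * g for p, w, g in zip(params, weights, policy_grad)]
```

A uniform update would correspond to setting every weight to 1; here a parameter with low Fisher information moves proportionally less.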

Modeling

Base Model: TransUNet (adapted)

Training Method: Reinforcement Learning (Policy Gradient with Fisher Information weighting)

Objective Functions:

Purpose: Calibrate uncertainty prediction by maximizing a calibration metric.

Formally: J(ϕ) = E[R(µ, ŷ, y) - β * KL(π_ϕ || π_θ)]

Purpose: Pre-training objective to learn evidence.

Formally: Bayes maximum likelihood loss (Eq. 5) integrating out the categorical probabilities
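For reference, the fine-tuning objective can be typeset as below, together with the standard score-function (log-derivative) identity that policy gradient methods use to differentiate an expected reward. The reading of π_ϕ as the model being fine-tuned, π_θ as the frozen pre-trained model, and the expectation as taken over outputs sampled from π_ϕ is an assumption based on the usual KL-regularized setup; the second line is a general identity, not a formula taken from the paper.

```latex
\begin{aligned}
J(\phi) &= \mathbb{E}\big[\, R(\mu, \hat{y}, y) - \beta \, \mathrm{KL}(\pi_\phi \,\|\, \pi_\theta) \,\big] \\
\nabla_\phi \, \mathbb{E}_{(\mu, \hat{y}) \sim \pi_\phi}\big[ R(\mu, \hat{y}, y) \big]
  &= \mathbb{E}_{(\mu, \hat{y}) \sim \pi_\phi}\big[ R(\mu, \hat{y}, y) \, \nabla_\phi \log \pi_\phi(\mu, \hat{y} \mid x) \big]
\end{aligned}
```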

Adaptation: Fine-tuning of the pre-trained backbone parameters

Trainable Parameters: All backbone parameters (fine-grained update)

Training Data:

CholecSeg8K: 80% train, 20% val, separate 20% held-out test split
ESD dataset: 80% train, 20% val, separate 20% held-out test split

Key Hyperparameters:

learning_rate: 1e-4
batch_size: 4
pre_train_epochs: 10
kl_penalty_beta: Controls strength of KL constraint (analyzed in ablation)
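To make the role of kl_penalty_beta concrete, here is a minimal sketch of a KL-penalized reward for a single categorical prediction. The function names and the example values are hypothetical; the paper's reward and policy are defined over full segmentation outputs.

```python
import math

def kl_categorical(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions given as probability lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def penalized_objective(reward, policy_probs, reference_probs, beta):
    """Reward minus beta times the KL divergence from the reference policy."""
    return reward - beta * kl_categorical(policy_probs, reference_probs)

# Hypothetical example: the same reward under increasing penalty strength.
for beta in (0.0, 0.1, 1.0):
    print(beta, penalized_objective(-0.1, [0.9, 0.1], [0.5, 0.5], beta))
```

With beta = 0 the objective is the raw reward; larger values increasingly penalize drift away from the reference distribution.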

Compute: Inference time of 0.052ms per image (real-time, single forward pass)

Comparison to Prior Work

vs. Deep Ensemble: FGRM requires only one forward pass (faster) and optimizes calibration directly via RL
vs. NatPN: FGRM achieves better calibration (lower ECE) by explicitly maximizing the calibration reward rather than relying solely on the ELBO/evidence loss
vs. ConfidNet: FGRM does not require an auxiliary branch network
vs. Standard RL for Vision [not cited in paper]: FGRM introduces Fisher-weighted updates instead of uniform updates used in typical RL-based vision tuning

Limitations

Different reward functions are currently designed for In-Distribution and Out-of-Distribution scenarios separately; a unified reward is future work
Requires calculation of the Fisher Information Matrix (first-order approximation), which adds computational overhead during training
Evaluation is limited to segmentation tasks in the medical domain

Reproducibility

Code: https://github.com/med-air/FGRM

Datasets: CholecSeg8K is public; the ESD dataset was collected by the authors (availability unclear). Hyperparameters and algorithms are detailed in the paper.

📊 Experiments & Results

Evaluation Setup

Scene segmentation on surgical video frames

Benchmarks:

Laparoscopic Cholecystectomy (LC) (Multi-class soft tissue segmentation)
Endoscopic Submucosal Dissection (ESD) (Multi-class tissue segmentation) [New]

Metrics:

Expected Calibration Error (ECE)
Uncertainty-error Mutual Information (MI)
Dice Score
Pixel Ratio (PR) for OOD
Box Ratio (BR) for OOD
Statistical methodology: Results reported as mean ± std of three independent runs
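As an illustration of the primary calibration metric, the following is a minimal sketch of binned ECE over per-pixel confidences, with its negative used as a reward signal. The equal-width binning, the bin count of 10, and the function names are assumptions; the paper may use a different binning, and its tables appear to report ECE on a percentage scale.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin fraction) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece  # a fraction in [0, 1]; multiply by 100 for a percentage

def calibration_reward(confidences, correct):
    """Reward that is higher when the model is better calibrated."""
    return -expected_calibration_error(confidences, correct)
```

For example, four predictions at confidence 0.9 of which three are correct give an accuracy of 0.75 in that bin and an ECE of 0.15.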

Key Results

Comparative analysis on the LC dataset showing FGRM outperforms baselines in both calibration (ID) and OOD detection.

Benchmark	Metric	Baseline	This Paper	Δ
Laparoscopic Cholecystectomy	ECE (In-distribution)	11.74	9.63	-2.11
Laparoscopic Cholecystectomy	Dice (In-distribution)	74.16	74.88	+0.72
Laparoscopic Cholecystectomy	Pixel Ratio (OOD)	1.56	1.85	+0.29

Comparative analysis on the ESD dataset confirming consistent improvements across different surgical scenes.

Benchmark	Metric	Baseline	This Paper	Δ
Endoscopic Submucosal Dissection	ECE (In-distribution)	13.67	10.42	-3.25
Endoscopic Submucosal Dissection	Pixel Ratio (OOD)	1.45	1.78	+0.33
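The Δ column and the corresponding relative changes can be recomputed from the reported values with a short script. The numbers are copied from the table; the relative-change figures are derived here and are not reported in the summary.

```python
# (benchmark, metric, baseline, this_paper) copied from the results table.
rows = [
    ("LC", "ECE", 11.74, 9.63),
    ("LC", "Dice", 74.16, 74.88),
    ("LC", "Pixel Ratio", 1.56, 1.85),
    ("ESD", "ECE", 13.67, 10.42),
    ("ESD", "Pixel Ratio", 1.45, 1.78),
]

def delta(baseline, ours):
    """Absolute change, matching the table's delta column."""
    return round(ours - baseline, 2)

def relative_change_pct(baseline, ours):
    """Change relative to the baseline, in percent."""
    return 100.0 * (ours - baseline) / baseline

for bench, metric, base, ours in rows:
    print(f"{bench:4s} {metric:12s} delta={delta(base, ours):+.2f} "
          f"relative={relative_change_pct(base, ours):+.1f}%")
```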

Experiment Figures

Ablation study of components and sensitivity to KL penalty strength

Scatter plot of uncertainty vs. correctness and training progression

Main Takeaways

FGRM consistently outperforms state-of-the-art methods (Deep Ensemble, NatPN) across all calibration and OOD metrics on both datasets
The fine-grained update scheme makes the method robust to different KL penalty strengths compared to uniform updates
Reward maximization explicitly correlates estimated uncertainty with prediction correctness (high uncertainty for incorrect predictions)
The method achieves these gains with a single forward pass, making it significantly faster than ensemble-based methods (0.052ms vs 0.201ms)

📚 Prerequisite Knowledge

Prerequisites

Evidential Deep Learning (EDL)
Reinforcement Learning (Policy Gradient)
Fisher Information Matrix
Bayesian Deep Learning concepts (Aleatoric vs Epistemic uncertainty)

Key Terms

ECE: Expected Calibration Error, a metric measuring the gap between the model's predicted confidence and its actual accuracy, typically averaged over confidence bins

Fisher Information Matrix: A matrix measuring the amount of information that an observable random variable carries about an unknown parameter; used here to quantify parameter importance

Aleatoric Uncertainty: Uncertainty arising from inherent noise or ambiguity in the data (e.g., blurred tissue boundaries)

Epistemic Uncertainty: Uncertainty arising from the model's lack of knowledge, typically due to limited or unrepresentative training data (e.g., seeing a new organ type not in the training set)

Dirichlet Distribution: A probability distribution over categorical probability vectors; it is the conjugate prior of the categorical distribution and is used in Evidential Learning to model class probabilities

OOD: Out-of-Distribution, referring to data samples that deviate significantly from the training data distribution

Policy Gradient: A reinforcement learning technique that optimizes the policy parameters by ascending the gradient of the expected reward