Yinan Zheng, Jianxiong Li, Dongjie Yu, Yujie Yang, S. Li, Xianyuan Zhan, Jingjing Liu
Institute for AI Industry Research (AIR), Tsinghua University,
Department of Computer Science, The University of Hong Kong,
Shanghai Artificial Intelligence Laboratory
International Conference on Learning Representations (ICLR), 2024
FISOR enforces hard safety constraints in offline reinforcement learning by using reachability analysis to decouple reward maximization in safe regions from safety recovery in unsafe regions, training a diffusion policy via weighted regression.
Core Problem
Existing safe offline RL methods use soft constraints (limiting average cost), which allows for occasional catastrophic failures, and they struggle to balance the conflicting goals of reward maximization, safety, and behavior regularization.
Why it matters:
Soft constraints are unacceptable in safety-critical domains like industrial control and autonomous driving, where even a single violation can be disastrous
Jointly optimizing coupled objectives for safety and reward leads to unstable training and suboptimal policies in offline settings
Concrete Example: In an autonomous driving scenario, a soft-constraint method might allow the car to drive on the sidewalk 1% of the time to maintain a high average speed. FISOR instead identifies the sidewalk as an 'infeasible region' and strictly prioritizes steering back to the road (minimizing safety risk) over speed, optimizing speed only when the car is safely on the road.
Key Novelty
FeasIbility-guided Safe Offline RL (FISOR)
Replaces soft constraints with Hamilton-Jacobi Reachability analysis to explicitly map out 'feasible regions' (states where safety is recoverable) using the offline dataset
Decouples the learning objective: maximizes rewards only within feasible regions, while minimizing safety violation risks in infeasible regions
Extracts the optimal policy using a diffusion model trained with a specific weighted regression loss, which is mathematically equivalent to energy-guided sampling but avoids training complex time-dependent classifiers
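The decoupled objective above can be sketched as a single per-sample regression weight. This is an illustrative approximation, not the paper's exact weighting: `fisor_weight`, the temperature, and the specific advantage inputs are hypothetical stand-ins.

```python
import numpy as np

def fisor_weight(v_h, adv_r, adv_h, temp=3.0):
    """Feasibility-decoupled regression weight (illustrative sketch).

    v_h   : feasible value V_h(s); <= 0 means s lies in the feasible region
    adv_r : reward advantage of (s, a)
    adv_h : safety advantage of (s, a) (higher = riskier)
    """
    feasible = v_h <= 0.0
    # Feasible states: up-weight actions with high reward advantage.
    # Infeasible states: up-weight actions that reduce safety risk.
    w = np.where(feasible, np.exp(temp * adv_r), np.exp(-temp * adv_h))
    return np.clip(w, 0.0, 100.0)  # clip for numerical stability
```

The key design point is that reward and safety never compete inside a single weight: each state is handled by exactly one of the two branches.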
Architecture
Conceptual illustration of the Feasibility-Guided optimization strategy.
Evaluation Highlights
Guarantees safety satisfaction (zero constraint violations) in all evaluated tasks on the DSRL benchmark
Achieves top returns in most tasks compared to baselines like CPQ and RCRL
Demonstrates versatility by outperforming baselines in safe offline imitation learning contexts
Breakthrough Assessment
8/10
Addresses a critical flaw in safe RL (soft vs. hard constraints) with a theoretically grounded reachability approach. The decoupling of objectives and use of diffusion for policy extraction is a significant methodological advance.
⚙️ Technical Details
Problem Definition
Setting: Constrained Markov Decision Process (CMDP) in a fully offline setting
Inputs: Offline dataset D containing tuples (s, a, s', r, c) with mixed safe and unsafe trajectories
Outputs: A policy π that maximizes cumulative reward while strictly satisfying hard safety constraints (state-wise zero violation)
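In this setting the hard-constraint objective can be summarized as follows, where h is a stand-in for the state-wise constraint function (h(s) ≤ 0 meaning state s is safe); the notation is a hedged paraphrase, not the paper's exact formulation:

```latex
\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{s.t.} \quad h(s_t) \le 0 \;\; \text{for all } t
```

Unlike the soft-constraint CMDP objective, the constraint here must hold at every timestep rather than only in expectation.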
Pipeline Flow
Feasible Value Learning (Offline Training)
Reward Value Learning (Offline Training)
Policy Extraction (Offline Training)
Diffusion Sampling (Inference)
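The inference step can be illustrated with a minimal DDPM-style reverse-diffusion sampler. The denoiser interface, the linear noise schedule, and the step count below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(denoiser, state, act_dim, T=10):
    """Reverse-diffusion action sampling (simplified DDPM-style sketch).

    Starts from Gaussian noise and iteratively denoises it using a
    hypothetical noise-prediction network `denoiser(state, a_t, t)`.
    """
    betas = np.linspace(1e-4, 0.2, T)       # toy linear schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    a = rng.standard_normal(act_dim)        # start from pure noise
    for t in reversed(range(T)):
        eps_hat = denoiser(state, a, t)
        # Standard DDPM posterior-mean update for one denoising step
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            a += np.sqrt(betas[t]) * rng.standard_normal(act_dim)
    return a
```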
System Modules
Feasible Value Function (Vh)
Estimates the 'feasibility' of states (how close they are to being unsafe) using reversed expectile regression to approximate the largest feasible region
Model or implementation: Neural Network (Value Function)
Policy Network (Diffusion)
Generates actions by denoising random noise, guided by weights derived from feasibility and reward values
Model or implementation: Diffusion Model (e.g., MLP-based score network)
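The training side of this module can be sketched as a weighted denoising-regression loss: noise a dataset action, ask the network to predict the noise, and scale each sample's error by its feasibility/reward weight. The noise schedule and `denoiser` interface here are hypothetical simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_denoising_loss(denoiser, states, actions, weights, T=10):
    """Weighted denoising regression (illustrative NumPy-only sketch).

    For each (s, a) pair: sample a diffusion step t, corrupt the dataset
    action with Gaussian noise, and regress the predicted noise back to
    the true noise, scaling each squared error by its regression weight.
    """
    t = rng.integers(1, T + 1, size=len(actions))
    alpha_bar = np.exp(-0.1 * t)[:, None]      # toy noise schedule
    eps = rng.standard_normal(actions.shape)
    noisy = np.sqrt(alpha_bar) * actions + np.sqrt(1.0 - alpha_bar) * eps
    pred = denoiser(states, noisy, t)
    per_sample = np.sum((pred - eps) ** 2, axis=1)
    return float(np.mean(weights * per_sample))
```

Because the weights multiply a plain regression loss, no time-dependent classifier or guidance network is needed at training time, which is the point of the weighted-regression formulation.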
Novel Architectural Elements
Integration of HJ Reachability value functions directly into the weighting mechanism of a diffusion policy
Use of 'reversed expectile regression' to estimate minimum feasible values without behavioral modeling
Modeling
Base Model: Diffusion Model (Score-based)
Training Method: Offline Reinforcement Learning via Weighted Regression
Objective Functions:
Feasible value loss
Purpose: Learn the feasible region boundary (minimum feasible value).
Formally: Reversed expectile regression loss L_rev(u) = |τ - I(u > 0)| u²
Policy extraction loss
Purpose: Train the policy, via weighted regression, to mimic optimal behavior consistent with feasibility.
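The reversed expectile loss is short enough to write out directly; the indicator convention follows the formula above, and the choice of τ is illustrative.

```python
import numpy as np

def reversed_expectile_loss(u, tau=0.9):
    """L_rev(u) = |tau - I(u > 0)| * u^2, averaged over residuals u.

    Relative to standard expectile regression (|tau - I(u < 0)| u^2),
    the indicator is flipped, which pushes the fitted value toward the
    opposite extreme of the target distribution.
    """
    weight = np.abs(tau - (u > 0).astype(float))
    return float(np.mean(weight * u ** 2))
```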
Code is publicly available on the project website; theoretical proofs are provided in the paper's appendix; experiments use the DSRL benchmark.
📊 Experiments & Results
Evaluation Setup
Safe Offline RL tasks
Benchmarks:
DSRL Benchmark: continuous control with safety constraints (e.g., SafetyGym tasks)
Metrics:
Normalized Reward
Constraint Violation Rate (Cost)
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
FISOR is the only evaluated method that achieves zero constraint violations across all tasks, validating the effectiveness of the hard constraint formulation via reachability analysis.
Despite the strict safety enforcement, FISOR achieves the highest rewards in most tasks, suggesting that identifying the largest feasible region allows the policy to act aggressively where safe.
The decoupled training process stabilizes learning compared to coupled objectives like RCRL.
Qualitative results show agents in infeasible regions successfully navigating back to feasible regions before pursuing the goal.
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning (MDPs, Value Functions)
Constrained MDPs (CMDP)
Diffusion Probabilistic Models
Hamilton-Jacobi (HJ) Reachability Analysis
Key Terms
Soft constraint: A safety requirement that only limits the expected (average) cost below a threshold, allowing for occasional violations
Hard constraint: A strict safety requirement demanding zero constraint violations at every state and timestep
HJ Reachability: Hamilton-Jacobi Reachability—a method from control theory to determine the set of states (feasible region) from which a system can be controlled to stay safe indefinitely
Feasible Region: The set of states where there exists at least one policy that can satisfy the safety constraints
Diffusion Model: A generative model that learns to produce data (actions) by reversing a gradual noise-adding process
Expectile Regression: A statistical technique used here to estimate the extreme (minimum/maximum) values of a distribution without explicit policy sampling, used to find the boundary of feasible regions
Weighted Behavior Cloning: A supervised learning approach where the policy copies the dataset's actions but with weights assigned to prioritize better/safer actions