Yinan Zheng, Jianxiong Li, Dongjie Yu, Yujie Yang, S. Li, Xianyuan Zhan, Jingjing Liu
Institute for AI Industry Research (AIR), Tsinghua University,
Department of Computer Science, The University of Hong Kong,
Shanghai Artificial Intelligence Laboratory
International Conference on Learning Representations (ICLR), 2024
FISOR enforces hard safety constraints in offline reinforcement learning by using reachability analysis to decouple reward maximization in safe regions from safety recovery in unsafe regions, training a diffusion policy via weighted regression.
Core Problem
Existing safe offline RL methods use soft constraints (limiting average cost), which allows for occasional catastrophic failures, and they struggle to balance the conflicting goals of reward maximization, safety, and behavior regularization.
Why it matters:
Soft constraints are unacceptable in safety-critical domains like industrial control and autonomous driving, where even a single violation can be disastrous
Jointly optimizing coupled objectives for safety and reward leads to unstable training and suboptimal policies in offline settings
Concrete Example: In an autonomous driving scenario, a soft-constraint method might allow the car to drive on the sidewalk 1% of the time to maintain a high average speed. FISOR instead identifies the sidewalk as an 'infeasible region' and strictly prioritizes steering back to the road (minimizing safety risk) over speed, optimizing speed only when the car is safely on the road.
Key Novelty
FeasIbility-guided Safe Offline RL (FISOR)
Replaces soft constraints with Hamilton-Jacobi Reachability analysis to explicitly map out 'feasible regions' (states where safety is recoverable) using the offline dataset
Decouples the learning objective: maximizes rewards only within feasible regions, while minimizing safety violation risks in infeasible regions
Extracts the optimal policy using a diffusion model trained with a specific weighted regression loss, which is mathematically equivalent to energy-guided sampling but avoids training complex time-dependent classifiers
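The decoupled objective above can be sketched as a single per-sample regression weight. This is an illustrative approximation, not the paper's exact weighting: `fisor_weight`, the temperature, and the specific advantage inputs are hypothetical stand-ins.

```python
import numpy as np

def fisor_weight(v_h, adv_r, adv_h, temp=3.0):
    """Feasibility-decoupled regression weight (illustrative sketch).

    v_h   : feasible value V_h(s); <= 0 means s lies in the feasible region
    adv_r : reward advantage of (s, a)
    adv_h : safety advantage of (s, a) (higher = riskier)
    """
    feasible = v_h <= 0.0
    # Feasible states: up-weight actions with high reward advantage.
    # Infeasible states: up-weight actions that reduce safety risk.
    w = np.where(feasible, np.exp(temp * adv_r), np.exp(-temp * adv_h))
    return np.clip(w, 0.0, 100.0)  # clip for numerical stability
```

The key design point is that reward and safety never compete inside a single weight: each state is handled by exactly one of the two branches.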
Architecture
Conceptual illustration of the Feasibility-Guided optimization strategy.
Evaluation Highlights
Guarantees safety satisfaction (zero constraint violations) in all evaluated tasks on the DSRL benchmark
Achieves top returns in most tasks compared to baselines like CPQ and RCRL
Demonstrates versatility by outperforming baselines in safe offline imitation learning contexts
Breakthrough Assessment
8/10
Addresses a critical flaw in safe RL (soft vs. hard constraints) with a theoretically grounded reachability approach. The decoupling of objectives and use of diffusion for policy extraction is a significant methodological advance.
⚙️ Technical Details
Problem Definition
Setting: Constrained Markov Decision Process (CMDP) in a fully offline setting
Inputs: Offline dataset D containing tuples (s, a, s', r, c) with mixed safe and unsafe trajectories
Outputs: A policy π that maximizes cumulative reward while strictly satisfying hard safety constraints (state-wise zero violation)
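In this setting the hard-constraint objective can be summarized as follows, where h is a stand-in for the state-wise constraint function (h(s) ≤ 0 meaning state s is safe); the notation is a hedged paraphrase, not the paper's exact formulation:

```latex
\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{s.t.} \quad h(s_t) \le 0 \;\; \text{for all } t
```

Unlike the soft-constraint CMDP objective, the constraint here must hold at every timestep rather than only in expectation.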
Pipeline Flow
Feasible Value Learning (Offline Training)
Reward Value Learning (Offline Training)
Policy Extraction (Offline Training)
Diffusion Sampling (Inference)
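The inference step can be illustrated with a minimal DDPM-style reverse-diffusion sampler. The denoiser interface, the linear noise schedule, and the step count below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(denoiser, state, act_dim, T=10):
    """Reverse-diffusion action sampling (simplified DDPM-style sketch).

    Starts from Gaussian noise and iteratively denoises it using a
    hypothetical noise-prediction network `denoiser(state, a_t, t)`.
    """
    betas = np.linspace(1e-4, 0.2, T)       # toy linear schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    a = rng.standard_normal(act_dim)        # start from pure noise
    for t in reversed(range(T)):
        eps_hat = denoiser(state, a, t)
        # Standard DDPM posterior-mean update for one denoising step
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            a += np.sqrt(betas[t]) * rng.standard_normal(act_dim)
    return a
```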
System Modules
Feasible Value Function (Vh)
Estimates the 'feasibility' of states (how close they are to being unsafe) using reversed expectile regression to approximate the largest feasible region
Model or implementation: Neural Network (Value Function)
Policy Network (Diffusion)
Generates actions by denoising random noise, guided by weights derived from feasibility and reward values
Model or implementation: Diffusion Model (e.g., MLP-based score network)
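The training side of this module can be sketched as a weighted denoising-regression loss: noise a dataset action, ask the network to predict the noise, and scale each sample's error by its feasibility/reward weight. The noise schedule and `denoiser` interface here are hypothetical simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_denoising_loss(denoiser, states, actions, weights, T=10):
    """Weighted denoising regression (illustrative NumPy-only sketch).

    For each (s, a) pair: sample a diffusion step t, corrupt the dataset
    action with Gaussian noise, and regress the predicted noise back to
    the true noise, scaling each squared error by its regression weight.
    """
    t = rng.integers(1, T + 1, size=len(actions))
    alpha_bar = np.exp(-0.1 * t)[:, None]      # toy noise schedule
    eps = rng.standard_normal(actions.shape)
    noisy = np.sqrt(alpha_bar) * actions + np.sqrt(1.0 - alpha_bar) * eps
    pred = denoiser(states, noisy, t)
    per_sample = np.sum((pred - eps) ** 2, axis=1)
    return float(np.mean(weights * per_sample))
```

Because the weights multiply a plain regression loss, no time-dependent classifier or guidance network is needed at training time, which is the point of the weighted-regression formulation.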
Novel Architectural Elements
Integration of HJ Reachability value functions directly into the weighting mechanism of a diffusion policy
Use of 'reversed expectile regression' to estimate minimum feasible values without behavioral modeling
Modeling
Base Model: Diffusion Model (Score-based)
Training Method: Offline Reinforcement Learning via Weighted Regression
Objective Functions:
Feasible value loss
Purpose: Learn the feasible region boundary (minimum feasible value).
Formally: Reversed expectile regression loss L_rev(u) = |τ - I(u > 0)| u²
Policy extraction loss
Purpose: Train the policy, via weighted regression, to mimic optimal behavior consistent with feasibility.
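The reversed expectile loss is short enough to write out directly; the indicator convention follows the formula above, and the choice of τ is illustrative.

```python
import numpy as np

def reversed_expectile_loss(u, tau=0.9):
    """L_rev(u) = |tau - I(u > 0)| * u^2, averaged over residuals u.

    Relative to standard expectile regression (|tau - I(u < 0)| u^2),
    the indicator is flipped, which pushes the fitted value toward the
    opposite extreme of the target distribution.
    """
    weight = np.abs(tau - (u > 0).astype(float))
    return float(np.mean(weight * u ** 2))
```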
Code is publicly available on the project website; theoretical proofs are provided in the paper's appendix; experiments use the DSRL benchmark.
📊 Experiments & Results
Evaluation Setup
Safe Offline RL tasks
Benchmarks:
DSRL Benchmark: continuous control with safety constraints (e.g., SafetyGym tasks)
Metrics:
Normalized Reward
Constraint Violation Rate (Cost)
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
FISOR is the only evaluated method that achieves zero constraint violations across all tasks, validating the effectiveness of the hard constraint formulation via reachability analysis.
Despite the strict safety enforcement, FISOR achieves the highest rewards in most tasks, suggesting that identifying the largest feasible region allows the policy to act aggressively where safe.
The decoupled training process stabilizes learning compared to coupled objectives like RCRL.
Qualitative results show agents in infeasible regions successfully navigating back to feasible regions before pursuing the goal.
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning (MDPs, Value Functions)
Constrained MDPs (CMDP)
Diffusion Probabilistic Models
Hamilton-Jacobi (HJ) Reachability Analysis
Key Terms
Soft constraint: A safety requirement that only limits the expected (average) cost below a threshold, allowing for occasional violations
Hard constraint: A strict safety requirement demanding zero constraint violations at every state and timestep
HJ Reachability: Hamilton-Jacobi Reachability—a method from control theory to determine the set of states (feasible region) from which a system can be controlled to stay safe indefinitely
Feasible Region: The set of states where there exists at least one policy that can satisfy the safety constraints
Diffusion Model: A generative model that learns to produce data (actions) by reversing a gradual noise-adding process
Expectile Regression: A statistical technique used here to estimate the extreme (minimum/maximum) values of a distribution without explicit policy sampling, used to find the boundary of feasible regions
Weighted Behavior Cloning: A supervised learning approach where the policy copies the dataset's actions but with weights assigned to prioritize better/safer actions