
CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions

Lizhi Yang, Blake Werner, Massimiliano de Sa, Aaron D. Ames
California Institute of Technology
arXiv (2025)
RL

📝 Paper Summary

Safe Reinforcement Learning · Robotic Locomotion
CBF-RL trains policies to be inherently safe by applying active CBF safety filtering only during training and reinforcing safe behavior through barrier-based rewards, eliminating the need for runtime filters.
Core Problem
RL policies often prioritize performance over safety, leading to catastrophic failures, while traditional safety filters (CBFs) prune exploration too aggressively and require expensive optimization at runtime.
Why it matters:
  • Humanoid robots are expensive and operate in complex environments; unsafe actions cause physical damage.
  • Existing methods either rely on runtime filters (computationally heavy, conservative) or reward shaping alone (insufficient for strict safety).
  • Policies trained with runtime filters often fail to internalize safety constraints, remaining dependent on the filter.
Concrete Example: A humanoid robot navigating stairs might propose an unstable footstep. A standard RL policy would execute it and fall. A runtime filter would correct it but burden the onboard computer. CBF-RL trains the policy to never propose the unstable step in the first place.
Key Novelty
Dual Approach: Training-Time Filtering + Barrier-Based Reward Shaping
  • Apply a closed-form Control Barrier Function (CBF) filter only during training to actively correct unsafe actions before execution.
  • Simultaneously penalize the agent via a barrier-inspired reward based on the filter's activation and the state's distance to the safety boundary.
  • The policy receives 'corrective supervision'—learning from the filtered action and the penalty—so it internalizes safety and acts safely at deployment without a filter.
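The training loop described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it assumes a single-integrator system (ẋ = u) with one circular obstacle, for which the CBF quadratic program has a closed-form solution; the gains, function names, and reward weights are all invented for this example.

```python
import numpy as np

# Illustrative constants (not from the paper).
ALPHA = 1.0           # class-K gain in the CBF condition: dh/dt >= -ALPHA * h
PENALTY_WEIGHT = 0.5  # weight on the barrier-based reward penalty

def barrier(x, center, radius):
    """h(x) > 0 outside the obstacle, h(x) = 0 on its boundary."""
    return np.dot(x - center, x - center) - radius**2

def cbf_filter(x, u_nom, center, radius):
    """Closed-form CBF-QP solution for a single-integrator system:

        min ||u - u_nom||^2   s.t.   grad_h(x) . u >= -ALPHA * h(x)
    """
    h = barrier(x, center, radius)
    grad_h = 2.0 * (x - center)               # dh/dx for the circular barrier
    violation = -ALPHA * h - grad_h @ u_nom   # > 0 iff u_nom breaks the constraint
    if violation <= 0.0:
        return u_nom                          # nominal action is already safe
    # Project u_nom onto the constraint half-space (closed-form QP solution).
    return u_nom + violation * grad_h / (grad_h @ grad_h)

def shaped_reward(task_reward, u_nom, u_safe):
    """Penalize filter activation so the policy internalizes safety."""
    correction = np.linalg.norm(u_safe - u_nom)
    return task_reward - PENALTY_WEIGHT * correction

# One training step: filter the policy's proposed action, execute the safe
# one, and shape the reward by how much correction was needed.
x = np.array([1.5, 0.0])
u_nom = np.array([-1.0, 0.0])  # policy drives straight at the obstacle
u_safe = cbf_filter(x, u_nom, center=np.zeros(2), radius=1.0)
r = shaped_reward(1.0, u_nom, u_safe)
```

At deployment the `cbf_filter` call is simply dropped: because the penalty made corrections costly during training, the learned policy proposes actions that already satisfy the barrier condition.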
Evaluation Highlights
  • 100% success rate in 2D navigation tasks with randomized dynamics, compared to 0% for nominal PPO and 55% for filter-only training.
  • Zero-shot transfer to a physical Unitree G1 humanoid robot, successfully navigating obstacles and climbing stairs where nominal policies failed.
  • Achieved robustness to 20% dynamics noise in navigation tasks without explicit robust training, outperforming baselines.
Breakthrough Assessment
8/10
Strong theoretical grounding (a continuous-to-discrete proof) enables a practical solution for high-dimensional robots. Successfully demonstrates filter-free safety on hardware, addressing a major bottleneck in safe RL.