Evaluation Setup
Setting: constrained reinforcement learning in continuous control and autonomous driving simulators
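For context, the standard constrained-MDP (CMDP) objective underlying this setting (generic notation, not taken from the paper): maximize expected return subject to a budget d on expected cumulative cost, which PPO-Lagrangian relaxes with a multiplier λ:

```latex
\max_{\pi}\; \mathbb{E}_{\pi}\!\Big[\sum_{t}\gamma^{t} r_{t}\Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\Big[\sum_{t}\gamma^{t} c_{t}\Big] \le d,
\qquad
\mathcal{L}(\pi,\lambda) = \mathbb{E}_{\pi}\!\Big[\sum_{t}\gamma^{t} r_{t}\Big]
- \lambda\Big(\mathbb{E}_{\pi}\!\Big[\sum_{t}\gamma^{t} c_{t}\Big] - d\Big)
```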
Benchmarks:
- ZonesEnv (Safety Gymnasium): navigation with logical regions
- CARLA (Town02): autonomous driving simulation
Metrics:
- Violation Rate (VR)
- Episodic Cost
- Route Completion Rate (RCR)
- Total Distance
- Statistical methodology: experiments are run with 3 random seeds; means are reported.
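For concreteness, a minimal sketch of how these metrics could be aggregated over episodes (the episode-record fields and the helper below are hypothetical illustrations, not the paper's code):

```python
from statistics import mean

def aggregate_metrics(episodes):
    """Aggregate safety/progress metrics over a list of episode records.

    Each record is assumed (hypothetically) to carry:
      violations - number of safety-constraint violations in the episode
      cost       - cumulative constraint cost signal
      completed  - fraction of the assigned route that was finished
      distance   - total distance travelled
    """
    return {
        # Violation Rate (VR): fraction of episodes with at least one violation
        "violation_rate": mean(float(ep["violations"] > 0) for ep in episodes),
        # Episodic Cost: mean cumulative cost per episode
        "episodic_cost": mean(ep["cost"] for ep in episodes),
        # Route Completion Rate (RCR): mean fraction of route completed
        "route_completion_rate": mean(ep["completed"] for ep in episodes),
        # Total Distance: mean distance covered per episode
        "total_distance": mean(ep["distance"] for ep in episodes),
    }
```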
Key Results
ZonesEnv experiments demonstrate PPO-LTL's ability to maintain low violation rates compared to baselines that either ignore temporal rules or rely on brittle shielding.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ZonesEnv | Hit-Wall Rate | 12.0% | 4.3% | -7.7 pp |
| ZonesEnv | Violation Cost | 56.98 | Not reported in the paper | n/a |
CARLA experiments show PPO-LTL achieves the best balance of safety and task progress, avoiding the reckless crashing of Shielding and the freezing behavior of other Safe RL baselines.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| CARLA | Collision Rate | 0.260 | 0.143 | -0.117 |
| CARLA | Collisions (Total Count) | 164.3 | Low (implied) | Not reported in the paper |
| CARLA | Route Completion Rate | 0.072 | 0.236 | +0.164 |
Main Takeaways
- PPO-LTL consistently reduces safety violations while maintaining task performance, unlike PPO-Mask (which deadlocks) or PPO-Shielding (which crashes recklessly)
- Standard PPO-Lagrangian fails in these tasks because it lacks the 'memory' (the LDBA state) needed to track temporal constraints such as 'A then B'; see the product-construction sketch after this list
- Ablation studies confirm that careful balancing of LTL constraints is necessary; overly relaxed bounds lead to reckless driving
- The method incurs negligible computational overhead compared to standard PPO
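To illustrate the 'memory' point above: a common way to give a memoryless policy temporal context is a product construction that appends the current LDBA state to the observation. The wrapper below is a minimal sketch in Gymnasium style; the LDBA interface (`initial_state`, `num_states`, `step`) and the labeling function are assumed placeholders, not the paper's API.

```python
import numpy as np
import gymnasium as gym

class LDBAProductWrapper(gym.Wrapper):
    """Augment observations with a one-hot encoding of the LDBA state.

    The LDBA tracks progress through an LTL formula (e.g. 'reach A, then B'),
    giving a memoryless policy the temporal context plain PPO-Lagrangian lacks.
    For real training, observation_space would also need to be extended.
    """

    def __init__(self, env, ldba, label_fn):
        super().__init__(env)
        self.ldba = ldba          # hypothetical automaton: .initial_state, .num_states, .step(q, label)
        self.label_fn = label_fn  # maps a raw observation to its atomic propositions
        self.q = ldba.initial_state

    def _augment(self, obs):
        # Concatenate the raw observation with a one-hot LDBA-state vector.
        one_hot = np.zeros(self.ldba.num_states, dtype=np.float32)
        one_hot[self.q] = 1.0
        return np.concatenate([np.asarray(obs, dtype=np.float32), one_hot])

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.q = self.ldba.initial_state
        return self._augment(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Advance the automaton on the propositions true in the new observation.
        self.q = self.ldba.step(self.q, self.label_fn(obs))
        return self._augment(obs), reward, terminated, truncated, info
```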