Integrating LTL Constraints into PPO for Safe Reinforcement Learning

Maifang Zhang, Hang Yu, Qian Zuo, Cheng Wang, Vaishak Belle, Fengxiang He
arXiv (2026)
RL Reasoning Benchmark

📝 Paper Summary

Safe Reinforcement Learning · Formal Verification in RL
PPO-LTL integrates linear temporal logic (LTL) constraints into reinforcement learning by converting formal rules into cost signals via automaton monitors, then optimizing the policy with a Lagrangian scheme.
Core Problem
Standard safe RL methods require constraints to be expressed as simple analytic inequalities (e.g., math formulas based on immediate state), which cannot capture complex temporal regulations found in robotics.
Why it matters:
  • Real-world regulations (like the British Highway Code) involve sequences of events, not just static thresholds, making them incompatible with standard constrained optimization
  • Existing methods like Shielding are too conservative or restrict exploration, while standard PPO-Lagrangian lacks the memory to handle temporal dependencies
Concrete Example: A traffic rule requires a car to 'stop at a red light until it turns green.' A standard constrained RL agent, lacking a model of the 'until' temporal dependency, may optimize only for immediate speed. In contrast, PPO-LTL uses an automaton to track the sequence: if the car moves while the light is still red (before green), the monitor triggers a specific violation cost.
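The monitoring idea in this example can be sketched as a tiny hand-written automaton for the rule G(red → (stopped U green)). The class name, proposition names, and unit cost below are illustrative assumptions, not taken from the paper:

```python
# Minimal sketch of a runtime monitor for "stop at a red light until it
# turns green". It evolves synchronously with the environment and emits a
# numerical cost whenever the temporal rule is violated.

class RedLightMonitor:
    """Two-state automaton: 'safe', or 'waiting' (red seen, green not yet)."""

    def __init__(self):
        self.state = "safe"

    def step(self, red: bool, green: bool, moving: bool) -> float:
        """Advance the monitor with this step's propositions; return the cost."""
        if self.state == "safe" and red:
            self.state = "waiting"
        if self.state == "waiting":
            if green:                 # obligation discharged: back to safe
                self.state = "safe"
                return 0.0
            if moving:                # moved before green: temporal violation
                return 1.0
        return 0.0
```

Because the monitor carries state across steps, it distinguishes "moving on green" from "moving while still waiting for green", which no memoryless inequality on the current state can express.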
Key Novelty
Proximal Policy Optimization with Linear Temporal Logic Constraints (PPO-LTL)
  • Translates abstract temporal safety rules (LTL) into automata (LDBA) that act as runtime monitors, evolving synchronously with the environment
  • Uses a logic-to-cost mechanism to convert automaton-detected violations into dense numerical penalties
  • Optimizes the policy using a Lagrangian primal-dual method that dynamically balances maximizing reward with minimizing these temporal violation costs
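The primal-dual balance in the last bullet can be sketched as follows. The learning rate, cost budget, and variable names are illustrative assumptions; the paper's actual update operates on PPO's surrogate objective rather than scalar returns:

```python
# Minimal sketch of a Lagrangian primal-dual scheme: the policy ascends
# reward minus lambda * cost, while lambda is adjusted by dual ascent so
# that expected violation cost is driven toward a budget d.

def lagrangian_objective(reward: float, cost: float, lam: float) -> float:
    """Primal objective the policy maximizes: J_r - lambda * J_c."""
    return reward - lam * cost

def dual_update(lam: float, episode_cost: float, budget: float,
                lr: float = 0.05) -> float:
    """Dual ascent on lambda: grows when cost exceeds the budget,
    shrinks (but never below zero) when the constraint is satisfied."""
    return max(0.0, lam + lr * (episode_cost - budget))

# Episodes whose monitor cost exceeds the budget push lambda up,
# tightening the constraint for subsequent policy updates.
lam = 0.0
for ep_cost in [0.8, 0.8, 0.1]:
    lam = dual_update(lam, ep_cost, budget=0.2)
```

The dynamic balancing described above comes from this coupling: a rising lambda makes violation costs dominate the objective, and lambda relaxes once the automaton-detected costs fall back under the budget.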
Evaluation Highlights
  • Achieves lowest collision rate (0.143) in CARLA autonomous driving, a 45% reduction compared to standard PPO
  • Maintains competitive route completion (0.236) in CARLA, well above baselines like PPO-Shielding (0.072), which tends to drive recklessly and then crash
  • Reduces hit-wall rate to 4.3-4.7% in ZonesEnv, significantly outperforming PPO-Shielding (12.0%)
Breakthrough Assessment
8/10
Strong contribution bridging formal methods and RL. Successfully enables PPO to handle complex temporal constraints (LTL) with rigorous theoretical convergence guarantees.