
Safe reinforcement learning under temporal logic with reward design and quantum action selection

Mingyu Cai, Shaoping Xiao, Junchao Li, Z. Kan
Lehigh University, University of Iowa, University of Science and Technology of China
Scientific Reports (2023)
RL Reasoning Agent

📝 Paper Summary

Safe Reinforcement Learning · Formal Methods in Control (LTL) · Quantum-inspired Reinforcement Learning
A safe RL framework that lets agents learn complex tasks specified in temporal logic while avoiding unsafe states during training, using automaton-based reward shaping and safety-value estimation.
Core Problem
Standard RL lacks safety guarantees during the exploration phase (risking damage to the agent) and struggles with sparse rewards when learning complex high-level tasks defined by formal logic.
Why it matters:
  • Real-world systems like mobile robots cannot afford to visit unsafe states (e.g., hitting obstacles) even once during the training process
  • Sparse rewards in logic-based tasks (rewarded only upon full completion) make convergence difficult or impossible for standard algorithms
  • Existing logic-based RL methods often assume known models or fail to track progress correctly when using deterministic policies with standard automata
Concrete Example: A mobile robot must visit specific rooms in order (T1 -> T2 -> T3) while avoiding a control room (Us). A standard RL agent might repeatedly enter the control room (unsafe) while exploring, or fail to learn the sequence because it receives no reward until the entire sequence T1-T2-T3 is completed.
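The sequencing problem in this example can be made concrete with a small sketch. The following is an illustrative tracking-set mechanism in the spirit of the paper's E-LDGBA (not the paper's implementation): the automaton records which goals remain unvisited in the current round, accepts once all goals have been seen, and flags a violation on entering the unsafe region. The names `GOALS`, `UNSAFE`, and `step` are assumptions for this sketch.

```python
# Illustrative tracking-set automaton for "visit T1, T2, T3; avoid Us".
GOALS = {"T1", "T2", "T3"}
UNSAFE = {"Us"}

def step(tracking: frozenset, label: str):
    """Advance the tracking set on one observed region label.

    Returns (new_tracking, accepted, violated):
    - the tracking set records goals not yet visited in the current round,
    - a round is accepted when every goal has been seen once,
    - entering an unsafe region violates the task.
    """
    if label in UNSAFE:
        return tracking, False, True
    remaining = tracking - {label}
    if not remaining:                  # all goals visited: accept, reset round
        return frozenset(GOALS), True, False
    return remaining, False, False

# Usage: run a successful trace through the automaton.
t = frozenset(GOALS)
accepted = False
for lab in ["T1", "T2", "T3"]:
    t, acc, bad = step(t, lab)
    accepted = accepted or acc
```

Because the tracking set is part of the (product) state, a deterministic policy can distinguish "T1 already visited" from "T1 still pending", which is exactly the progress information a standard automaton loses.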
Key Novelty
Embedded Automaton & Safety Shielding
  • Introduces Embedded Limit-Deterministic Generalized Büchi Automaton (E-LDGBA) which augments standard automata with a tracking set to record unvisited goals, enabling deterministic policies to satisfy complex logic
  • Develops 'Safety Value Functions' that estimate the probability of staying safe based on visited states, acting as a shield to override unsafe exploration actions during training
  • Proposes a potential-based reward shaping mechanism derived from the automaton structure to provide dense rewards without altering the optimal policy
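The shielding idea in the second bullet can be sketched as follows, assuming a learned estimate of the probability of remaining safe after taking an action. The estimator, threshold, and names here are illustrative placeholders, not the paper's learned safety value functions.

```python
# Illustrative safety shield: veto a proposed exploratory action when its
# estimated safety value falls below a threshold, substituting the safest
# available alternative. SAFE_THRESHOLD and safety_value are assumptions.
SAFE_THRESHOLD = 0.9

def shielded_action(state, proposed, actions, safety_value):
    """Return the proposed action if deemed safe enough; otherwise
    return the action with the highest estimated safety value."""
    if safety_value(state, proposed) >= SAFE_THRESHOLD:
        return proposed
    return max(actions, key=lambda a: safety_value(state, a))

# Usage with a toy safety table (illustrative values):
table = {("s0", "left"): 0.95, ("s0", "right"): 0.2}
sv = lambda s, a: table[(s, a)]
chosen = shielded_action("s0", "right", ["left", "right"], sv)
```

The shield only intervenes on unsafe proposals, so the underlying RL algorithm explores normally whenever its chosen action is already estimated to be safe.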
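The third bullet's shaping mechanism follows the classic potential-based form F(s, s') = γΦ(s') − Φ(s), which is known to leave the optimal policy unchanged. A minimal sketch, assuming a potential that counts satisfied subgoals (the function names and values are illustrative, not the paper's):

```python
# Illustrative potential-based reward shaping over automaton progress.
GAMMA = 0.99

def phi(remaining_goals: int, total_goals: int = 3) -> float:
    """Potential grows as fewer goals remain (more progress made)."""
    return float(total_goals - remaining_goals)

def shaped_reward(base_r: float, rem_before: int, rem_after: int) -> float:
    """Dense reward: base task reward plus gamma*Phi(s') - Phi(s)."""
    return base_r + GAMMA * phi(rem_after) - phi(rem_before)

# A transition that ticks off one goal earns a positive shaping bonus...
bonus = shaped_reward(0.0, rem_before=3, rem_after=2)
# ...while a transition with no progress earns no bonus.
neutral = shaped_reward(0.0, rem_before=3, rem_after=3)
```

This is what turns the sparse "reward only on full completion" signal into a dense one: every step of automaton progress pays out immediately, without changing which policy is optimal.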
Breakthrough Assessment
6/10
Offers a rigorous theoretical combination of formal methods and safe RL. The E-LDGBA and safety value concepts are strong contributions to the specific niche of logical control, though the quantum aspect appears more heuristic.