
CaT: Constraints as Terminations for Legged Locomotion Reinforcement Learning

Elliot Chane-Sane, Pierre-Alexandre Léziart, T. Flayols, O. Stasse, P. Souères, N. Mansard
LAAS-CNRS, Artificial and Natural Intelligence Toulouse Institute
IEEE/RSJ International Conference on Intelligent Robots and Systems (2024)
RL

📝 Paper Summary

Legged Locomotion · Constrained Reinforcement Learning · Safe Reinforcement Learning
CaT reformulates physical constraints as stochastic termination probabilities in reinforcement learning, downscaling future rewards based on constraint violation magnitude to enforce safety and style without complex reward tuning.
Core Problem
Standard RL for legged locomotion struggles to enforce hard constraints (like torque limits or foot height) without labor-intensive reward shaping or complex constrained optimization algorithms.
Why it matters:
  • Reward shaping requires tuning dozens of conflicting terms, where maximizing task performance often compromises constraint adherence.
  • Existing constrained RL methods (like Lagrangian approaches) often introduce instability or require additional critic networks, increasing computational overhead.
  • Violating physical constraints on real hardware can damage robots or lead to unsafe behavior during sim-to-real transfer.
Concrete Example: In standard RL, preventing a robot's knees from banging the ground requires manually tuning a negative reward weight. If the weight is too low, the robot ignores the penalty; if too high, the robot stops moving entirely. CaT instead treats knee contact as a probabilistic chance of terminating the episode, which downscales expected future rewards and naturally discourages the behavior without weight tuning.
Key Novelty
Constraints as Terminations (CaT)
  • Reformulates constraints as a probability of terminating the episode (from the learner's perspective) rather than just a negative reward penalty.
  • Scales the discount factor of future rewards by (1 - probability of termination), where the probability increases with the magnitude of constraint violation.
  • Provides a dense learning signal by allowing the agent to 'survive' minor violations with reduced expected returns, rather than abruptly ending the episode on every violation.
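The mechanism above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the constraint functions, the cap `P_MAX`, and the softness scale are assumed names and values chosen for the example.

```python
import numpy as np

# Assumed hyperparameters (illustrative, not from the paper's config):
P_MAX = 0.25    # cap on the per-step termination probability
SOFTNESS = 0.1  # violation scale over which the probability saturates

def termination_probability(violations):
    """violations: array of c_i(s, a), where positive entries are violations.

    Each constraint maps to a termination probability that grows with the
    violation magnitude; a single violated constraint can terminate, so we
    take the max over constraints.
    """
    per_constraint = P_MAX * np.clip(violations / SOFTNESS, 0.0, 1.0)
    return per_constraint.max()

def td_target(reward, next_value, violations, gamma=0.99):
    """Bootstrapped TD target with the discount scaled by survival (1 - delta)."""
    delta = termination_probability(violations)
    return reward + gamma * (1.0 - delta) * next_value

# A mild violation downscales the future return instead of zeroing it,
# keeping the learning signal dense.
target = td_target(reward=1.0, next_value=10.0,
                   violations=np.array([-0.2, 0.05]))
```

Because satisfied constraints (negative values) clip to zero probability, the target reduces to the standard TD target when nothing is violated, which is why no extra reward weights need tuning.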
Evaluation Highlights
  • CaT enforces 0.0% constraint violation rate on critical safety constraints (e.g., joint limits) on the real Solo-12 robot, compared to frequent violations in standard PPO baselines.
  • Achieves higher average velocity and lower energy consumption than baselines while strictly adhering to style constraints like foot clearance.
  • Successfully traverses stairs, slopes, and platforms on physical hardware where unconstrained baselines fail or exhibit unsafe behaviors.
Breakthrough Assessment
7/10
A refreshingly simple and effective method that solves a major pain point in robot learning (constraint satisfaction) without adding algorithmic complexity. Validated on real hardware.