
CaT: Constraints as Terminations for Legged Locomotion Reinforcement Learning

Elliot Chane-Sane, Pierre-Alexandre Léziart, T. Flayols, O. Stasse, P. Souères, N. Mansard
LAAS-CNRS, Artificial and Natural Intelligence Toulouse Institute
IEEE/RSJ International Conference on Intelligent Robots and Systems (2024)
RL

📝 Paper Summary

Legged Locomotion · Constrained Reinforcement Learning · Safe Reinforcement Learning
CaT reformulates physical constraints as stochastic termination probabilities in reinforcement learning, downscaling future rewards based on constraint violation magnitude to enforce safety and style without complex reward tuning.
Core Problem
Standard RL for legged locomotion struggles to enforce hard constraints (like torque limits or foot height) without labor-intensive reward shaping or complex constrained optimization algorithms.
Why it matters:
  • Reward shaping requires tuning dozens of conflicting terms, where maximizing task performance often compromises constraint adherence.
  • Existing constrained RL methods (like Lagrangian approaches) often introduce instability or require additional critic networks, increasing computational overhead.
  • Violating physical constraints on real hardware can damage robots or lead to unsafe behavior during sim-to-real transfer.
Concrete Example: In standard RL, preventing a robot's knees from banging the ground requires manually tuning a negative reward weight. If the weight is too low, the robot ignores the penalty; if too high, the robot stops moving entirely. CaT instead treats knee contact as a probabilistic chance of terminating the episode, which downscales expected future rewards and naturally discourages the behavior without weight tuning.
Key Novelty
Constraints as Terminations (CaT)
  • Reformulates constraints as a probability of terminating the episode (from the learner's perspective) rather than just a negative reward penalty.
  • Scales the discount factor of future rewards by (1 - probability of termination), where the probability increases with the magnitude of constraint violation.
  • Provides a dense learning signal by allowing the agent to 'survive' minor violations with reduced expected returns, rather than abruptly ending the episode on every violation.
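The mechanism above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the constraint functions, the cap `P_MAX`, and the softness scale are assumed names and values chosen for the example.

```python
import numpy as np

# Assumed hyperparameters (illustrative, not from the paper's config):
P_MAX = 0.25    # cap on the per-step termination probability
SOFTNESS = 0.1  # violation scale over which the probability saturates

def termination_probability(violations):
    """violations: array of c_i(s, a), where positive entries are violations.

    Each constraint maps to a termination probability that grows with the
    violation magnitude; a single violated constraint can terminate, so we
    take the max over constraints.
    """
    per_constraint = P_MAX * np.clip(violations / SOFTNESS, 0.0, 1.0)
    return per_constraint.max()

def td_target(reward, next_value, violations, gamma=0.99):
    """Bootstrapped TD target with the discount scaled by survival (1 - delta)."""
    delta = termination_probability(violations)
    return reward + gamma * (1.0 - delta) * next_value

# A mild violation downscales the future return instead of zeroing it,
# keeping the learning signal dense.
target = td_target(reward=1.0, next_value=10.0,
                   violations=np.array([-0.2, 0.05]))
```

Because satisfied constraints (negative values) clip to zero probability, the target reduces to the standard TD target when nothing is violated, which is why no extra reward weights need tuning.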
Evaluation Highlights
  • CaT enforces 0.0% constraint violation rate on critical safety constraints (e.g., joint limits) on the real Solo-12 robot, compared to frequent violations in standard PPO baselines.
  • Achieves higher average velocity and lower energy consumption than baselines while strictly adhering to style constraints like foot clearance.
  • Successfully traverses stairs, slopes, and platforms on physical hardware where unconstrained baselines fail or exhibit unsafe behaviors.
Breakthrough Assessment
7/10
A refreshingly simple and effective method that solves a major pain point in robot learning (constraint satisfaction) without adding algorithmic complexity. Validated on real hardware.