SoloParkour: Constrained Reinforcement Learning for Visual Locomotion from Privileged Experience

Elliot Chane-Sane, Joseph Amigo, T. Flayols, Ludovic Righetti, Nicolas Mansard
Machines in Motion Laboratory, New York University, USA; Artificial and Natural Intelligence Toulouse Institute, Toulouse, France
Conference on Robot Learning (2024)
RL MM

📝 Paper Summary

Legged Locomotion · Visual Sim-to-Real Transfer · Constrained Reinforcement Learning
SoloParkour trains a safe, agile visual locomotion policy for a low-cost quadruped by using constrained RL and warm-starting off-policy learning with demonstrations from a privileged teacher.
Core Problem
Training agile visual locomotion policies is difficult because depth rendering is computationally expensive for RL, and standard distillation methods fail when privileged information cannot be inferred from vision (e.g., due to occlusions).
Why it matters:
  • Low-cost robots like Solo-12 are fragile and require strict safety constraints to prevent hardware damage during agile maneuvers
  • Distilling privileged policies into visual ones often leads to sub-optimal behaviors when the visual policy cannot reconstruct the teacher's privileged knowledge (the observability gap)
  • Direct RL from pixels is typically impractical for complex locomotion tasks: it needs many samples, and each sample requires slow depth rendering
Concrete Example: A privileged teacher can see an obstacle hidden behind another object and plan accordingly. A visual policy trained by pure behavior cloning cannot see the hidden obstacle, so it cannot reproduce the teacher's behavior and collides. An RL agent trained on pixels would instead learn to gather more information first.
Key Novelty
Constrained RL with Privileged Warm-Start (SoloParkour)
  • Formulate parkour as a constrained RL problem to enforce physical limits (torque, velocity) directly, ensuring safety without complex reward tuning
  • Train a privileged policy first using cheap geometric data, then use it to generate a buffer of experience to warm-start an off-policy RL algorithm (DDPG derivative)
  • Switch to training from depth pixels using this warm-start buffer mixed with new online data, allowing the agent to adapt its behavior to actual visual limitations rather than blindly copying the teacher
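The constrained formulation in the first bullet can be written generically as follows. This is a sketch: the symbols ($r$, $\gamma$, joint torque $\tau_{j,t}$, joint velocity $\dot{q}_{j,t}$, and the limits $\tau_{\max}$, $\dot{q}_{\max}$) are illustrative notation rather than the paper's, and this summary does not say how the constraints are enforced during training.

```latex
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{subject to} \quad
|\tau_{j,t}| \le \tau_{\max}, \qquad
|\dot{q}_{j,t}| \le \dot{q}_{\max}
\quad \text{for all joints } j \text{ and time steps } t .
```

Keeping the limits as constraints, rather than as weighted penalty terms inside $r$, is what removes the need to tune penalty weights against the task reward.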
Architecture
Figure 3
The two-stage training pipeline: (1) Training a privileged policy on geometric data, and (2) using its experience to warm-start an off-policy RL agent that learns from depth images.
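The second stage can be sketched as a replay buffer that mixes stored teacher experience with fresh rollouts. This is a minimal illustration, not the authors' implementation: the class name `MixedReplayBuffer`, the `demo_fraction` parameter, and its 50/50 default are assumptions, and the sketch presumes the teacher's transitions were recorded with the depth observations the visual policy consumes.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

# A transition is (observation, action, reward, next_observation, done).
Transition = Tuple[list, list, float, list, bool]


class MixedReplayBuffer:
    """Hypothetical buffer: fixed teacher data plus a rolling window of online data."""

    def __init__(
        self,
        demo_transitions: List[Transition],
        online_capacity: int = 100_000,
        demo_fraction: float = 0.5,  # assumed mixing ratio, not from the paper
        seed: int = 0,
    ) -> None:
        if not 0.0 <= demo_fraction <= 1.0:
            raise ValueError("demo_fraction must be in [0, 1]")
        self.demo: List[Transition] = list(demo_transitions)
        # Oldest online transitions are dropped once capacity is reached.
        self.online: Deque[Transition] = deque(maxlen=online_capacity)
        self.demo_fraction = demo_fraction
        self.rng = random.Random(seed)

    def add_online(self, transition: Transition) -> None:
        """Store a transition collected by the visual policy."""
        self.online.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        """Draw a batch with replacement, split between teacher and online data."""
        if not self.demo and not self.online:
            raise ValueError("cannot sample from an empty buffer")
        if not self.online:
            n_demo = batch_size  # warm start: only teacher data exists so far
        elif not self.demo:
            n_demo = 0
        else:
            n_demo = round(batch_size * self.demo_fraction)
        batch = [self.rng.choice(self.demo) for _ in range(n_demo)]
        batch += [self.rng.choice(self.online) for _ in range(batch_size - n_demo)]
        return batch
```

An off-policy learner would call `sample` at every update step and `add_online` after every environment step, so the share of the agent's own depth-based experience grows from zero as training proceeds.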
Evaluation Highlights
  • Clears obstacles 1.5× the robot's height (36 cm obstacle vs. 24 cm robot) on a real Solo-12 robot
  • Achieves a 100% success rate on 40 cm jumps in simulation, matching the performance of the privileged teacher
  • Successfully transfers agile skills (walking, climbing, leaping, crawling) to the real world using only onboard depth sensing
Breakthrough Assessment
8/10
Significant achievement in deploying agile parkour on a hardware-constrained, low-cost robot. The method sidesteps both the observability gap of distillation and the cost of depth rendering, enabling end-to-end RL from pixels.