SoloParkour: Constrained Reinforcement Learning for Visual Locomotion from Privileged Experience

📝 Paper Summary

Legged Locomotion, Visual Sim-to-Real Transfer, Constrained Reinforcement Learning

SoloParkour trains a safe, agile visual locomotion policy for a low-cost quadruped by using constrained RL and warm-starting off-policy learning with demonstrations from a privileged teacher.

Core Problem

Training agile visual locomotion policies is difficult because depth rendering is computationally expensive for RL, and standard distillation methods fail when privileged information cannot be inferred from vision (e.g., due to occlusions).

Why it matters:

Low-cost robots like Solo-12 are fragile and require strict safety constraints to prevent hardware damage during agile maneuvers
Distilling privileged policies into visual ones often leads to sub-optimal behaviors when the visual policy cannot reconstruct the teacher's privileged knowledge (the observability gap)
Direct RL from pixels is typically too sample-inefficient for complex locomotion tasks due to slow rendering speeds

Concrete Example: A privileged teacher might know about an obstacle hidden behind another object and plan accordingly. A visual policy trained via behavioral cloning cannot see the hidden obstacle, so it fails to reproduce the teacher's behavior and collides. An RL agent trained on pixels would instead learn to gather more information first.

Key Novelty

Constrained RL with Privileged Warm-Start (SoloParkour)

Formulate parkour as a constrained RL problem to enforce physical limits (torque, velocity) directly, ensuring safety without complex reward tuning
Train a privileged policy first using cheap geometric data, then use it to generate a buffer of experience to warm-start an off-policy RL algorithm (DDPG derivative)
Switch to training from depth pixels using this warm-start buffer mixed with new online data, allowing the agent to adapt its behavior to actual visual limitations rather than blindly copying the teacher

Architecture

The two-stage training pipeline: (1) Training a privileged policy on geometric data, and (2) using its experience to warm-start an off-policy RL agent that learns from depth images.

Evaluation Highlights

Clears obstacles 1.5x the robot's height (36cm obstacle vs 24cm robot) on a real Solo-12 robot
Achieves 100% success rate on 40cm jumps in simulation, matching the performance of the privileged teacher
Successfully transfers agile skills (walking, climbing, leaping, crawling) to the real world using only onboard depth sensing

Breakthrough Assessment

8/10

Significant achievement in deploying agile parkour on a hardware-constrained, low-cost robot. The method cleverly bypasses the 'distillation gap' and rendering costs, enabling true end-to-end RL from pixels.

⚙️ Technical Details

Problem Definition

Setting: Infinite-horizon, discounted, constrained Markov Decision Process (CMDP)

Inputs: History of proprioception (joint positions/velocities), previous actions, command vector, and depth images

Outputs: Joint position offsets (converted to torques via PD controller)
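
The conversion from policy output to motor torque can be illustrated with a minimal sketch. It assumes a zero target velocity and a fixed default pose; the gains, the action scale, and the names `PDGains` and `pd_torques` are hypothetical placeholders, not values or identifiers from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class PDGains:
    kp: float = 4.0            # hypothetical proportional gain
    kd: float = 0.2            # hypothetical derivative gain
    action_scale: float = 0.5  # hypothetical scaling of the policy output (rad)


def pd_torques(
    action: Sequence[float],
    q_default: Sequence[float],
    q: Sequence[float],
    dq: Sequence[float],
    gains: PDGains = PDGains(),
) -> List[float]:
    """Turn joint position offsets into torques with a PD law.

    The policy output is an offset around a default joint angle; the PD law
    pulls each joint toward that target and damps its velocity.
    """
    torques = []
    for a, q0, qi, dqi in zip(action, q_default, q, dq):
        q_target = q0 + gains.action_scale * a          # offset around the default pose
        tau = gains.kp * (q_target - qi) - gains.kd * dqi
        torques.append(tau)
    return torques
```

Note that this sketch does not clip the torque: in the summary's description, torque limits are enforced through the constraint formulation rather than inside the controller.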

Pipeline Flow

Privileged Training: Environment (State + Height Map) → PPO Agent → Privileged Policy
Data Generation: Privileged Policy → Interaction → Privileged Experience Buffer (contains Depth Images)
Visual Training: Privileged Buffer + Online Buffer → DDPG Agent (Pixel-based) → Visual Policy
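
The data-generation step above can be sketched as follows. The point it illustrates is that the teacher acts on privileged observations while the buffer stores what the student will later see. Everything here is a hypothetical toy: `ToyEnv`, `teacher_policy`, and `collect_privileged_buffer` are illustrative names, and a one-element tuple stands in for a depth image.

```python
import random
from typing import List, NamedTuple, Tuple


class Transition(NamedTuple):
    visual_obs: Tuple[float, ...]       # stand-in for a depth image
    action: float
    reward: float
    next_visual_obs: Tuple[float, ...]
    done: bool


class ToyEnv:
    """Hypothetical simulator stand-in that exposes both observation types."""

    def __init__(self, seed: int = 0, episode_length: int = 5):
        self.rng = random.Random(seed)
        self.episode_length = episode_length
        self.height = 0.0
        self.t = 0

    def _obs(self):
        privileged = (self.height,)        # exact terrain height
        visual = (round(self.height, 1),)  # coarse, lossy view of the same terrain
        return privileged, visual

    def reset(self):
        self.height = self.rng.uniform(0.0, 0.4)
        self.t = 0
        return self._obs()

    def step(self, action: float):
        reward = -abs(action - self.height)  # best action matches the true height
        self.t += 1
        self.height = self.rng.uniform(0.0, 0.4)
        return self._obs(), reward, self.t >= self.episode_length


def teacher_policy(privileged_obs: Tuple[float, ...]) -> float:
    """Toy teacher: reads the exact height from the privileged observation."""
    return privileged_obs[0]


def collect_privileged_buffer(env: ToyEnv, n_episodes: int) -> List[Transition]:
    buffer: List[Transition] = []
    for _ in range(n_episodes):
        privileged, visual = env.reset()
        done = False
        while not done:
            action = teacher_policy(privileged)  # teacher acts on privileged input
            (next_priv, next_visual), reward, done = env.step(action)
            # Store the visual observation, since that is what the student trains on.
            buffer.append(Transition(visual, action, reward, next_visual, done))
            privileged, visual = next_priv, next_visual
    return buffer
```

In this toy, the stored actions are optimal for the privileged view but not always recoverable from the coarse visual view, which mirrors why the visual stage continues with RL instead of copying the stored actions.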

System Modules

Privileged Policy

Learn optimal behaviors using full state information to generate demonstrations

Model or implementation: Multi-Layer Perceptron (MLP)

Visual Policy

Learn to control robot from depth images, using privileged data as a warm start

Model or implementation: ConvNet + GRU + MLP head

Novel Architectural Elements

Hybrid RL pipeline: Uses an on-policy teacher (PPO) to generate data for an off-policy student (DDPG) to solve the sample efficiency problem of visual RL
Integration of Constraints as Terminations (CaT) into the off-policy RLPD framework for safe visual learning

Modeling

Base Model: Custom ConvNet + GRU architecture

Training Method: Two-stage process: (1) Privileged PPO, (2) Visual DDPG with Privileged Warm-Start

Objective Functions:

Purpose: Maximize discounted returns while satisfying constraints.

Formally: max_pi E[sum_t(gamma^t * r_t)] subject to E[sum_t(gamma^t * c_i(s_t, a_t))] = 0 for each constraint i

Purpose: Enforce safety limits.

Formally: Constraints modeled as terminations (CaT), ending episodes upon violation.
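
One way to picture constraints-as-terminations is a soft variant in which the size of a violation sets a termination probability that then discounts the bootstrapped value. The sketch below is an assumption-laden illustration of that idea, not the paper's exact formulation: the normalization by `max_violations`, the cap `p_max`, and both function names are hypothetical.

```python
from typing import Sequence


def termination_probability(
    violations: Sequence[float],
    max_violations: Sequence[float],
    p_max: float = 1.0,
) -> float:
    """Map constraint values c_i(s, a) to a termination probability in [0, p_max].

    A value c_i <= 0 means the constraint is satisfied. Positive values are
    normalized by a per-constraint scale (assumed here) and the worst one wins.
    """
    delta = 0.0
    for c, c_max in zip(violations, max_violations):
        normalized = min(max(c, 0.0) / c_max, 1.0)
        delta = max(delta, p_max * normalized)
    return delta


def td_target(reward: float, next_q: float, delta: float, gamma: float = 0.99) -> float:
    """Bootstrapped target where future value is cut in proportion to delta."""
    return reward + gamma * (1.0 - delta) * next_q
```

With `delta = 1.0` the target collapses to the immediate reward, which is the hard-termination case described above; with `delta = 0.0` it is the usual discounted target.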

Key Hyperparameters:

image_resolution: 64x64
discount_factor_gamma: 0.99
batch_size: 256
replay_buffer_ratio: 50% privileged / 50% online
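
The 50/50 replay ratio can be sketched as a sampler that draws half of each batch from the privileged buffer and half from the online buffer. This is a minimal illustration: `sample_mixed_batch` is a hypothetical name, sampling is done with replacement, and the fallback to privileged-only batches while the online buffer is still empty is an assumption rather than something the summary states.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_mixed_batch(
    privileged_buffer: Sequence[T],
    online_buffer: Sequence[T],
    batch_size: int = 256,
    privileged_fraction: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Draw a training batch that mixes teacher data with fresh online data."""
    rng = rng or random.Random()
    n_priv = int(batch_size * privileged_fraction)
    n_online = batch_size - n_priv
    if not online_buffer:
        # Assumed fallback: before any online data exists, use teacher data only.
        n_priv, n_online = batch_size, 0
    batch = rng.choices(privileged_buffer, k=n_priv)  # sampling with replacement
    if n_online:
        batch += rng.choices(online_buffer, k=n_online)
    rng.shuffle(batch)
    return batch
```

Keeping the ratio fixed means the teacher data never gets drowned out as the online buffer grows, while the online half lets the agent correct behaviors that do not work from depth images alone.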

Compute: Not reported in the paper

Comparison to Prior Work

vs. Distillation: SoloParkour uses RL for the visual phase (initialized with teacher data), allowing the student to deviate from the teacher to handle partial observability, whereas distillation forces mimicry.
vs. End-to-End RL: SoloParkour uses privileged demonstrations to bypass the prohibitive sample complexity of learning from scratch with rendered images.
vs. Planners: SoloParkour learns a single multi-task policy for all parkour skills.

Limitations

Relies on a simulator that exposes privileged state, models the physics accurately, and yields useful teacher demonstrations.
Training requires rendering depth images in simulation, which is computationally heavier than state-based learning, even with the warm-start speedup.
The 'observability gap' problem is mitigated but not solved; if the visual sensor simply cannot see a hazard, the policy may still fail.
No code has been released, which makes the method harder to re-implement.

Reproducibility

No code or pretrained models provided. The paper describes the architecture (ConvNet + GRU), algorithms (PPO, DDPG, RLPD), and reward/constraint formulations in detail, but exact replication would require re-implementing the custom RL pipeline and IsaacGym environment.

📊 Experiments & Results

Evaluation Setup

Simulation in IsaacGym and Real-world deployment on Solo-12 robot

Benchmarks:

Parkour Terrains (Sim) (Locomotion success rate on Walking, Climbing, Leaping, Crawling) [New]
Real-World Deployment (Qualitative success and robustness on physical obstacles) [New]

Metrics:

Success Rate (Simulation)
Maximum traversable obstacle difficulty
Statistical methodology: Not explicitly reported in the paper

Key Results

Simulation results compare the proposed SoloParkour method against a privileged teacher (upper bound) and a standard behavioral cloning (BC) baseline across four parkour tasks.

Benchmark	Metric	Baseline	This Paper	Δ
Simulation (Leaping 40cm gap)	Success Rate	0.0	1.0	+1.0
Simulation (Crawling 20cm height)	Success Rate	0.6	0.9	+0.3
Simulation (Climbing 24cm steps)	Success Rate	0.4	1.0	+0.6

Experiment Figures

Bar charts showing success rates of the Privileged Policy, Behavioral Cloning (BC), and SoloParkour (Ours) on four tasks (Walk, Climb, Leap, Crawl) at varying difficulties.

Main Takeaways

SoloParkour consistently outperforms standard distillation (Behavioral Cloning) across all difficult terrain types (leaping, crawling, climbing).
The method nearly matches the performance of the privileged teacher policy, indicating highly effective transfer of skills despite limited visual information.
Real-world experiments confirm the policy is safe and robust, respecting torque limits while performing aggressive maneuvers like jumping 40cm gaps and climbing 36cm obstacles (1.5x robot height).

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals (MDPs, policies, rewards)
Sim-to-Real transfer techniques
Constrained Optimization

Key Terms

PPO: Proximal Policy Optimization—an on-policy RL algorithm used here for the initial privileged teacher policy

DDPG: Deep Deterministic Policy Gradient—an off-policy RL algorithm used here for the visual policy to efficiently reuse data

CaT: Constraints as Terminations—a method to enforce safety constraints by terminating the episode when constraints are violated, treating them as terminal states

RLPD: Reinforcement Learning with Prior Data—a technique to accelerate RL by filling the replay buffer with demonstrations from a prior controller

privileged information: Data available only in simulation (e.g., exact terrain height maps, obstacle positions) used to train a teacher policy but unavailable to the real robot

distillation: A process where a 'student' neural network learns to mimic the output of a 'teacher' network; often used to transfer privileged behaviors to vision-based agents

observability gap: The discrepancy between what a privileged teacher knows (everything) and what a visual student can see (limited field of view, occlusions), making perfect imitation impossible

PD controller: Proportional-Derivative controller, a feedback controller used here to convert the policy's joint position targets into motor torques

REDQ: Randomized Ensembled Double Q-learning—an RL technique using an ensemble of critics to reduce overestimation bias, enabling high update-to-data ratios

sim-to-real: The process of transferring a policy trained in a physics simulator to a physical robot

warm-start: Initializing the training process with pre-collected data or pre-trained weights to speed up learning