IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025
📝 Paper Summary
Humanoid Robot Locomotion · Sim-to-Real Transfer
Distillation-PPO (D-PPO) improves humanoid robot walking by training a student policy with a hybrid loss that combines imitation of a privileged teacher (DAgger) with continued reinforcement learning (PPO) to handle noise and surpass teacher limits.
Core Problem
Existing two-stage (teacher-student imitation) locomotion methods cap the student at the teacher's performance ceiling and handle sensor noise poorly, while end-to-end RL methods are unstable and difficult to train from scratch.
Why it matters:
Humanoid robots are inherently unstable and require precise control to navigate complex terrains (stairs, slopes) without falling.
Teacher policies trained with perfect simulation data often fail to guide students correctly when real-world sensors (depth cameras/LiDAR) introduce noise and occlusion.
Pure imitation prevents the student policy from adapting or improving beyond the teacher, contradicting the core goal of reinforcement learning to find optimal behaviors.
Concrete Example: A teacher policy uses perfect terrain data to step exactly on a safe spot. A student policy relying on noisy real-world depth data sees a slightly different terrain geometry. If the student strictly imitates the teacher's foot placement (DAgger), it might step on a dangerous edge. D-PPO allows the student to adjust its action using RL rewards to find a safe step despite the noisy input.
Key Novelty
Distillation-PPO (D-PPO) Hybrid Loss
Combines supervised imitation loss (DAgger) with reinforcement learning loss (PPO) during the student training stage.
Uses the teacher's actions as a regularization signal to guide convergence, while allowing the PPO component to explore and optimize rewards, enabling the student to adapt to partial observability and potentially outperform the teacher.
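The hybrid objective can be sketched as a weighted sum of a DAgger-style imitation term and the PPO clipped surrogate. A minimal NumPy sketch follows; the `alpha`/`beta` weights are illustrative placeholders, since the paper's exact coefficients are not given in the text.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    # Standard PPO clipped surrogate objective, negated so it is minimized.
    # ratio = pi_new(a|o) / pi_old(a|o), computed per sample.
    return -np.minimum(ratio * advantage,
                       np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage).mean()

def dppo_loss(student_actions, teacher_actions, ratio, advantage,
              alpha=1.0, beta=1.0):
    # Hypothetical combination: alpha weights the DAgger imitation term
    # (MSE to the teacher's actions on student-collected states), beta
    # weights the PPO RL term that lets the student deviate from the teacher.
    imitation = np.mean((student_actions - teacher_actions) ** 2)
    rl = ppo_clip_loss(ratio, advantage)
    return alpha * imitation + beta * rl
```

Setting `beta=0` recovers pure DAgger behavior, while `alpha=0` recovers plain PPO; tuning this balance is one of the method's stated practical costs.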
Architecture
Schematic diagram of the D-PPO training framework, illustrating the two-stage process.
Evaluation Highlights
Demonstrates successful sim-to-real transfer on the 'Tien Kung' humanoid robot across various terrains (qualitative result).
Achieves higher training efficiency and stability in simulation compared to end-to-end methods (qualitative result).
Exhibits robustness to sensor noise by continuing to learn in the POMDP setting rather than just mimicking the MDP teacher (qualitative result).
Breakthrough Assessment
5/10
A solid incremental improvement combining two standard techniques (DAgger and PPO) to address a specific limitation in robotic sim-to-real transfer. While effective, the components are well-known.
⚙️ Technical Details
Problem Definition
Setting: Locomotion control modeled as a Partially Observable Markov Decision Process (POMDP) for the student and a fully observable MDP for the teacher.
Inputs: Proprioception (joint angles/velocities) and Exteroception (Elevation Map compressed into scan dots).
Outputs: Target joint positions for the humanoid robot.
vs. Standard DAgger: D-PPO adds a reinforcement learning objective (PPO) during the student phase, allowing the student to deviate from the teacher to handle noise or optimize rewards better.
vs. End-to-End RL: D-PPO uses the teacher's policy as a regularization term, stabilizing the learning process compared to learning from scratch in a POMDP.
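The MDP/POMDP split above amounts to the teacher seeing privileged, noise-free terrain while the student sees a corrupted version. A small sketch of that observation asymmetry, with illustrative noise and occlusion parameters (the paper's actual noise model is not specified in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_obs(proprio, true_heights):
    # Teacher (MDP): privileged, noise-free terrain heights from the simulator.
    return np.concatenate([proprio, true_heights])

def student_obs(proprio, true_heights, noise_std=0.02, dropout_p=0.1):
    # Student (POMDP): heights corrupted by Gaussian sensor noise and random
    # occlusion dropout, mimicking depth-camera/LiDAR artifacts. The values
    # of noise_std and dropout_p here are assumptions for illustration.
    noisy = true_heights + rng.normal(0.0, noise_std, true_heights.shape)
    mask = rng.random(true_heights.shape) < dropout_p
    noisy = np.where(mask, 0.0, noisy)  # occluded cells read as unknown
    return np.concatenate([proprio, noisy])
```

Because the student's input differs from the teacher's, strict imitation can mislead it; the PPO term gives it a reward signal to act well under its own, degraded observations.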
Limitations
Depends on a teacher policy; if the teacher is poor, the student's starting point is compromised.
Requires tuning of coefficients to balance imitation loss and RL loss.
Specific quantitative performance metrics (e.g., walking speed, failure rate) are not reported in the available text.
Reproducibility
No replication artifacts mentioned in the paper. Code, weights, and specific reward weights (alpha/beta coefficients) are not provided in the text.
📊 Experiments & Results
Evaluation Setup
Simulation training followed by real-world deployment on the 'Tien Kung' humanoid robot.
Benchmarks:
Simulated Terrain Traversal (Locomotion over slopes, steps, and uneven ground) [New]
Real-world Deployment (Walking on physical terrains) [New]
Metrics:
Training efficiency
Stability
Robustness
Generalization
Statistical methodology: Not explicitly reported in the paper
Experiment Figures
Snapshots of the humanoid robot Tien Kung walking on different terrains.
Main Takeaways
The D-PPO framework successfully enables the Tien Kung humanoid robot to walk on complex terrains in the real world, including slopes and steps.
Combining teacher supervision with RL rewards (D-PPO) provides higher training efficiency and stability compared to end-to-end methods which struggle with convergence in POMDP settings.
The student policy trained with D-PPO is more robust to sensor noise and real-world discrepancies than a student trained via pure imitation (DAgger), as it can adapt its behavior to maximize rewards.
Glossary
DAgger: Dataset Aggregation—An imitation learning algorithm where the student policy is trained on data collected by the student but labeled by the teacher.
POMDP: Partially Observable Markov Decision Process—A scenario where the agent does not know the full state of the world (e.g., noisy terrain data) and must infer it.
Scan Dots: A 1D vector representation of the terrain height map sampled around the robot, used as a compact sensory input.
LIO: LiDAR-Inertial Odometry—A method for estimating a robot's position and orientation by combining LiDAR scan matching with inertial measurement unit (IMU) data.
Elevation Map: A 2.5D grid map where each cell contains the height of the terrain at that location.