IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025
📝 Paper Summary
Humanoid Robot Locomotion · Sim-to-Real Transfer
Distillation-PPO (D-PPO) improves humanoid robot walking by training a student policy with a hybrid loss that combines imitation of a privileged teacher (DAgger) with continued reinforcement learning (PPO) to handle noise and surpass teacher limits.
Core Problem
Existing two-stage (teacher-student imitation) locomotion methods cap the student at the teacher's performance ceiling and handle sensor noise poorly, while end-to-end RL methods are unstable and difficult to train from scratch.
Why it matters:
Humanoid robots are inherently unstable and require precise control to navigate complex terrains (stairs, slopes) without falling.
Teacher policies trained with perfect simulation data often fail to guide students correctly when real-world sensors (depth cameras/LiDAR) introduce noise and occlusion.
Pure imitation prevents the student policy from adapting or improving beyond the teacher, contradicting the core goal of reinforcement learning to find optimal behaviors.
Concrete Example: A teacher policy uses perfect terrain data to step exactly on a safe spot. A student policy relying on noisy real-world depth data sees a slightly different terrain geometry. If the student strictly imitates the teacher's foot placement (DAgger), it might step on a dangerous edge. D-PPO allows the student to adjust its action using RL rewards to find a safe step despite the noisy input.
Key Novelty
Distillation-PPO (D-PPO) Hybrid Loss
Combines supervised imitation loss (DAgger) with reinforcement learning loss (PPO) during the student training stage.
Uses the teacher's actions as a regularization signal to guide convergence, while allowing the PPO component to explore and optimize rewards, enabling the student to adapt to partial observability and potentially outperform the teacher.
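The hybrid objective can be sketched as a weighted sum of a DAgger-style imitation term and the PPO clipped surrogate. A minimal NumPy sketch follows; the `alpha`/`beta` weights are illustrative placeholders, since the paper's exact coefficients are not given in the text.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    # Standard PPO clipped surrogate objective, negated so it is minimized.
    # ratio = pi_new(a|o) / pi_old(a|o), computed per sample.
    return -np.minimum(ratio * advantage,
                       np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage).mean()

def dppo_loss(student_actions, teacher_actions, ratio, advantage,
              alpha=1.0, beta=1.0):
    # Hypothetical combination: alpha weights the DAgger imitation term
    # (MSE to the teacher's actions on student-collected states), beta
    # weights the PPO RL term that lets the student deviate from the teacher.
    imitation = np.mean((student_actions - teacher_actions) ** 2)
    rl = ppo_clip_loss(ratio, advantage)
    return alpha * imitation + beta * rl
```

Setting `beta=0` recovers pure DAgger behavior, while `alpha=0` recovers plain PPO; tuning this balance is one of the method's stated practical costs.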
Architecture
Schematic diagram of the D-PPO training framework, illustrating the two-stage process.
Evaluation Highlights
Demonstrates successful sim-to-real transfer on the 'Tien Kung' humanoid robot across various terrains (qualitative result).
Achieves higher training efficiency and stability in simulation compared to end-to-end methods (qualitative result).
Exhibits robustness to sensor noise by continuing to learn in the POMDP setting rather than just mimicking the MDP teacher (qualitative result).
Breakthrough Assessment
5/10
A solid incremental improvement combining two standard techniques (DAgger and PPO) to address a specific limitation in robotic sim-to-real transfer. While effective, the components are well-known.
⚙️ Technical Details
Problem Definition
Setting: Locomotion control modeled as a Partially Observable Markov Decision Process (POMDP) for the student and a fully observable MDP for the teacher.
Inputs: Proprioception (joint angles/velocities) and Exteroception (Elevation Map compressed into scan dots).
Outputs: Target joint positions for the humanoid robot.
vs. Standard DAgger: D-PPO adds a reinforcement learning objective (PPO) during the student phase, allowing the student to deviate from the teacher to handle noise or optimize rewards better.
vs. End-to-End RL: D-PPO uses the teacher's policy as a regularization term, stabilizing the learning process compared to learning from scratch in a POMDP.
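The MDP/POMDP split above amounts to the teacher seeing privileged, noise-free terrain while the student sees a corrupted version. A small sketch of that observation asymmetry, with illustrative noise and occlusion parameters (the paper's actual noise model is not specified in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_obs(proprio, true_heights):
    # Teacher (MDP): privileged, noise-free terrain heights from the simulator.
    return np.concatenate([proprio, true_heights])

def student_obs(proprio, true_heights, noise_std=0.02, dropout_p=0.1):
    # Student (POMDP): heights corrupted by Gaussian sensor noise and random
    # occlusion dropout, mimicking depth-camera/LiDAR artifacts. The values
    # of noise_std and dropout_p here are assumptions for illustration.
    noisy = true_heights + rng.normal(0.0, noise_std, true_heights.shape)
    mask = rng.random(true_heights.shape) < dropout_p
    noisy = np.where(mask, 0.0, noisy)  # occluded cells read as unknown
    return np.concatenate([proprio, noisy])
```

Because the student's input differs from the teacher's, strict imitation can mislead it; the PPO term gives it a reward signal to act well under its own, degraded observations.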
Limitations
Depends on a teacher policy; if the teacher is poor, the student's starting point is compromised.
Requires tuning of coefficients to balance imitation loss and RL loss.
Specific quantitative performance metrics (e.g., walking speed, failure rate) are not reported in the available text.
Reproducibility
No replication artifacts mentioned in the paper. Code, weights, and specific reward weights (alpha/beta coefficients) are not provided in the text.
📊 Experiments & Results
Evaluation Setup
Simulation training followed by real-world deployment on the 'Tien Kung' humanoid robot.
Benchmarks:
Simulated Terrain Traversal (Locomotion over slopes, steps, and uneven ground) [New]
Real-world Deployment (Walking on physical terrains) [New]
Metrics:
Training efficiency
Stability
Robustness
Generalization
Statistical methodology: Not explicitly reported in the paper
Experiment Figures
Snapshots of the humanoid robot Tien Kung walking on different terrains.
Main Takeaways
The D-PPO framework successfully enables the Tien Kung humanoid robot to walk on complex terrains in the real world, including slopes and steps.
Combining teacher supervision with RL rewards (D-PPO) provides higher training efficiency and stability compared to end-to-end methods which struggle with convergence in POMDP settings.
The student policy trained with D-PPO is more robust to sensor noise and real-world discrepancies than a student trained via pure imitation (DAgger), as it can adapt its behavior to maximize rewards.
Glossary
DAgger: Dataset Aggregation—An imitation learning algorithm where the student policy is trained on data collected by the student but labeled by the teacher.
POMDP: Partially Observable Markov Decision Process—A scenario where the agent does not know the full state of the world (e.g., noisy terrain data) and must infer it.
Scan Dots: A 1D vector representation of the terrain height map sampled around the robot, used as a compact sensory input.
LIO: LiDAR-Inertial Odometry—A method for estimating a robot's position and orientation by combining LiDAR scan matching with inertial measurement unit (IMU) data.
Elevation Map: A 2.5D grid map where each cell contains the height of the terrain at that location.