
CTS: Concurrent Teacher-Student Reinforcement Learning for Legged Locomotion

Hongxi Wang, Haoxiang Luo, Wei Zhang, Hua Chen
Southern University of Science and Technology
IEEE Robotics and Automation Letters (2024)
RL

📝 Paper Summary

Legged Locomotion, Sim-to-Real Transfer, Blind Locomotion
CTS concurrently trains teacher and student policies in a shared reinforcement learning loop, allowing the student to learn from both teacher demonstrations and its own environmental interactions to improve robustness.
Core Problem
Conventional two-stage teacher-student training (train the teacher via reinforcement learning, then distill it into the student via supervised learning) limits the student to imitating the teacher, which often leads to suboptimal performance when the student's observations are limited.
Why it matters:
  • Students operating with only proprioception (blind locomotion) cannot perfectly imitate teachers who see terrain details, leading to performance gaps
  • Two-stage pipelines are cumbersome and prevent the student policy from adapting its behavior to its specific sensor limitations during the RL phase
  • Prior methods such as Regularized Online Adaptation (ROA) update encoders but freeze the policy network during adaptation, preventing full end-to-end optimization
Concrete Example: In a two-stage approach, a blind student robot tries to copy the exact foot placement of a teacher that 'sees' a step. Because the student cannot see the step, it cannot match the teacher's latent state perfectly, and pure imitation breaks down. CTS lets the student adjust its own policy to handle the step robustly using only proprioception.
Key Novelty
Concurrent Teacher-Student (CTS) Learning Architecture
  • Trains teacher and student agents simultaneously in parallel groups sharing the same policy and critic networks
  • The student learns via a composite objective: maximizing its own RL reward (exploring solutions viable for blind agents) while minimizing reconstruction loss to the teacher's privileged latent space
  • Eliminates the separate distillation phase, allowing the shared policy to find behaviors that work well for both privileged (teacher) and proprioceptive (student) inputs
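The composite student objective described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary does not name the RL algorithm, so the PPO-style clipped surrogate, the mean-squared-error form of the reconstruction loss, and the names `clipped_surrogate_loss`, `latent_reconstruction_loss`, `student_composite_loss`, `clip_eps`, and `recon_weight` are all assumptions made for the example.

```python
import numpy as np


def clipped_surrogate_loss(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate, returned as a loss (negated objective).

    Assumption: the summary does not state which RL algorithm is used;
    PPO is chosen here only as a common default in legged locomotion.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -np.mean(np.minimum(unclipped, clipped))


def latent_reconstruction_loss(z_student, z_teacher):
    """Mean squared error between student and teacher latents.

    In a real training loop the teacher latent would typically be treated
    as a fixed target (no gradient); NumPy has no gradients, so this only
    computes the value.
    """
    return np.mean((z_student - z_teacher) ** 2)


def student_composite_loss(ratio, advantage, z_student, z_teacher,
                           recon_weight=1.0):
    """RL term plus weighted latent reconstruction term (hypothetical weighting)."""
    rl_term = clipped_surrogate_loss(ratio, advantage)
    recon_term = latent_reconstruction_loss(z_student, z_teacher)
    return rl_term + recon_weight * recon_term


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ratio = np.exp(rng.normal(scale=0.1, size=64))   # pi_new / pi_old
    advantage = rng.normal(size=64)                  # advantage estimates
    z_teacher = rng.normal(size=(64, 16))            # privileged latent
    z_student = z_teacher + rng.normal(scale=0.3, size=(64, 16))
    print(student_composite_loss(ratio, advantage, z_student, z_teacher))
```

The point of the sketch is only that both terms act on the student at once: the RL term rewards behavior that works with proprioceptive input, while the reconstruction term pulls the student latent toward the teacher's.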
Architecture
Figure 2: The Concurrent Teacher-Student architecture diagram, showing parallel teacher and student groups
Evaluation Highlights
  • Reduces average velocity tracking error by up to 20% compared to standard two-stage teacher-student methods on uneven terrains
  • Demonstrates robust sim-to-real transfer on both quadrupedal (Unitree Go1, Aliengo) and point-foot bipedal robots
  • Outperforms Regularized Online Adaptation (ROA) baselines in tracking accuracy and stability metrics
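For readers unfamiliar with the metric, the following sketch shows one plausible way to compute an average velocity tracking error and the relative reduction between two methods. The exact metric definition used in the paper is not given in this summary, so the Euclidean-norm error, the function names `mean_tracking_error` and `relative_reduction`, and the numbers in the usage section are hypothetical and chosen only for illustration.

```python
import numpy as np


def mean_tracking_error(commanded, measured):
    """Average Euclidean distance between commanded and measured velocities.

    Both arrays have shape (timesteps, velocity_dims). This definition is
    an assumption; the paper may use a different norm or normalization.
    """
    commanded = np.asarray(commanded, dtype=float)
    measured = np.asarray(measured, dtype=float)
    return float(np.mean(np.linalg.norm(commanded - measured, axis=-1)))


def relative_reduction(baseline_error, new_error):
    """Fractional error reduction of a new method relative to a baseline."""
    return (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Hypothetical velocities (m/s), not data from the paper.
    commanded = [[1.0, 0.0], [1.0, 0.0]]
    baseline_measured = [[0.5, 0.0], [0.5, 0.0]]
    new_measured = [[0.6, 0.0], [0.6, 0.0]]
    e_base = mean_tracking_error(commanded, baseline_measured)
    e_new = mean_tracking_error(commanded, new_measured)
    print(f"relative reduction: {relative_reduction(e_base, e_new):.0%}")
```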
Breakthrough Assessment
7/10
Offers a streamlined, effective alternative to the dominant two-stage paradigm in legged RL. Although it is an architectural evolution rather than a complete paradigm shift, the error reduction of up to 20% and the successful hardware deployment on diverse robots are significant.