CTS: Concurrent Teacher-Student Reinforcement Learning for Legged Locomotion

📝 Paper Summary

Legged Locomotion, Sim-to-Real Transfer, Blind Locomotion

CTS concurrently trains teacher and student policies in a shared reinforcement learning loop, allowing the student to learn from both teacher demonstrations and its own environmental interactions to improve robustness.

Core Problem

Conventional two-stage teacher-student training (train the teacher via reinforcement learning, then distill it into the student via supervised learning) limits the student to imitating the teacher, which often leads to suboptimal performance when the student's observations are limited.

Why it matters:

Students operating with only proprioception (blind locomotion) cannot perfectly imitate teachers who see terrain details, leading to performance gaps
Two-stage pipelines are cumbersome and prevent the student policy from adapting its behavior to its specific sensor limitations during the RL phase
Prior methods like ROA update encoders but freeze the policy network during adaptation, preventing full end-to-end optimization

Concrete Example: In a two-stage approach, a blind student robot tries to copy the exact foot placement of a teacher that 'sees' a step. Because the student can't see the step, it fails to match the teacher's latent state perfectly. CTS allows the student to adjust its own policy to handle the step robustly using only proprioception, rather than just failing to mimic the teacher.

Key Novelty

Concurrent Teacher-Student (CTS) Learning Architecture

Trains teacher and student agents simultaneously in parallel groups sharing the same policy and critic networks
The student learns via a composite objective: maximizing its own RL reward (exploring solutions viable for blind agents) while minimizing reconstruction loss to the teacher's privileged latent space
Eliminates the separate distillation phase, allowing the shared policy to find behaviors that work well for both privileged (teacher) and proprioceptive (student) inputs

Architecture

The Concurrent Teacher-Student architecture diagram showing parallel Teacher and Student groups

Evaluation Highlights

Reduces average velocity tracking error by up to 20% compared to standard two-stage teacher-student methods on uneven terrains
Demonstrates robust sim-to-real transfer on both quadrupedal (Unitree Go1, Aliengo) and point-foot bipedal robots
Outperforms Regularized Online Adaptation (ROA) baselines in tracking accuracy and stability metrics

Breakthrough Assessment

7/10

Offers a streamlined, effective alternative to the dominant two-stage paradigm in legged RL. While an architectural evolution rather than a complete paradigm shift, the 20% error reduction and successful hardware deployment on diverse robots are significant.

⚙️ Technical Details

Problem Definition

Setting: Infinite-horizon partially observable Markov decision process (POMDP) for legged locomotion

Inputs: Proprioceptive observation o_t (IMU, joint states, commands) for Student; Full state s_t (including terrain heights, contact forces) for Teacher

Outputs: Action a_t (joint position targets for PD controllers)
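
Since the action is a set of joint position targets for PD controllers, the step from action to joint torque can be sketched as follows. This is a minimal illustration assuming the standard PD law; the function name and the gains kp and kd are placeholders, not values from the paper.

```python
def pd_torques(q_target, q, q_dot, kp=20.0, kd=0.5):
    """Standard PD law per joint: tau = kp * (q_target - q) - kd * q_dot.

    kp and kd are placeholder gains chosen for illustration only.
    """
    return [kp * (qt - qi) - kd * qdi for qt, qi, qdi in zip(q_target, q, q_dot)]


# Two joints: the first lags its target while at rest,
# the second sits on its target but is still moving.
torques = pd_torques(q_target=[0.5, -0.2], q=[0.4, -0.2], q_dot=[0.0, 1.0])
```

The policy therefore never commands torque directly; the low-level controller turns the position error into torque.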

Pipeline Flow

Environment Interaction (Parallel Teacher & Student Groups)
Encoding (Privileged Encoder for Teacher, Proprioceptive Encoder for Student)
Policy Execution (Shared Policy Network)
Optimization (Shared PPO + Reconstruction Loss)
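
The first three stages of the flow above can be sketched as a single rollout step. Everything here is a toy stand-in: the encoder and policy bodies, the dictionary keys, and the teacher_fraction split are assumptions made for illustration, since this summary does not state how environments are divided between the two groups.

```python
def privileged_encoder(full_state):
    # Stand-in for the teacher encoder: collapses the full state to a 1-D latent.
    return [sum(full_state) / len(full_state)]


def proprioceptive_encoder(obs_history):
    # Stand-in for the student encoder: collapses the latest observation.
    last = obs_history[-1]
    return [sum(last) / len(last)]


def shared_policy(obs, latent):
    # Stand-in for the shared policy: one action value from obs and latent.
    return [sum(obs) + sum(latent)]


def rollout_step(envs, teacher_fraction=0.5):
    """Encode each env with its group's encoder, then apply the shared policy."""
    n_teacher = int(len(envs) * teacher_fraction)
    results = []
    for i, env in enumerate(envs):
        if i < n_teacher:
            group, z = "teacher", privileged_encoder(env["full_state"])
        else:
            group, z = "student", proprioceptive_encoder(env["obs_history"])
        results.append((group, z, shared_policy(env["obs"], z)))
    return results


envs = [
    {"obs": [0.1, 0.2], "obs_history": [[0.0, 0.0], [0.1, 0.2]],
     "full_state": [0.1, 0.2, 0.3]}
    for _ in range(4)
]
steps = rollout_step(envs)
```

The point of the sketch is the routing: only the encoder differs between groups, while both feed the same policy function.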

System Modules

Privileged Encoder (Teacher) (Encoding)

Encodes full state (including terrain info) into the teacher latent representation z_t^t

Model or implementation: MLP (sizes [256, 128], ELU activation)

Proprioceptive Encoder (Student) (Encoding)

Encodes history of proprioceptive observations into the student latent representation z_t^s

Model or implementation: MLP (sizes [256, 128], ELU activation)

Shared Policy Network

Maps observation and latent representation to actions

Model or implementation: MLP (sizes [512, 256, 128], ELU activation)

Shared Critic Network

Estimates value function for PPO

Model or implementation: MLP (sizes [512, 256, 128], ELU activation)
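
To make the listed module sizes concrete, here is a small dependency-free sketch of an ELU MLP with the encoder's hidden sizes [256, 128] and a 32-dimensional latent output. The input size of 45, the weight initialization, and the linear output layer are assumptions for illustration; the summary does not specify them.

```python
import math
import random


def elu(x, alpha=1.0):
    # ELU activation: identity for positive inputs, smooth saturation below zero.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def make_mlp(in_dim, hidden_sizes, out_dim, seed=0):
    # Build a list of (weights, biases) pairs, one per layer.
    rng = random.Random(seed)
    dims = [in_dim, *hidden_sizes, out_dim]
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        scale = 1.0 / math.sqrt(d_in)
        w = [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]
        layers.append((w, [0.0] * d_out))
    return layers


def forward(layers, x):
    for idx, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
        if idx < len(layers) - 1:  # ELU on hidden layers only (assumed linear output)
            x = [elu(v) for v in x]
    return x


def num_params(layers):
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


# Hypothetical 45-d input; hidden sizes and the 32-d latent follow the summary.
encoder = make_mlp(in_dim=45, hidden_sizes=[256, 128], out_dim=32)
z = forward(encoder, [0.0] * 45)
```

Under these assumed dimensions the encoder has 48,800 parameters, which gives a sense of how small the networks are.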

Novel Architectural Elements

Dual-group asymmetric actor-critic where Teacher and Student share the exact same Policy and Critic networks
Concurrent training loop in which the Student updates the Policy via RL (PPO) and, at the same time, updates its Encoder via distillation

Modeling

Base Model: Custom MLP Architectures for Encoders and Policy

Training Method: Concurrent PPO with auxiliary reconstruction loss

Objective Functions:

Purpose: Optimize policy to maximize reward for both Teacher and Student.
Formally: PPO-Clip objectives L^t(θ, θ^t) and L^s(θ), summed over the respective trajectories.

Purpose: Align Student's latent representation with Teacher's.
Formally: Reconstruction loss L_recon = || z_t^t - z_t^s ||^2.

Purpose: Estimate value function.
Formally: MSE loss between V_phi and GAE-estimated returns.
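
The three objectives can be written out as plain functions. This is a sketch of the standard PPO-Clip surrogate, a squared L2 latent distance, and a mean-squared value error; how the paper weights or combines these terms is not stated in this summary, so no combined loss is shown.

```python
def ppo_clip_loss(ratios, advantages, clip_eps=0.2):
    """Negative clipped surrogate, averaged over samples (a loss to minimize)."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, r))
        terms.append(min(r * a, clipped * a))
    return -sum(terms) / len(terms)


def reconstruction_loss(z_teacher, z_student):
    """Squared L2 distance || z^t - z^s ||^2 between teacher and student latents."""
    return sum((zt - zs) ** 2 for zt, zs in zip(z_teacher, z_student))


def value_loss(values, returns):
    """Mean squared error between value predictions and return targets."""
    return sum((v - r) ** 2 for v, r in zip(values, returns)) / len(values)


# Probability ratios outside [0.8, 1.2] are clipped when that lowers the objective.
policy_loss = ppo_clip_loss(ratios=[1.5, 0.5], advantages=[1.0, -1.0])
recon = reconstruction_loss(z_teacher=[1.0, 2.0], z_student=[0.0, 0.0])
v_loss = value_loss(values=[1.0, 2.0], returns=[1.0, 4.0])
```

In the example the first sample's ratio is clipped to 1.2 and the second's to 0.8, so neither sample can push the policy further than the clip range allows.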

Key Hyperparameters:

discount_factor_gamma: 0.99
gae_lambda: 0.95
ppo_clip_epsilon: 0.2
learning_rate: 1e-3 (Encoders/Value), 1e-4 (Policy)
entropy_coefficient: 0.01
num_steps_per_env: 24
batch_size: 12000 (approx, derived from env count)
latent_dimension: 32
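
As an illustration of how discount_factor_gamma and gae_lambda enter training, here is a minimal sketch of Generalized Advantage Estimation over one trajectory. The function name and argument layout are assumptions; only the two default values come from the list above.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one trajectory using GAE."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping stops at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


adv, ret = compute_gae(
    rewards=[1.0, 1.0], values=[0.0, 0.0], last_value=0.0, dones=[False, False]
)
```

With zero value estimates, the first step's advantage is 1 + 0.99 * 0.95 * 1 = 1.9405, showing how later rewards are discounted by the product of gamma and lambda.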

Compute: Not reported in the paper

Comparison to Prior Work

vs. two-stage Teacher-Student (TS): CTS updates the Student policy via RL objectives concurrently, not just through imitation
vs. ROA: CTS updates the policy network during student interaction, whereas ROA freezes it
vs. HIM/DreamWaQ: CTS focuses on concurrent policy optimization rather than on how the latent representation is learned

Limitations

No statistical significance tests reported for the 20% improvement claim
Relies on the assumption that a single policy network can serve both privileged and non-privileged representations effectively
Does not explicitly handle visual exteroception (e.g., depth cameras), focusing only on blind locomotion/proprioception

Reproducibility

Code: https://clearlab-sustech.github.io/concurrentTS/

Code is publicly available via the project page, which also includes videos. Hyperparameters for the networks and PPO are provided in Table I and Table II.

📊 Experiments & Results

Evaluation Setup

Simulation in Isaac Gym and real-world testing on Unitree Go1, Aliengo, and a custom point-foot biped

Benchmarks:

Uneven Terrains (Simulation) (Velocity tracking over rough terrain)
Physical Hardware Walk (Real-world robustness test)

Metrics:

Linear Velocity Tracking Error (RMSE)
Angular Velocity Tracking Error (RMSE)
Torque smoothness / Action smoothness (qualitative/auxiliary)
Statistical methodology: Not explicitly reported in the paper
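
The two tracking-error metrics above are root-mean-square errors between commanded and measured velocity. A minimal sketch for a single velocity component follows; the function name and sample values are made up for illustration, and the paper's exact averaging over time, axes, and environments is not given in this summary.

```python
import math


def tracking_rmse(commanded, measured):
    """Root-mean-square error between commanded and measured velocities."""
    squared = [(c - m) ** 2 for c, m in zip(commanded, measured)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up samples: a 1.0 m/s command tracked with small deviations.
error = tracking_rmse(commanded=[1.0, 1.0, 1.0, 1.0], measured=[0.9, 1.1, 1.0, 1.0])
```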

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Uneven Terrains (Simulation)	Lin. Vel. Error (m/s)	0.089	0.066	-0.023
Uneven Terrains (Simulation)	Lin. Vel. Error (m/s)	0.076	0.066	-0.010
Uneven Terrains (Simulation)	Ang. Vel. Error (rad/s)	0.061	0.048	-0.013
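
The Δ column is an absolute difference; the relative reduction each row implies can be computed directly from the listed numbers. The table does not name which baseline each row refers to, so the percentages below are only what these three rows imply.

```python
rows = [
    ("Lin. Vel. Error (m/s)", 0.089, 0.066),
    ("Lin. Vel. Error (m/s)", 0.076, 0.066),
    ("Ang. Vel. Error (rad/s)", 0.061, 0.048),
]


def relative_reduction(baseline, ours):
    """Percentage by which the error drops relative to the baseline."""
    return 100.0 * (baseline - ours) / baseline


reductions = [round(relative_reduction(b, o), 1) for _, b, o in rows]
print(reductions)  # [25.8, 13.2, 21.3]
```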

Experiment Figures

Training curves comparing CTS, TS (Teacher-Student), and ROA

Main Takeaways

CTS consistently outperforms two-stage TS and ROA baselines in velocity tracking accuracy across different robot morphologies (Quadruped, Biped)
The student policy in CTS learns to handle 'blind' scenarios better by optimizing RL rewards directly, rather than just imitating a teacher who can 'see'
Real-world experiments confirm robustness to pushes, slippery surfaces, and stairs without visual sensors

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO)
Teacher-Student Distillation
Sim-to-Real Transfer in Robotics

Key Terms

PPO: Proximal Policy Optimization—a policy gradient method for RL that keeps updates stable via clipping

Teacher-Student: A learning paradigm where a privileged agent (Teacher) guides a limited-sensor agent (Student)

Proprioception: Sensing the robot's own internal state (joint angles, body orientation) without external vision

Privileged Information: Data available in simulation (exact terrain, friction) but not on the real robot

Sim-to-Real: Transferring policies learned in simulation to physical hardware

Latent Representation: A compressed vector encoding the state/environment features, output by the encoders

ROA: Regularized Online Adaptation—a baseline method that adapts encoders online but freezes the policy