Reinforcement learning for versatile, dynamic, and robust bipedal locomotion control

📝 Paper Summary

Legged Locomotion Sim-to-Real Transfer

A reinforcement learning framework using a dual-history architecture—combining short-term feedback with long-term input/output logs—enables a bipedal robot to perform agile walking, running, and jumping with zero-shot sim-to-real transfer.

Core Problem

Controlling bipedal robots is difficult because of their high-dimensional, nonlinear, and underactuated dynamics, and distinct skills (such as walking and jumping) typically require specialized, handcrafted contact plans.

Why it matters:

Traditional model-based optimal control is computationally expensive and struggles with real-time whole-body planning for diverse agile skills
Prior RL methods often focus on single skills (e.g., just walking) or fail to transfer highly dynamic aperiodic motions (like jumping) to the real world without fine-tuning

Concrete Example: Running introduces a flight phase where the robot is underactuated and unstable; standard walking controllers that rely on orbital stability fail here, and model-based methods often cannot re-plan contact sequences fast enough for real-world disturbances.

Key Novelty

Dual-History Policy Architecture with Multi-Stage Training

Incorporates two history streams: a 'short' history (4 steps) for immediate feedback control and a 'long' history (66 steps/2 seconds) processed via a CNN to implicitly identify system dynamics
Utilizes a training curriculum that moves from single-task learning to 'task randomization' (varying goals) and finally 'dynamics randomization', fostering robustness and disturbance compliance
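The three-stage curriculum can be sketched as a small configuration switch. This is an illustration only: the names `Stage`, `EpisodeConfig`, and `episode_config` are hypothetical, and the sketch assumes the stages are cumulative (task randomization stays on once dynamics randomization is added), which the summary above does not state explicitly.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    SINGLE_TASK = 1             # one fixed goal, nominal simulator dynamics
    TASK_RANDOMIZATION = 2      # goals (commands) vary between episodes
    DYNAMICS_RANDOMIZATION = 3  # simulator dynamics are varied as well


@dataclass(frozen=True)
class EpisodeConfig:
    randomize_task: bool
    randomize_dynamics: bool


def episode_config(stage: Stage) -> EpisodeConfig:
    """Return which randomizations are active in a given stage.

    Assumption: stages are cumulative, so stage 3 keeps task randomization on.
    """
    return EpisodeConfig(
        randomize_task=stage.value >= Stage.TASK_RANDOMIZATION.value,
        randomize_dynamics=stage.value >= Stage.DYNAMICS_RANDOMIZATION.value,
    )


for stage in Stage:
    print(stage.name, episode_config(stage))
```

A training loop would call `episode_config` at each episode reset to decide whether to sample a new goal and new simulator parameters.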

Architecture

The control policy architecture showing the dual-history processing streams.

Evaluation Highlights

Running: Achieved a 400-meter dash in 2 minutes 34 seconds on the Cassie robot (approx 2.6 m/s), outperforming prior RL methods that could not sustain turning or long-distance running
Jumping: Demonstrated a standing long jump of 1.4m and a vertical box jump of 0.44m, significantly exceeding prior controller capabilities (e.g., 0.41m max leap)
Robustness: Zero-shot transfer to real hardware with the ability to recover from unexpected external forces and adapt to hardware changes over a one-year timespan

Breakthrough Assessment

9/10

Demonstrates unprecedented versatility on a bipedal platform, unifying walking, running, and jumping in one framework with successful zero-shot transfer and impressive physical benchmarks (400m dash).

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) for locomotion control

Inputs: Observable states 'o' (motor positions/velocities, base orientation/velocity), Command 'c', Reference motion 'q^r', Short I/O history (4 steps), Long I/O history (66 steps)

Outputs: Desired motor positions 'q^d_m' (Action 'a')

Pipeline Flow

State Estimation (IMU/Encoders)
History Processing (Short + Long)
Policy Inference (CNN + MLP)
Low-level Control (PD)

System Modules

Long History Encoder

Compresses 2 seconds of I/O history to implicitly identify system dynamics/parameters

Model or implementation: 1D CNN (2 hidden layers: [6, 32, 3] and [4, 16, 2])
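To get a feel for how much the encoder compresses the long history, the following sketch computes the temporal length after each convolution. It assumes each triple reads as [kernel size, number of filters, stride] and that no padding is used; neither assumption is confirmed by the summary above, and the function names are hypothetical.

```python
def conv1d_out_len(length: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded 1D convolution."""
    return (length - kernel) // stride + 1


# Assumed reading of the triples: (kernel size, number of filters, stride).
LAYERS = [(6, 32, 3), (4, 16, 2)]
HISTORY_STEPS = 66  # long I/O history length


def encoder_shapes(length, layers):
    """Return (channels, length) after each layer."""
    shapes = []
    for kernel, filters, stride in layers:
        length = conv1d_out_len(length, kernel, stride)
        shapes.append((filters, length))
    return shapes


shapes = encoder_shapes(HISTORY_STEPS, LAYERS)
flat_features = shapes[-1][0] * shapes[-1][1]
print(shapes, flat_features)
```

Under these assumptions the 66-step history shrinks to 21 and then 9 time steps, so the flattened encoding passed on to the base network would have 16 × 9 = 144 features.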

Control Policy (Base Network)

Determines optimal motor positions based on current command, reference motion, and history

Model or implementation: MLP (2 hidden layers of 512 tanh units each)

Joint-Level Controller

Converts desired positions into motor torques at high frequency (2 kHz)

Model or implementation: PD Controller
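A minimal sketch of the joint-level PD law, run at 2 kHz on a toy single joint. The gains `kp` and `kd`, the unit-inertia joint model, and the zero desired velocity are all assumptions made for illustration; the robot's actual gains and dynamics are not given in this summary.

```python
def pd_torque(q_des, q, q_dot, kp, kd):
    """PD law with zero desired velocity: tau = kp * (q_des - q) - kd * q_dot."""
    return kp * (q_des - q) - kd * q_dot


def simulate_joint(q_des, seconds=1.0, pd_hz=2000, kp=100.0, kd=20.0):
    """Drive a hypothetical unit-inertia joint toward q_des with the PD law."""
    dt = 1.0 / pd_hz
    q, q_dot = 0.0, 0.0
    for _ in range(int(seconds * pd_hz)):
        tau = pd_torque(q_des, q, q_dot, kp, kd)
        q_dot += tau * dt  # unit inertia: acceleration equals torque
        q += q_dot * dt    # semi-implicit Euler step
    return q


final_q = simulate_joint(0.5)
print(final_q)
```

In the pipeline described above, `q_des` would be the policy's latest action, held constant between policy updates while the PD loop keeps running at its own rate.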

Novel Architectural Elements

Dual-history input structure: explicitly separates immediate feedback (short history) from implicit system identification (long history encoded via CNN) within the policy inputs
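One simple way to maintain the two history streams is a pair of fixed-length ring buffers that receive the same observation/action pairs. The sketch below uses the 4-step and 66-step lengths from the summary; the class name `IOHistory` and the placeholder observations and actions are hypothetical.

```python
from collections import deque

SHORT_STEPS = 4   # short history for immediate feedback
LONG_STEPS = 66   # long history fed to the CNN encoder


class IOHistory:
    """Fixed-length buffer of (observation, action) pairs, oldest first."""

    def __init__(self, maxlen):
        self._buf = deque(maxlen=maxlen)

    def push(self, observation, action):
        self._buf.append((tuple(observation), tuple(action)))

    def as_list(self):
        return list(self._buf)

    def __len__(self):
        return len(self._buf)


short_hist = IOHistory(SHORT_STEPS)
long_hist = IOHistory(LONG_STEPS)

for t in range(100):
    obs = (float(t),)   # placeholder observation
    act = (float(-t),)  # placeholder action
    short_hist.push(obs, act)
    long_hist.push(obs, act)

print(len(short_hist), len(long_hist))
```

Both buffers see identical data; only the window length differs, so the short stream is always the most recent tail of the long one.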

Modeling

Base Model: Custom MLP + 1D CNN architecture

Training Method: Model-free Reinforcement Learning

Objective Functions:

Purpose: Maximize the expected return (cumulative discounted reward) to learn an optimal policy.

Formally: E[sum(gamma^t * r_t)]
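As a worked example of the objective, the sketch below computes the discounted return of a single reward sequence and approximates the expectation by averaging over sampled episodes. The reward values and the discount factor are made up for illustration; the summary does not report the value of gamma.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t with a backward recursion."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def estimate_objective(episodes, gamma):
    """Monte Carlo estimate of E[sum_t gamma^t * r_t] over sampled episodes."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)


# Hypothetical rewards and discount factor, for illustration only.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
average = estimate_objective([[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]], gamma=0.5)
print(example, average)
```

With gamma = 0.5, the first episode returns 1 + 0.5 + 0.25 = 1.75 and the second returns 0.25 × 4 = 1.0, so the estimate is their mean, 1.375.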

Key Hyperparameters:

policy_frequency: 33 Hz
pd_controller_frequency: 2 kHz
history_length_short: 4 timesteps (approx 0.1s)
history_length_long: 66 timesteps (approx 2s)
action_std_dev: 0.1 (fixed)
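The time scales implied by these hyperparameters can be checked with a few lines of arithmetic. The sketch assumes the histories are sampled at the policy rate, which is consistent with 66 steps spanning about 2 seconds; the variable names are hypothetical.

```python
POLICY_HZ = 33    # policy_frequency
PD_HZ = 2000      # pd_controller_frequency
SHORT_STEPS = 4   # history_length_short
LONG_STEPS = 66   # history_length_long

# Assumption: one history entry is recorded per policy step.
short_seconds = SHORT_STEPS / POLICY_HZ
long_seconds = LONG_STEPS / POLICY_HZ

# Number of PD updates that run, on average, between two policy updates.
pd_ticks_per_policy_step = PD_HZ / POLICY_HZ

print(round(short_seconds, 3), long_seconds, round(pd_ticks_per_policy_step, 1))
```

The short history covers roughly 0.12 s and the long history exactly 2 s, while each policy action is held for about 60 PD updates.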

Compute: Not reported in the paper

Comparison to Prior Work

vs. HZD: Ours handles aperiodic skills (jumping) and adapts to dynamics changes online without strictly defined periodic orbits
vs. Single-skill RL: Ours uses a unified architecture for diverse skills (walk, run, jump) with task randomization
vs. RMA [not cited in paper]: Ours uses direct end-to-end training with I/O history rather than a two-stage teacher-student distillation process

Limitations

Requires skill-specific reference motions (from mocap or optimization) as input
Policy queries are relatively low frequency (33 Hz) compared to the PD loop (2 kHz)
Does not explicitly model contact switches, relying on RL to learn them implicitly

Reproducibility

No public code repository link provided in the paper text. Reference motions and training curriculum are described conceptually. Experimental videos are available.

📊 Experiments & Results

Evaluation Setup

Real-world deployment on the Cassie bipedal robot and simulation comparison

Benchmarks:

Cassie Hardware Experiments: real-world locomotion (walk, run, jump)

Metrics:

Finish time (running)
Leap distance (jumping)
Flight phase duration
Maximum resistible push force (robustness)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
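Real-world running performance compared with prior RL and optimization-based methods: the 400m dash shows sustained running with turning that the baseline could not perform, while the 100m dash time is 2.33s slower than the baseline.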
Cassie Hardware Experiments	400m Dash Finish Time	Not capable of turning	2 min 34 sec	Success vs Failure
Cassie Hardware Experiments	100m Dash Finish Time	24.73s	27.06s	+2.33s
Jumping capability comparisons showing significantly increased range and agility.
Cassie Hardware Experiments	Maximum Leap Distance	0.41m	1.4m	+0.99m
Cassie Hardware Experiments	Longest Flight Phase	0.42s	0.58s	+0.16s
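The Δ column and the implied average speeds can be recomputed directly from the table. The numbers below are copied from the rows above; the variable names are hypothetical, and the speeds are plain distance-over-time averages rather than figures reported by the paper.

```python
def to_seconds(minutes, seconds):
    return 60 * minutes + seconds


# Values copied from the results table.
dash_400_s = to_seconds(2, 34)                   # this paper, 400m dash
dash_100_ours_s, dash_100_base_s = 27.06, 24.73  # 100m dash times
leap_ours_m, leap_base_m = 1.4, 0.41             # maximum leap distance
flight_ours_s, flight_base_s = 0.58, 0.42        # longest flight phase

speed_400 = 400 / dash_400_s
speed_100_ours = 100 / dash_100_ours_s
speed_100_base = 100 / dash_100_base_s

delta_100 = dash_100_ours_s - dash_100_base_s    # positive means slower
delta_leap = leap_ours_m - leap_base_m
delta_flight = flight_ours_s - flight_base_s

print(round(speed_400, 2), round(speed_100_ours, 2), round(speed_100_base, 2))
print(round(delta_100, 2), round(delta_leap, 2), round(delta_flight, 2))
```

For the timed dashes a positive Δ means a slower finish, whereas for leap distance and flight phase a positive Δ is an improvement.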

Experiment Figures

Collage of Cassie performing three distinct skills in the real world: Walking (push recovery), Running, and Jumping.

Main Takeaways

The dual-history architecture enables zero-shot transfer for both periodic (walking/running) and aperiodic (jumping) skills, a versatility previously difficult to achieve with a single framework.
Task randomization during training is identified as a key source of robustness, allowing the robot to generalize to unseen disturbances effectively.
The system demonstrates long-term consistency, with controllers performing reliably on hardware over a one-year period despite wear and tear.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (POMDP formulation)
Robotics Dynamics (floating base, underactuated systems)
Control Theory (PD controllers, system identification)

Key Terms

I/O History: Input/Output History—a sequence of past actions taken by the robot and the resulting sensor observations, used to infer system state

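PD Controller: Proportional-Derivative Controller, a feedback controller that computes the error between a desired setpoint and a measured value and outputs a correction proportional to that error and to its rate of change (here, motor torques from desired and measured motor positions)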

POMDP: Partially Observable Markov Decision Process—a mathematical framework for modeling decision-making where the agent cannot directly observe the full state of the environment

Sim-to-real: Transferring a policy trained in a physics simulation to a physical robot, often requiring techniques to bridge the 'reality gap'

DoF: Degrees of Freedom—the number of independent parameters that define the configuration of a mechanical system (Cassie has 20 DoF)

Floating base: A robot base that is not fixed to the ground (like a humanoid torso), possessing 6 degrees of freedom (position and orientation)

Zero-shot transfer: Deploying a learned policy directly to the target environment (real robot) without any further training or fine-tuning on the real hardware