Real-world humanoid locomotion with reinforcement learning

📝 Paper Summary

Humanoid Robotics, Legged Locomotion

A causal transformer controller, trained via large-scale reinforcement learning on randomized simulations, enables a blind humanoid robot to traverse diverse outdoor terrains and adapt to disturbances zero-shot.

Core Problem

Classical controllers for humanoids struggle to generalize to unstructured environments, while previous learning-based methods (like LSTMs or explicit estimators) often fail to capture the long-term context needed for robust adaptation.

Why it matters:

Humanoids have high potential for general-purpose labor but require controllers that function in diverse, unstructured real-world environments
Designing explicit estimators for every terrain property (friction, compliance) is brittle and difficult to scale

Concrete Example: When a blind robot's foot gets trapped by a step, classical controllers or simple policies often fail to react, leading to a fall. The proposed model uses history to detect the collision and lifts the leg higher on the next attempt.

Key Novelty

Causal Transformer for Locomotion

Hypothesizes that a history of proprioceptive observations and actions implicitly encodes environment properties (like terrain friction or obstacles).
Uses a Causal Transformer to process this history, allowing the policy to perform 'in-context learning'—adapting behavior (e.g., gait changes) at test time without updating weights.
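To make the causal-attention idea above concrete, here is a minimal single-head sketch in NumPy. It is an illustration only, not the paper's implementation: the function name, the single-head form, and the weight matrices `w_q`, `w_k`, `w_v` are assumptions introduced here. The point it shows is that the output at timestep t depends only on history rows 0..t.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with a causal mask (illustrative sketch).

    x: (T, d) array, one row per timestep of the observation/action history.
    Returns a (T, d_v) array whose row t depends only on rows 0..t of x.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])              # (T, T) similarities
    T = scores.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # hide future timesteps
    scores = scores - scores.max(axis=-1, keepdims=True) # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Hypothetical sizes, chosen only for the demo.
rng = np.random.default_rng(0)
T, d = 6, 8
x = rng.normal(size=(T, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)
```

Because of the mask, perturbing later rows of `x` leaves earlier rows of `out` unchanged, which is what lets such a policy run online as new sensor readings arrive.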

Architecture

Inference pipeline using a Causal Transformer.

Evaluation Highlights

Achieved zero falls during one week of full-day testing in outdoor environments including plazas, sidewalks, and grass fields.
Successfully traversed real-world slopes of up to 8.7% grade and maintained stability under external disturbances like pushes and yoga ball throws.
Demonstrated emergent behavioral adaptation, such as altering gait for slopes and recovering from foot-trapping events, which were not explicitly programmed.

Breakthrough Assessment

9/10

Demonstrates highly robust, zero-shot sim-to-real transfer for a full-sized humanoid on difficult terrains using a pure learning-based approach, outperforming commercial model-based controllers in stability.

⚙️ Technical Details

Problem Definition

Setting: Blind locomotion control on a floating-base humanoid robot

Inputs: History of proprioceptive observations (joint positions, velocities) and past actions

Outputs: Next action (target joint positions/velocities)

Pipeline Flow

Proprioceptive Sensors & Action History
Causal Transformer (Policy)
Action Output (Joint Targets)
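The pipeline above implies a rolling window of recent observations and actions that is fed to the policy at every control step. The sketch below shows one way such a buffer might look. The class name, the zero padding at startup, and the choice to concatenate each observation with the previous action into one token are assumptions made for illustration; the summary does not specify them.

```python
from collections import deque
import numpy as np

class HistoryBuffer:
    """Fixed-length window of (observation, previous action) tokens."""

    def __init__(self, context_len, obs_dim, act_dim):
        pad = np.zeros(obs_dim + act_dim)  # assumed: zero-pad before history fills
        self.tokens = deque([pad] * context_len, maxlen=context_len)

    def push(self, obs, prev_action):
        # Appending to a full deque drops the oldest token automatically.
        self.tokens.append(np.concatenate([obs, prev_action]))

    def as_array(self):
        # Shape (context_len, obs_dim + act_dim), oldest token first.
        return np.stack(list(self.tokens))
```

At each step a controller would call `push` with the latest sensor reading and the last commanded action, then pass `as_array()` to the policy network.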

System Modules

Causal Transformer

Processes the history of observations and actions to predict the next action

Model or implementation: Transformer (192-dimensional hidden state)

Novel Architectural Elements

Use of a generic Causal Transformer architecture typically used for language modeling to process long-context robot sensor history for implicit environment estimation

Modeling

Base Model: Causal Transformer

Training Method: Large-scale model-free reinforcement learning

Objective Functions:

Purpose: Minimize energy consumption.
Formally: Energy minimization terms in the reward function.

Purpose: Track velocity commands.
Formally: Velocity tracking terms in the reward function.
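The summary names the two reward components but not their exact form. As a rough sketch of how velocity tracking and energy terms are commonly combined in legged-locomotion RL, the function below uses an exponential tracking kernel and a mechanical-power penalty. The functional form, the weights `w_track` and `w_energy`, and the scale `sigma` are all assumptions for illustration, not values from the paper.

```python
import numpy as np

def locomotion_reward(base_vel_xy, cmd_vel_xy, joint_torques, joint_vels,
                      w_track=1.0, w_energy=1e-3, sigma=0.25):
    """Illustrative reward: track commanded velocity, penalize energy use."""
    base_vel_xy = np.asarray(base_vel_xy, dtype=float)
    cmd_vel_xy = np.asarray(cmd_vel_xy, dtype=float)
    # Tracking term is 1.0 for perfect tracking and decays with squared error.
    track_err = np.sum((cmd_vel_xy - base_vel_xy) ** 2)
    r_track = np.exp(-track_err / sigma)
    # Energy term: sum of absolute mechanical power over the joints.
    power = np.sum(np.abs(np.asarray(joint_torques, dtype=float)
                          * np.asarray(joint_vels, dtype=float)))
    return w_track * r_track - w_energy * power
```

With these assumed weights, a policy that tracks the command exactly while drawing less power scores higher than one that tracks equally well at greater energy cost.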

Training Data:

Ensemble of randomized environments in simulation (thousands of environments)
Slopes up to 10% grade in simulation
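A small sketch of how per-environment parameters might be sampled for such an ensemble. Only the 10% maximum grade comes from the summary; the parameter names and the friction and payload ranges are hypothetical placeholders. The helper also converts grade (rise over run) to an angle, which shows that a 10% grade is roughly 5.7 degrees and the 8.7% real-world slope is roughly 5.0 degrees.

```python
import math
import random
from dataclasses import dataclass

@dataclass
class EnvParams:
    friction: float      # assumed parameter, hypothetical range
    payload_kg: float    # assumed parameter, hypothetical range
    slope_grade: float   # rise over run, e.g. 0.10 for a 10% grade

def grade_to_degrees(grade):
    """Convert a slope grade (rise/run) to an incline angle in degrees."""
    return math.degrees(math.atan(grade))

def sample_env(rng, max_grade=0.10):
    """Draw one randomized environment; ranges are illustrative only."""
    return EnvParams(
        friction=rng.uniform(0.3, 1.2),
        payload_kg=rng.uniform(0.0, 5.0),
        slope_grade=rng.uniform(-max_grade, max_grade),
    )

rng = random.Random(0)
envs = [sample_env(rng) for _ in range(1000)]  # one draw per parallel environment
```

In a training setup of this kind, each simulated environment would be built from one such draw, so the policy never sees exactly the same physics twice.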

Key Hyperparameters:

hidden_state_dimension: 192

Comparison to Prior Work

vs. Agility controller: this work is learning-based rather than model-based, and it recovers from foot-trapping events where the Agility controller shuts down
vs. LSTM/TCN: this work uses a Causal Transformer with a longer context history for better in-context adaptation [implied by ablation analysis]

Limitations

Robot is blind (no exteroceptive sensors like cameras), leading to inevitable collisions with obstacles like steps.
Does not explicitly handle safety constraints, though none were violated during testing.
Heavy reliance on simulation fidelity and domain randomization.

Reproducibility

No code or model weights provided. The paper relies on the Agility Robotics Digit platform and proprietary high-fidelity simulators.

📊 Experiments & Results

Evaluation Setup

Real-world outdoor deployment and controlled indoor/simulated stress tests

Benchmarks:

Outdoor Terrain Traversal (locomotion reliability) [New]
Agility Simulator Benchmarks (comparative stability analysis on slopes, steps, and unstable ground)

Metrics:

Success rate (crossing terrain without falling)
Velocity tracking error
Statistical methodology: 95% Confidence Intervals reported for simulation success rates
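The summary does not say how the 95% confidence intervals were computed. For a success rate over a fixed number of trials, one common choice is the Wilson score interval, sketched below as an assumption rather than as the paper's method. The function name and the default `z=1.96` are choices made here.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return center - half, center + half

# Hypothetical counts for illustration, not results from the paper.
low, high = wilson_interval(successes=50, trials=100)
```

Unlike the simple normal approximation, this interval stays inside [0, 1] even when every trial succeeds, which matters for benchmarks where the success rate is close to 100%.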

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Outdoor walking (1 week)	Falls	Not reported in the paper	0	Not reported in the paper
Slope traversal	Max Grade (%)	Not reported in the paper	8.7	Not reported in the paper
High-speed walking	Velocity (m/s)	0	1.0	+1.0

Experiment Figures

Neural activity analysis during terrain transitions (flat -> slope -> flat).

Neural activity during a foot-trapping event and recovery.

Main Takeaways

The Causal Transformer controller significantly outperforms the state-of-the-art model-based controller on irregular terrains (steps, unstable planks) in simulation.
Emergent behaviors such as arm swinging and gait adjustment (short steps on slopes) appear without explicit reward engineering.
In-context adaptation is supported by neural activity analysis: activation patterns cluster by terrain type (flat vs. slope) and change distinctly during foot-trapping events.
The system demonstrates robustness to significant external disturbances (yoga ball throws, pushing/pulling) and carries diverse payloads (trash bags, backpacks) without retraining.
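The neural activity analysis mentioned in the takeaways relies on projecting high-dimensional hidden states into a few dimensions for inspection. The sketch below shows a plain PCA projection via SVD on synthetic data. The two synthetic clusters stand in for "flat" and "slope" activations and are fabricated purely for the demo; only the 192-dimensional width is taken from the summary.

```python
import numpy as np

def pca_project(hidden_states, n_components=2):
    """Project (N, D) hidden states onto their top principal components."""
    centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Synthetic stand-ins for hidden states recorded on two terrain types.
rng = np.random.default_rng(0)
flat = rng.normal(size=(50, 192))
slope = rng.normal(size=(50, 192)) + 3.0   # shifted cluster, for illustration
projected = pca_project(np.vstack([flat, slope]))
```

Plotting the two columns of `projected` and coloring points by terrain label is the usual way to check whether activations separate by condition.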

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL)
Transformer Architectures
Robotics Control Theory

Key Terms

proprioceptive: Sensing related to the robot's own internal state, such as joint angles and motor velocities, without external sensors like cameras.

zero-shot: Deploying a model in a new environment (real world) that it was not explicitly trained on, relying entirely on generalization from training (simulation).

causal transformer: A type of Transformer model that attends only to current and past inputs (never future ones) when predicting the next token or action, so each output depends only on what has already been observed.

sim-to-real: The process of transferring a policy trained in a physics simulator to a physical robot.

domain randomization: A technique where simulation parameters (friction, mass, etc.) are randomly varied during training to make the policy robust to real-world variations.

in-context learning: The ability of a model to adapt its behavior based on the sequence of inputs it receives at test time, without changing its internal weights.

floating-base: A robot model where the base (torso) is not fixed to the world frame and can move freely in space (6 degrees of freedom).

PD controller: Proportional-Derivative controller—a common feedback control loop used to drive robot joints to desired setpoints.

PCA: Principal Component Analysis—a dimensionality reduction technique.

t-SNE: t-Distributed Stochastic Neighbor Embedding—a visualization technique for high-dimensional data.
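As a concrete companion to the PD controller entry above, the sketch below converts joint position targets (the kind of output the policy produces) into joint torques. The gains and the zero default for the target velocity are illustrative assumptions, not values from the paper.

```python
import numpy as np

def pd_torque(q_target, q, q_dot, kp, kd, q_dot_target=0.0):
    """PD law: torque proportional to position error and velocity error."""
    return kp * (q_target - q) + kd * (q_dot_target - q_dot)

# Two joints with hypothetical gains and states.
tau = pd_torque(
    q_target=np.array([0.5, 0.0]),   # desired joint positions (rad)
    q=np.array([0.0, 0.0]),          # measured joint positions (rad)
    q_dot=np.array([0.0, 1.0]),      # measured joint velocities (rad/s)
    kp=100.0,
    kd=5.0,
)
```

The proportional term pulls each joint toward its target, while the derivative term damps motion, which is why the second joint receives a negative torque even though it is already at its target position.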