Humanoid-Gym: Reinforcement Learning for Humanoid Robot with Zero-Shot Sim2Real Transfer

📝 Paper Summary

Humanoid Locomotion, Sim-to-Real Transfer, Reinforcement Learning for Robotics

Humanoid-Gym is an open-source reinforcement learning framework that enables humanoid robots to learn locomotion skills in simulation and transfer them to the real world zero-shot using specialized rewards and a sim-to-sim validation tool.

Core Problem

The complex structure of humanoid robots creates a larger sim-to-real gap compared to quadrupeds, making it difficult to transfer locomotion policies trained in simulation directly to physical hardware.

Why it matters:

Humanoid robots are uniquely suited for human-centric environments but are harder to control than other robot types due to stability and complexity issues.
Existing open-source resources for humanoid locomotion are lacking compared to quadrupeds, hindering research progress in this area.
Testing policies directly on expensive humanoid hardware is risky; robust simulation verification is needed before real-world deployment.

Concrete Example: A policy trained in a standard simulator (Isaac Gym) might exploit physics inaccuracies, causing a real humanoid robot to fall immediately upon deployment. Humanoid-Gym mitigates this by validating the policy in a second, higher-fidelity simulator (MuJoCo) before real-world attempts.

Key Novelty

Sim-to-Sim-to-Real Verification Pipeline

Introduces a rigorous validation step where policies trained in high-speed Isaac Gym are tested in high-fidelity MuJoCo simulations before real-world deployment.
Utilizes a specialized reward function designed for humanoids, focusing on velocity tracking, gait stability, and smooth foot contact patterns.
Employs extensive domain randomization to make the policy robust to physical uncertainties.
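A minimal sketch of per-episode domain randomization follows. The parameter names and ranges are hypothetical, chosen only for illustration; they are not taken from the paper or the Humanoid-Gym code.

```python
import random
from dataclasses import dataclass


@dataclass
class PhysicsParams:
    friction: float        # ground friction coefficient
    mass_scale: float      # multiplier applied to nominal link masses
    motor_strength: float  # multiplier applied to commanded torques


# Hypothetical ranges (assumptions, not values reported in the paper).
RANGES = {
    "friction": (0.2, 1.5),
    "mass_scale": (0.9, 1.1),
    "motor_strength": (0.8, 1.2),
}


def sample_physics(rng):
    """Draw one set of physics parameters uniformly from RANGES."""
    return PhysicsParams(**{k: rng.uniform(lo, hi) for k, (lo, hi) in RANGES.items()})
```

In a training loop, a draw like this would typically happen once per environment reset, so that each episode sees slightly different physics.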

Architecture

The Humanoid-Gym workflow, illustrating the training process in Isaac Gym, validation in MuJoCo, and deployment to the real robot.

Evaluation Highlights

Achieved successful zero-shot transfer to two real-world humanoid robots: RobotEra’s XBot-S (1.2 m tall) and XBot-L (1.65 m tall).
Demonstrated robust locomotion on both flat and uneven terrains in the real world using the same trained policy.
Calibrated MuJoCo simulation showed nearly identical joint trajectories to real-world data, validating the sim-to-sim framework's effectiveness.

Breakthrough Assessment

7/10

Significant contribution as a comprehensive open-source framework for humanoid RL, addressing the scarcity of such tools. The dual-simulation validation approach is practical and effective for bridging the sim-to-real gap.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) for locomotion control

Inputs: Proprioceptive sensor data, periodic clock signal, velocity commands, and privileged observations (training only)

Outputs: Target joint positions for the PD controller
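As a rough illustration of how the actor's input might be assembled, the sketch below concatenates a periodic clock signal with velocity commands and proprioceptive readings. The sine/cosine encoding of the clock, the `cycle_time` parameter, and the ordering of entries are assumptions made for this example; the summary only states that a periodic clock signal is part of the input.

```python
import math


def clock_signal(t, cycle_time):
    """Encode the gait phase as a (sin, cos) pair so it is continuous across cycles."""
    phase = 2.0 * math.pi * t / cycle_time
    return math.sin(phase), math.cos(phase)


def build_observation(t, cycle_time, command, joint_pos, joint_vel):
    """Concatenate clock, velocity command, and proprioception into one flat list."""
    sin_p, cos_p = clock_signal(t, cycle_time)
    return [sin_p, cos_p, *command, *joint_pos, *joint_vel]
```

Privileged observations used only during training would be passed to the critic separately and are left out of this sketch.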

Pipeline Flow

State Observation (Proprioception + Commands)
Policy Network (Actor)
PD Controller
Motor Actuation

System Modules

Policy Network

Maps observations to target joint positions

Model or implementation: Not explicitly detailed (likely a standard MLP, as is common with PPO)

PD Controller

Converts target positions to motor torques

Model or implementation: Standard PD control law
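The conversion from target joint positions to torques can be sketched as a per-joint proportional-derivative law. This is a generic sketch, not the framework's code: the zero target velocity, the gain names `kp` and `kd`, and the torque clipping step are assumptions.

```python
def pd_torques(q_target, q, qd, kp, kd, torque_limit):
    """Per-joint PD law: tau = kp * (q_target - q) - kd * qd, clipped to the limit.

    Assumes a target joint velocity of zero (a common simplification).
    """
    torques = []
    for qt, qi, qdi, kpi, kdi, lim in zip(q_target, q, qd, kp, kd, torque_limit):
        tau = kpi * (qt - qi) - kdi * qdi
        torques.append(max(-lim, min(lim, tau)))  # saturate at the actuator limit
    return torques
```

The proportional term pulls each joint toward the policy's target, while the derivative term damps joint velocity.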

Novel Architectural Elements

Integration of a Sim-to-Sim validation loop (Isaac Gym -> MuJoCo) within the training pipeline to verify physics robustness prior to real-world deployment

Modeling

Base Model: PPO-based RL Agent

Training Method: Proximal Policy Optimization (PPO) with Asymmetric Actor Critic

Objective Functions:

Purpose: Maximize expected return while maintaining policy stability.
Formally: PPO clipped surrogate objective.

Purpose: Estimate value function for advantage calculation.
Formally: MSE loss on value prediction.
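The two objectives can be written out in a few lines of plain Python. This is a textbook-style sketch of the PPO clipped surrogate and an MSE value loss, not the framework's implementation. The default `clip_eps=0.2` is a commonly used value and an assumption here, since the paper does not report its clip parameter.

```python
import math


def ppo_losses(logp_new, logp_old, advantages, values, returns, clip_eps=0.2):
    """Return (policy_loss, value_loss) averaged over a batch of samples."""
    n = len(advantages)
    surrogate = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic bound: take the smaller of the unclipped and clipped terms.
        surrogate += min(ratio * adv, clipped * adv)
    policy_loss = -surrogate / n  # negated because optimizers minimize
    value_loss = sum((v - r) ** 2 for v, r in zip(values, returns)) / n
    return policy_loss, value_loss
```

Clipping the probability ratio limits how far a single update can move the policy, which is the "policy stability" part of the first objective.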

Training Data:

Massively parallel simulation in NVIDIA Isaac Gym

Key Hyperparameters:

control_frequency: 100 Hz
pd_frequency: 1000 Hz
discount_factor: Not explicitly reported in the paper
clip_param: Not explicitly reported in the paper
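The two frequencies above imply that the PD loop runs ten times for every policy query. A minimal sketch of that nesting is shown below; the `policy` and `pd_step` callables and the function name are hypothetical placeholders.

```python
CONTROL_HZ = 100   # policy (control) frequency from the list above
PD_HZ = 1000       # inner PD loop frequency from the list above
DECIMATION = PD_HZ // CONTROL_HZ  # PD updates per policy step


def run_control(policy, pd_step, n_policy_steps):
    """Query the policy at CONTROL_HZ and hold its target for DECIMATION PD updates."""
    pd_calls = 0
    for _ in range(n_policy_steps):
        target = policy()              # new joint targets at 100 Hz
        for _ in range(DECIMATION):
            pd_step(target)            # torque update at 1000 Hz
            pd_calls += 1
    return pd_calls
```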

Compute: Training on GPU (Isaac Gym), Validation on CPU (MuJoCo)

Comparison to Prior Work

vs. Legged Gym: Specifically tailored for humanoid kinematics and stability challenges, incorporating sim-to-sim validation.
vs. Transformer-based approaches: Uses standard MLP policies with specialized domain randomization and reward shaping for efficiency and zero-shot transfer (the paper does not cite this as a direct comparison; it is implied context).

Limitations

Sim-to-real gap remains a challenge despite improvements.
Performance depends heavily on accurate URDF modeling of the specific robot.
Requires manual tuning of reward weights for specific gaits.

Reproducibility

Code: https://sites.google.com/view/humanoid-gym

Code is publicly available via the project page. The framework depends on Isaac Gym and MuJoCo. Robot-specific configurations for XBot-S and XBot-L are used for verification.

📊 Experiments & Results

Evaluation Setup

Sim-to-sim validation in MuJoCo and real-world deployment on XBot-S and XBot-L humanoids.

Benchmarks:

Sim-to-Sim Consistency (Physics Verification) [New]
Real-world Locomotion (Zero-shot Transfer) [New]

Metrics:

Joint position tracking error
Phase portrait similarity
Success of locomotion on flat and uneven terrain
Statistical methodology: Not explicitly reported in the paper
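Since the paper does not state how tracking error is computed, one plausible choice is the root-mean-square error between two joint trajectories sampled at the same time steps. The sketch below assumes that choice; the function name is hypothetical.

```python
import math


def joint_rmse(traj_a, traj_b):
    """Root-mean-square error between two equally sampled joint-angle trajectories."""
    if len(traj_a) != len(traj_b) or not traj_a:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = [(a - b) ** 2 for a, b in zip(traj_a, traj_b)]
    return math.sqrt(sum(squared) / len(squared))
```

For example, `joint_rmse(sim_knee, real_knee)` would give a single number summarizing how closely a simulated knee trajectory follows the measured one, where both lists are hypothetical recordings in radians.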

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Sim-to-Real Comparison	Phase Portrait Similarity	Qualitatively distinct from Real World	Qualitatively identical to Real World	Improved visual alignment
XBot-S & XBot-L	Terrain Traversal	Not reported in the paper	Successful traversal	Not applicable

Experiment Figures

Comparison of leg swing sine waves between MuJoCo simulation and real-world robot execution.

Phase portraits of left knee and ankle pitch joints for Isaac Gym, MuJoCo, and Real World.

Main Takeaways

The proposed sim-to-sim framework (Isaac Gym -> MuJoCo) effectively filters out policies that exploit simulator inaccuracies.
Specialized reward functions for humanoids (velocity tracking + gait priors) are sufficient for stable zero-shot transfer.
MuJoCo calibration allows it to serve as a high-fidelity proxy for the real world, enabling safe policy validation without hardware risks.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO)
Robotics Kinematics and Dynamics
Sim-to-Real Transfer techniques

Key Terms

PPO: Proximal Policy Optimization—a policy gradient method for reinforcement learning that alternates between sampling data through interaction with the environment and optimizing a 'surrogate' objective function

Sim-to-Real: The process of transferring policies trained in a simulated environment to a physical robot

Sim-to-Sim: Validating a policy trained in one simulator (e.g., Isaac Gym) by testing it in another simulator that uses a different physics engine (e.g., MuJoCo) to ensure robustness

PD Controller: Proportional-Derivative controller, a feedback controller that computes its output from the tracking error and the error's rate of change; here it converts target joint positions into motor torques

Domain Randomization: A technique where simulation parameters (friction, mass, etc.) are randomized during training to make the policy robust to variations in the real world

POMDP: Partially Observable Markov Decision Process—a mathematical framework for modeling decision-making in situations where the system state is only partially known