Humanoid Locomotion and Manipulation: Current Progress and Challenges in Control, Planning, and Learning

📝 Paper Summary

Humanoid Robotics, Loco-manipulation, Sim-to-Real Transfer, Whole-body Control

This survey synthesizes 30 years of humanoid research, highlighting the convergence of model-based control and learning-based methods to achieve unified whole-body loco-manipulation rather than separate locomotion and manipulation.

Core Problem

Humanoid robots have historically treated locomotion (balancing/walking) and manipulation (hand tasks) as separate problems, failing to leverage the whole body for complex, contact-rich real-world tasks.

Why it matters:

Separate control schemes limit robots to static bases or simple walking, preventing human-level tasks like carrying heavy loads up stairs or pushing large objects
Current humanoids are far less energy-efficient than humans (Cost of Transport above 0.7 versus roughly 0.2 for humans), partly because of rigid control rather than compliant whole-body dynamics
Real-world deployment requires handling unstructured environments and safe physical collaboration with humans, which pure trajectory tracking cannot accommodate

Concrete Example: Humans use elbows or hips to hold a door open while carrying a box (whole-body manipulation). In contrast, most traditional humanoid controllers strictly limit interaction to feet and fingertips, treating any other body contact as a disturbance to be rejected rather than a useful leverage point.

Key Novelty

Comprehensive Survey of Humanoid Loco-Manipulation (HLM)

Reviews the evolution from distinct Model-Based methods (MPC, WBC) to integrated Learning-Based approaches (Sim-to-Real RL, Imitation Learning)
Identifies 'Loco-Manipulation' as the critical capability gap, defined as the simultaneous coordination of locomotion and manipulation using the whole body
Proposes that Foundation Models (FMs) and whole-body tactile sensing are the emerging pillars needed to solve high-level planning and contact-rich interaction

Architecture

A roadmap of the humanoid robotics field, organizing sub-disciplines into a hierarchy from Hardware/Design up to Planning/Decision Making.

Breakthrough Assessment

8/10

An extensive, timely survey that bridges the gap between classical control theory and modern AI in robotics. Essential reading for understanding the current 'humanoid race' involving companies like Tesla and Figure.

⚙️ Technical Details

Problem Definition

Setting: Humanoid Loco-Manipulation (HLM) with whole-body coordination

Inputs: High-level task commands, proprioceptive state, and exteroceptive sensing (vision/tactile)

Outputs: Joint torques or position commands for all degrees of freedom (arms + legs + torso) simultaneously

Pipeline Flow

Task Planner (Foundation Model / Graph Search)
Motion Planner / Trajectory Optimizer
Whole-Body Controller (WBC) / Low-level Policy
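
The three stages above can be read as nested loops, with each layer refining the output of the one above it. The sketch below is a toy illustration of that layering only: every function name, the string-based sub-tasks, and the proportional tracker are hypothetical placeholders, not components described in the survey.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Command:
    joint_deltas: List[float]  # per-step joint increments (toy stand-in for torques)


def task_planner(goal: str) -> List[str]:
    """Hypothetical stand-in for a foundation model or graph search."""
    return [f"walk_to:{goal}", f"grasp:{goal}"]


def motion_planner(subtask: str, horizon: int = 3) -> List[List[float]]:
    """Hypothetical stand-in for a trajectory optimizer: returns joint waypoints."""
    scale = 1.0 if subtask.startswith("walk_to") else 0.5
    return [[scale * (k + 1) / horizon] * 2 for k in range(horizon)]


def low_level_controller(waypoint: List[float], state: List[float],
                         gain: float = 0.5) -> Command:
    """Proportional tracking as a placeholder for a WBC or a learned policy."""
    return Command([gain * (w - s) for w, s in zip(waypoint, state)])


def run(goal: str, steps_per_waypoint: int = 4) -> List[Tuple[str, List[float]]]:
    state = [0.0, 0.0]
    log = []
    for subtask in task_planner(goal):              # slowest layer
        for waypoint in motion_planner(subtask):    # middle layer
            for _ in range(steps_per_waypoint):     # fastest layer
                cmd = low_level_controller(waypoint, state)
                state = [s + d for s, d in zip(state, cmd.joint_deltas)]
            log.append((subtask, list(state)))
    return log


log = run("box")
```

In a real system the layers would run asynchronously at different rates, and the low-level layer would consume sensor feedback rather than an idealized state update.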

System Modules

Task Planner

Decomposes abstract goals into executable sub-tasks

Model or implementation: Foundation Models (e.g., GPT-4, PaLM) or Symbolic Planners

Predictive Controller

Generates dynamic trajectories satisfying stability constraints

Model or implementation: Model Predictive Control (MPC) solving Optimal Control Problems

Reactive Controller

Computes instantaneous joint torques to track planned trajectories

Model or implementation: Whole-Body Control (WBC) via Quadratic Programming (QP)
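
As a rough illustration of the reactive layer, the sketch below solves a weighted least-squares problem over joint accelerations, which is the unconstrained core of a QP-based WBC. It is a simplification under stated assumptions: the Jacobians, weights, and the `weighted_wbc` helper are hypothetical, and the dynamics, friction-cone, and torque-limit constraints that a real WBC enforces are omitted (they would require an actual QP solver).

```python
def solve_linear(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def weighted_wbc(tasks, n_joints, reg=1e-6):
    """Minimize sum_i w_i * ||J_i * qdd - a_i||^2 + reg * ||qdd||^2.

    tasks: list of (weight, J, a_des), with J given as a list of rows.
    Solves the normal equations H * qdd = g.
    """
    H = [[reg if i == j else 0.0 for j in range(n_joints)]
         for i in range(n_joints)]
    g = [0.0] * n_joints
    for w, J, a_des in tasks:
        for row, a_k in zip(J, a_des):
            for i in range(n_joints):
                g[i] += w * row[i] * a_k
                for j in range(n_joints):
                    H[i][j] += w * row[i] * row[j]
    return solve_linear(H, g)


# Toy 3-joint example: a high-weight "hand" task and a low-weight posture task.
hand_task = (100.0, [[1.0, 1.0, 0.0]], [2.0])
posture_task = (1.0, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
                [0.0, 0.0, 0.0])
qdd = weighted_wbc([hand_task, posture_task], n_joints=3)
```

The high weight on the hand task means it is tracked almost exactly, while the posture task only resolves the remaining redundancy. Strict task hierarchies or hard constraints would replace this soft weighting in a full controller.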

Novel Architectural Elements

Converged Predictive-Reactive Hierarchy: The standardization of MPC (predictive) + WBC (reactive) as the backbone of model-based humanoid control
Integration of Foundation Models as the 'Task Planner' layer within the classical hierarchy to enable semantic reasoning
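
To make the predictive half of this hierarchy concrete, here is a minimal receding-horizon sketch on a scalar linear system. It assumes an unconstrained quadratic cost, so the optimal control problem reduces to a backward Riccati recursion instead of a QP; the system parameters and function names are illustrative assumptions, not values from the survey.

```python
def first_input_gain(a, b, q, r, q_terminal, horizon):
    """Backward Riccati recursion for x[k+1] = a*x[k] + b*u[k].

    Returns the feedback gain of the first step of the horizon.
    """
    p = q_terminal
    gain = 0.0
    for _ in range(horizon):
        gain = (a * b * p) / (r + b * b * p)
        p = q + a * a * p - a * b * p * gain
    return gain


def run_mpc(x0, steps, a=1.0, b=0.1, q=1.0, r=0.01, horizon=20):
    """Re-plan at every step, apply only the first input, then shift the horizon."""
    x, traj = x0, [x0]
    for _ in range(steps):
        # Re-solving is redundant for this time-invariant toy model, but it
        # mirrors what a real MPC does with a fresh state estimate each cycle.
        k = first_input_gain(a, b, q, r, q_terminal=q, horizon=horizon)
        u = -k * x
        x = a * x + b * u
        traj.append(x)
    return traj


traj = run_mpc(x0=1.0, steps=30)
```

A humanoid MPC differs in the parts this sketch leaves out: nonlinear or centroidal dynamics, contact schedules, and inequality constraints, which is where the real-time cost arises.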

Comparison to Prior Work

vs. ZMP-based: Modern HLM methods handle multi-contact scenarios and external forces, whereas the classical ZMP criterion assumes coplanar foot contacts on flat ground and does not account for hand contacts
vs. Separate Control: HLM unifies optimization for arms and legs, allowing the robot to use its weight/momentum to assist manipulation (e.g., throwing)
vs. Pure RL: Model-based methods offer safety guarantees and precise force control, while RL offers robustness to unseen terrain; the survey advocates for hybrid approaches
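
For readers unfamiliar with the ZMP baseline in the comparison above, the following sketch evaluates the ZMP under the standard cart-table (linear inverted pendulum) approximation, p = x - (z_c / g) * x_ddot, and checks it against a support region. The axis-aligned rectangular footprint and all numeric values are simplifying assumptions for illustration.

```python
G = 9.81  # gravitational acceleration, m/s^2


def zmp_cart_table(com_pos, com_acc, com_height):
    """ZMP along one horizontal axis for a CoM moving at constant height."""
    return com_pos - (com_height / G) * com_acc


def zmp_inside_support(zmp_xy, support_min_xy, support_max_xy):
    """True if the ZMP lies inside an axis-aligned rectangular support region."""
    return all(lo <= p <= hi
               for p, lo, hi in zip(zmp_xy, support_min_xy, support_max_xy))


# Illustrative case: CoM above the ankle, accelerating forward at 2 m/s^2.
zmp_x = zmp_cart_table(com_pos=0.0, com_acc=2.0, com_height=0.8)
zmp_y = zmp_cart_table(com_pos=0.0, com_acc=0.0, com_height=0.8)
balanced = zmp_inside_support((zmp_x, zmp_y), (-0.10, -0.05), (0.15, 0.05))
```

Here the forward acceleration pushes the ZMP behind the heel, so the check fails. The formula has no term for hand or non-coplanar contacts, which is the limitation the comparison points to.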

Limitations

Computational efficiency remains a bottleneck for real-time nonlinear MPC on high-DoF humanoids
Sim-to-Real gap persists due to imperfect physics modeling, particularly for soft contacts and deformable objects
Lack of whole-body tactile sensing hardware restricts the ability to perform sensitive, contact-rich manipulation tasks
Current energy efficiency (COT > 0.7) is far worse than human capability (COT ≈ 0.2), limiting operation time
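
The effect of the efficiency gap on operation time can be estimated from the definition COT = E / (m g d). The sketch below uses made-up mass, energy, and battery figures purely for illustration (they are not taken from the survey) and ignores all non-locomotion power draw.

```python
G = 9.81  # gravitational acceleration, m/s^2


def cost_of_transport(energy_j, mass_kg, distance_m):
    """Dimensionless COT = E / (m * g * d)."""
    return energy_j / (mass_kg * G * distance_m)


def walking_range_m(battery_j, mass_kg, cot):
    """Distance covered if the whole battery went into locomotion at a fixed COT."""
    return battery_j / (cot * mass_kg * G)


# Illustrative numbers only: a 50 kg robot spending 35 kJ to walk 100 m,
# and a 1 kWh (3.6 MJ) battery.
robot_cot = cost_of_transport(energy_j=35_000.0, mass_kg=50.0, distance_m=100.0)
range_robot = walking_range_m(battery_j=3.6e6, mass_kg=50.0, cot=0.7)
range_humanlike = walking_range_m(battery_j=3.6e6, mass_kg=50.0, cot=0.2)
```

Because range scales inversely with COT, closing the gap from 0.7 to 0.2 would multiply the walking range by 3.5 for the same battery and mass.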

Reproducibility

Survey paper; cites numerous open-source projects but does not provide a single codebase. Mentions commercial platforms like Boston Dynamics Atlas, Tesla Optimus, and Figure 01.

📊 Experiments & Results

Evaluation Setup

Review of literature spanning 30 years, categorizing systems by capability (locomotion, manipulation, loco-manipulation) and method (model-based, learning-based).

Benchmarks:

DARPA Robotics Challenge: disaster-response tasks (valves, doors, debris)
Cost of Transport (COT): energy efficiency metric

Metrics:

Success rate in unstructured environments
Robustness to external pushes
Inference frequency (Hz)
Cost of Transport (COT)
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

Convergence of Paradigms: The debate between model-based and learning-based control is dissolving; Sim-to-Real RL relying on accurate dynamics is effectively 'model-based', and hybrid systems (RL for planning, WBC for execution) are becoming standard.
Efficiency Gap: Modern humanoids have a Cost of Transport (COT) > 0.7, significantly less efficient than humans (COT ≈ 0.2), driving the need for better passive-compliant hardware and natural dynamics exploitation.
The 'Loco-Manipulation' Shift: The field is moving from static-base manipulation to whole-body involvement, where legs and torso actively contribute to manipulation forces (e.g., opening heavy doors).
Role of Foundation Models: FMs are identified as a way to address the 'intention' problem, letting robots infer why a human acts rather than merely copy the motion, which enables generalist task planning.
Tactile Sensing Void: Whole-body tactile sensing is highlighted as a critical missing modality for safe physical human-robot interaction and handling contact-rich tasks.

📚 Prerequisite Knowledge

Prerequisites

Rigid body dynamics
Optimal Control (MPC, WBC)
Reinforcement Learning (RL) basics
Robotics kinematics

Key Terms

Loco-manipulation: Simultaneous execution of locomotion (moving the base) and manipulation (interacting with objects), requiring unified whole-body coordination

MPC: Model Predictive Control—an optimal control method that plans a trajectory by minimizing a cost function over a future time horizon, re-planning at every step

WBC: Whole-Body Control—a control framework that calculates joint commands to execute task-space objectives (e.g., hand position) while respecting constraints (balance, friction cones)

Sim-to-Real: A learning paradigm where a policy is trained in a physics simulator (often using RL) and then transferred to a physical robot, often requiring domain randomization to handle model mismatches

COT: Cost of Transport, a dimensionless measure of energy efficiency defined as energy expended per unit distance divided by body weight, COT = E / (m g d) (human COT ≈ 0.2, current humanoids > 0.7)

Centroidal Dynamics: A reduced-order model that describes the robot through the linear and angular momentum of the whole body expressed at its center of mass, driven by gravity and contact forces; used to reduce computational cost in MPC

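ZMP: Zero Moment Point, the point on the ground at which the horizontal components of the moment produced by gravity and inertial forces vanish; keeping the ZMP strictly inside the support polygon is the classical criterion for dynamic balance on flat ground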

Foundation Models: Large-scale pre-trained models (like LLMs or VLMs) used in robotics for high-level semantic reasoning, task planning, and understanding human intentions

Behavior Cloning: A form of imitation learning where the robot learns a policy by directly mimicking expert demonstrations (state-action pairs)
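
As a minimal illustration of behavior cloning, the sketch below fits a linear policy to state-action pairs by least squares. The scalar state, the affine `expert_policy`, and the demonstration set are hypothetical; practical systems use neural network policies and high-dimensional observations.

```python
def expert_policy(state):
    """Hypothetical expert: a fixed affine feedback law."""
    return -2.0 * state + 0.5


def fit_linear_policy(states, actions):
    """Closed-form least squares for action = slope * state + bias."""
    n = len(states)
    mean_s = sum(states) / n
    mean_a = sum(actions) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(states, actions))
    var = sum((s - mean_s) ** 2 for s in states)
    slope = cov / var
    return slope, mean_a - slope * mean_s


# Collect demonstrations (state-action pairs) from the expert.
demo_states = [-1.0 + 0.1 * i for i in range(21)]
demo_actions = [expert_policy(s) for s in demo_states]

# Supervised fit: the cloned policy imitates the expert on the visited states.
slope, bias = fit_linear_policy(demo_states, demo_actions)


def cloned_policy(state):
    return slope * state + bias
```

The fit is only trustworthy near the demonstrated states. Once a cloned policy drifts outside that range, its errors can compound, which is one reason imitation learning is often combined with RL or online corrections.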