End-to-End Reinforcement Learning for Torque Based Variable Height Hopping

📝 Paper Summary

Legged Locomotion, Sim-to-Real Transfer, Robot Control

A model-free reinforcement learning torque controller enables a monoped robot to hop at variable heights using only proprioceptive data, eliminating the need for explicit state machines or height estimators.

Core Problem

Classical hopping controllers rely on complex finite state machines, manual tuning, and explicit estimation of jump phases (lift-off, touchdown), which are brittle and difficult to transfer to real hardware.

Why it matters:

Hopping allows robots to clear obstacles that wheeled or walking robots cannot, but the flight phase significantly increases control complexity due to lack of actuation authority.
Existing methods typically require heuristics for contact detection and height estimation, which fail if sensors are noisy or model dynamics are inaccurate.
Using PD controllers in RL (common practice) limits the exploitation of full system dynamics compared to direct torque control.

Concrete Example: A traditional hopping controller must detect 'touchdown' to switch from position control (flight) to force control (stance). If the height estimator drifts or contact sensors are noisy, the controller switches at the wrong time, causing the robot to crash or stumble.

Key Novelty

End-to-End Proprioceptive Torque Control for Hopping

Replaces finite state machines and PD loops with a single neural network policy that maps a short history of joint states directly to motor torques.
Achieves implicit phase detection (lift-off/touchdown) purely through proprioceptive history, removing the need for external contact sensors or explicit height estimators.
Uses an energy-shaping inspired reward function to encourage periodic hopping behavior without strictly prescribing trajectories.

Architecture

Comparison between the classical control loop and the proposed RL control loop.

Evaluation Highlights

Successfully transferred to real hardware (200 Hz control loop) zero-shot, without retuning the policy on the robot, achieving stable continuous hopping.
Demonstrated variable height control by tracking desired jump height commands (e.g., oscillating between 0.3 m and 0.45 m).
Learned to implicitly detect ground contact phases solely from joint positions and velocities, matching the behavior of state-machine baselines without explicit logic.

Breakthrough Assessment

7/10

Significant for demonstrating truly end-to-end torque control for a highly dynamic task (hopping) without state machines. The sim-to-real methodology is solid, though the task is limited to a 1D-constrained monoped.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (MDP) for continuous control of a monoped

Inputs: Proprioceptive state (joint positions and velocities) and the desired height, stacked over the last 3 time steps

Outputs: Normalized motor torques for 2 active joints

Pipeline Flow

Observation (Joint History)
Policy Network (MLP)
Action (Torque Command)

System Modules

Observation Buffer

Captures temporal context to allow implicit velocity/contact estimation

Model or implementation: FIFO Buffer
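
The FIFO behaviour described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class name `ObservationBuffer`, the per-step layout (2 joint positions, 2 joint velocities, 1 desired height) and the zero pre-fill are all assumptions made for the example.

```python
from collections import deque


class ObservationBuffer:
    """Fixed-length FIFO of the most recent per-step observations (hypothetical helper)."""

    def __init__(self, history_len=3, obs_dim=5):
        self.history_len = history_len
        self.obs_dim = obs_dim
        # Pre-fill with zeros so the policy input has a fixed size from step 0
        # (an assumption; other warm-up schemes are possible).
        self.frames = deque(
            [[0.0] * obs_dim for _ in range(history_len)], maxlen=history_len
        )

    def push(self, joint_pos, joint_vel, desired_height):
        """Append the newest frame; deque(maxlen=...) drops the oldest one."""
        frame = list(joint_pos) + list(joint_vel) + [desired_height]
        if len(frame) != self.obs_dim:
            raise ValueError("unexpected observation size")
        self.frames.append(frame)

    def flat(self):
        """Concatenate frames oldest-to-newest into one policy input vector."""
        return [value for frame in self.frames for value in frame]


buf = ObservationBuffer()
for step in range(4):
    buf.push([0.1 * step, 0.2 * step], [0.0, 0.0], 0.3)
obs = buf.flat()  # 3 frames x 5 values = 15 inputs; the frame from step 0 is gone
```

Stacking consecutive frames is what lets a feedforward policy see how the joint state changes across time steps, which is the information an implicit contact estimate would rely on.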

Policy Network

Maps states to torque actions

Model or implementation: MLP (256, 256, 128, 128) with ReLU
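
To make the layer sizes concrete, here is a dependency-free sketch of a forward pass through an MLP with hidden widths (256, 256, 128, 128) and ReLU activations. Several details are assumptions not stated in this summary: the 15-dimensional input (3 steps of 5 values), the tanh squashing used to obtain normalized torques, the random weights, and the `max_torque_nm` scaling value. It shows only a deterministic forward pass, not the stochastic SAC actor used in training.

```python
import math
import random

# Hidden widths come from the summary; input (15) and output (2) sizes are assumptions.
LAYER_SIZES = [15, 256, 256, 128, 128, 2]


def init_params(sizes, seed=0):
    """Random weights for illustration only; a trained policy would load its own."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def policy_forward(params, obs):
    """ReLU on hidden layers, tanh on the output so actions lie in [-1, 1]."""
    x = list(obs)
    for i, (weights, biases) in enumerate(params):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        is_last = i == len(params) - 1
        x = [math.tanh(v) for v in x] if is_last else [max(0.0, v) for v in x]
    return x


def to_torque(normalized, max_torque_nm=6.0):
    """Scale normalized actions to motor torques; the limit is a hypothetical value."""
    return [max_torque_nm * a for a in normalized]


params = init_params(LAYER_SIZES)
action = policy_forward(params, [0.0] * 15)  # one normalized torque per active joint
torques = to_torque(action)
```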

Novel Architectural Elements

Direct mapping from proprioception to torque for hopping, bypassing the standard 'PD-action' layer used in most legged RL

Modeling

Base Model: Feedforward MLP Policy (SAC)

Training Method: Soft Actor-Critic (SAC) with Generalized State Dependent Exploration (gSDE)

Objective Functions:

Purpose: Maximize hopping height via energy.
Formally: r_energy = w1 * (E_kin + E_pot)

Purpose: Track desired height.
Formally: r_track = w2 * exp(-|x - x_d|)

Purpose: Prevent shakiness.
Formally: r_smooth = -w3 * ||a_t - a_{t-1}||^2

Purpose: Safety limits.
Formally: r_limit (penalties for joint limits and velocity saturation)
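
The reward terms listed above could be combined as in the following sketch. The summary does not state how the terms are aggregated or what the weights are, so the plain sum, the default values of `w1` to `w4`, and the count-based form of `r_limit` are assumptions for illustration only.

```python
import math


def hopping_reward(e_kin, e_pot, height, desired_height, action, prev_action,
                   limit_violations=0, w1=0.01, w2=1.0, w3=0.1, w4=1.0):
    """Sum of the four reward terms; all weights are hypothetical placeholders."""
    # Energy term: rewards total mechanical energy (kinetic + potential).
    r_energy = w1 * (e_kin + e_pot)
    # Tracking term: equals w2 when height matches the command, decays with error.
    r_track = w2 * math.exp(-abs(height - desired_height))
    # Smoothness term: penalizes the squared change between consecutive actions.
    r_smooth = -w3 * sum((a - b) ** 2 for a, b in zip(action, prev_action))
    # Limit term: assumed here to be a fixed penalty per violated limit.
    r_limit = -w4 * limit_violations
    return r_energy + r_track + r_smooth + r_limit


# Perfect tracking with an unchanged action versus a sign-flipping action.
steady = hopping_reward(0.0, 0.0, 0.3, 0.3, [0.0, 0.0], [0.0, 0.0])
jerky = hopping_reward(0.0, 0.0, 0.3, 0.3, [1.0, -1.0], [-1.0, 1.0])
```

With these placeholder weights, the jerky case scores lower than the steady one solely because of the smoothness penalty, which is the behaviour the "prevent shakiness" term is meant to produce.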

Training Data:

Trained in MuJoCo simulation environment
Simulation parameters optimized to match real hardware data via CMA-ES
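
The idea of fitting simulation parameters to recorded data can be illustrated with a toy example. Note the substitutions: the optimizer below is a simple elitist Gaussian search, not CMA-ES (it has no covariance adaptation), and the one-degree-of-freedom `simulate` function stands in for MuJoCo. The parameter names, the "recorded" trajectory, and all numeric values are hypothetical.

```python
import random


def simulate(damping, friction, steps=50, dt=0.005):
    """Toy 1-DOF joint coasting from an initial velocity (stand-in for MuJoCo)."""
    vel, trajectory = 2.0, []
    for _ in range(steps):
        coulomb = friction if vel > 0 else -friction if vel < 0 else 0.0
        vel += dt * (-damping * vel - coulomb)
        trajectory.append(vel)
    return trajectory


def trajectory_error(params, recorded):
    """Mean squared error between simulated and recorded velocities."""
    simulated = simulate(*params)
    return sum((s - r) ** 2 for s, r in zip(simulated, recorded)) / len(recorded)


def fit_parameters(recorded, x0, sigma=0.5, population=16, generations=60, seed=0):
    """Elitist Gaussian search: keep a candidate only if it lowers the error."""
    rng = random.Random(seed)
    best, best_err = list(x0), trajectory_error(x0, recorded)
    for _ in range(generations):
        for _ in range(population):
            candidate = [max(0.0, p + rng.gauss(0.0, sigma)) for p in best]
            err = trajectory_error(candidate, recorded)
            if err < best_err:
                best, best_err = candidate, err
        sigma *= 0.95  # shrink the search radius over time
    return best, best_err


recorded = simulate(damping=1.5, friction=0.8)  # pretend this is a hardware log
initial = [0.2, 0.1]
initial_err = trajectory_error(initial, recorded)
fitted, fitted_err = fit_parameters(recorded, initial)
```

A real pipeline would replace `simulate` with a rollout of the full simulator and the search loop with an actual CMA-ES implementation; the objective (trajectory mismatch between simulation and hardware) keeps the same shape.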

Key Hyperparameters:

network_architecture: [256, 256, 128, 128]
activation: ReLU
control_frequency: 200 Hz
observation_history: 3 steps
gSDE_noise: Yes

Compute: Not reported in the paper

Comparison to Prior Work

vs. Energy Shaping: The RL controller is monolithic (no state machine) and commands torque directly, whereas the energy-shaping baseline uses impedance control plus explicit phase switching.
vs. Rudin et al. (2022) [Hopping quadruped]: Rudin et al. use PD control and height estimation; this work uses direct torque and no explicit height estimation.

Limitations

Tested on a monoped constrained to a vertical rail (1D hopping), not fully unconstrained 3D hopping.
Requires accurate system identification (CMA-ES optimization) for the simulation to match reality sufficiently for transfer.
No external perception (cameras), so the controller cannot handle uneven terrain or obstacles, although this is outside the paper's scope.

Reproducibility

Code: https://github.com/dfki-ric-underactuated-lab/hopping_leg

The code is publicly available. The simulation environment (MuJoCo) and robot hardware details (mjbots actuators, custom leg) are specified.

📊 Experiments & Results

Evaluation Setup

Sim-to-Real transfer on a custom 3-DOF hopping leg (2 active, 1 passive rail DOF).

Benchmarks:

Custom Hopping Leg Hardware (Variable height hopping) [New]

Metrics:

Tracking error (base height vs desired)
Stability (continuous hopping duration)
Sim-to-Real correlation
Statistical methodology: Not explicitly reported in the paper

Key Results

Hardware experiments confirm the controller can hop and track height commands.

Benchmark	Metric	Baseline	This Paper	Δ
Hardware Hopping	Phase Detection	Explicit Logic	Implicit	Not applicable
Hardware Hopping	Height Tracking	0.3 m	0.3 m - 0.45 m	Variable

Experiment Figures

The two-stage system identification optimization process using CMA-ES.

Plots of base height, velocity, and motor torques during real-world hopping.

Main Takeaways

Proprioceptive history (3 steps) is sufficient for a neural network to infer contact states and hopping phases.
Direct torque control via RL is feasible for highly dynamic hopping without PD stabilizers, provided simulation dynamics are tuned.
Optimizing simulation parameters (friction, damping, armature) via CMA-ES was critical for successful zero-shot transfer.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (SAC algorithm)
Robotic Dynamics (Torque control, Jacobians)
Sim-to-Real Transfer (System Identification)

Key Terms

Monoped: A one-legged robot used as a canonical system to study hopping and dynamic locomotion

Proprioceptive: Sensing internal state (joint angles, velocities) rather than external state (cameras, lidar, contact sensors)

PD Controller: Proportional-Derivative controller—a feedback loop that drives error to zero; often used in RL as an intermediate layer, but avoided here in favor of direct torque

SAC: Soft Actor-Critic—an off-policy RL algorithm that maximizes expected reward and policy entropy for robust exploration

CMA-ES: Covariance Matrix Adaptation Evolution Strategy—a derivative-free optimization algorithm used here to tune simulation parameters to match reality

Energy Shaping: A control strategy that regulates the total energy (kinetic + potential) of a system to achieve a desired behavior (like hopping height)

gSDE: Generalized State Dependent Exploration—an exploration noise strategy where noise is a function of state, leading to smoother actions than independent step noise