InterReal: A Unified Physics-Based Imitation Framework for Learning Human-Object Interaction Skills

📝 Paper Summary

Humanoid Robot Control
Physics-based Motion Imitation

InterReal enables humanoid robots to learn complex interactive skills like box-picking by combining motion augmentation to handle object disturbances and an automatic meta-learner to dynamically tune reward weights.

Core Problem

Existing humanoid controllers excel at locomotion (walking/dancing) but fail at precise Human-Object Interaction (HOI) because they lack fine-grained contact modeling and robustness to real-world sensor noise.

Why it matters:

Current humanoid robots are limited to non-interactive tasks or rely on teleoperation, restricting their autonomy in industrial applications.
Manually tuning rewards for complex interaction tasks is notoriously difficult, as objectives (balance vs. tracking vs. interaction) conflict and shift across motion phases.
Small disturbances in object position perception in the real world can cause standard motion-imitation policies to collapse or fail to grasp objects.

Concrete Example: When a robot attempts to pick up a box, a small error in the perceived object position (e.g., from sensor noise) causes the hands to miss the grasp, leading to failure. Standard policies trained on perfect data cannot adjust, while InterReal's augmented training handles these offsets.

Key Novelty

InterReal: Physics-based HOI framework with Auto-Reward Learning

Augments training data by artificially perturbing object positions and solving Inverse Kinematics (IK) to generate valid contact-preserving references, forcing the policy to learn robustness to spatial noise.
Uses a bi-level optimization where a 'meta-policy' treats the main RL training as an environment, dynamically adjusting reward weights based on tracking errors (e.g., prioritizing balance early, interaction later).

Architecture

Overview of InterReal framework including motion preprocessing, bi-level training loop, and deployment.

Evaluation Highlights

Achieves highest task success rates on both Box-Picking and Box-Pushing tasks compared to baselines like InterMimic and PHC.
Reports the lowest tracking error on key metrics (DOF angles, object positions) in simulation, indicating more precise motion and object tracking than the baselines.
Successful real-world deployment on the Unitree G1 robot using FoundationPose for object tracking, validating sim-to-real robustness.

Breakthrough Assessment

7/10

Solid advancement in humanoid HOI by addressing two critical bottlenecks: reward engineering and contact robustness. Real-world validation on G1 is a strong plus, though the core algorithms (meta-learning for rewards, IK augmentation) are evolutionary rather than revolutionary.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (MDP) for physics-based motion tracking

Inputs: State s_t = [proprioception, object features, interaction graph, task phase]

Outputs: Action a_t (target joint positions for PD controllers)
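
The state layout above can be made concrete with a small sketch. It assumes, following the glossary entry in this summary, that the interaction graph is a set of distances between robot key points and object key points. The function names, the key-point choices, the feature ordering, and the one-hot phase encoding are all illustrative assumptions, not details confirmed by the paper.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def interaction_graph(robot_pts: Sequence[Point],
                      object_pts: Sequence[Point]) -> List[float]:
    """Pairwise Euclidean distances between robot and object key points."""
    return [math.dist(r, o) for r in robot_pts for o in object_pts]


def build_state(proprio: Sequence[float],
                object_feats: Sequence[float],
                robot_pts: Sequence[Point],
                object_pts: Sequence[Point],
                phase: int,
                num_phases: int) -> List[float]:
    """Concatenate the four state components into one flat vector s_t."""
    # Assumed encoding: the task phase is a one-hot vector.
    phase_one_hot = [1.0 if i == phase else 0.0 for i in range(num_phases)]
    graph = interaction_graph(robot_pts, object_pts)
    return list(proprio) + list(object_feats) + graph + phase_one_hot


# Hypothetical example: two hand key points, one box key point, three phases.
hands = [(0.3, 0.2, 1.0), (0.3, -0.2, 1.0)]
box = [(0.5, 0.0, 0.8)]
s_t = build_state([0.0] * 4, [0.5, 0.0, 0.8], hands, box, phase=1, num_phases=3)
```

The policy would consume `s_t` and emit one target position per actuated joint; the dimensions used here are placeholders.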

Pipeline Flow

Perception (FoundationPose)
State Estimation (Proprioception + Object)
Policy Inference (Actor Network)
Low-level Control (PD Controller)
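
One way to picture the four stages is as a single control step that wires them together. The sketch below passes each stage in as a callable, so nothing here depends on a real FoundationPose or robot API; all function names and signatures are hypothetical, and the state is simplified to proprioception plus object pose (the interaction graph and task phase are omitted).

```python
from typing import Callable, List

Vector = List[float]


def control_step(
    estimate_object_pose: Callable[[], Vector],      # perception stage
    read_proprioception: Callable[[], Vector],       # joint state from the robot
    policy: Callable[[Vector], Vector],              # actor: state -> joint targets
    pd_control: Callable[[Vector, Vector], Vector],  # (targets, proprio) -> torques
) -> Vector:
    """Run perception, state estimation, policy inference and low-level control once."""
    object_pose = estimate_object_pose()
    proprio = read_proprioception()
    state = proprio + object_pose        # simplified state: concatenation only
    targets = policy(state)
    return pd_control(targets, proprio)
```

Injecting the stages as callables keeps the step testable with stubs and makes it easy to swap the simulator for real hardware, which is the separation the pipeline list implies.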

System Modules

Object Perception

Estimate 6D object pose from visual input

Model or implementation: FoundationPose

Policy Network (Actor)

Generate target joint positions based on current state

Model or implementation: MLP (Multi-Layer Perceptron)

PD Controller

Convert target joint positions into motor torques

Model or implementation: Standard PD logic
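
A minimal sketch of the PD law that turns target joint positions into torques is shown below. It assumes a zero target velocity and a symmetric torque limit; the gains and the limit are placeholder values, not figures from the paper.

```python
from typing import List, Sequence


def pd_torques(q_target: Sequence[float],
               q: Sequence[float],
               qd: Sequence[float],
               kp: Sequence[float],
               kd: Sequence[float],
               torque_limit: float) -> List[float]:
    """Per-joint PD law: tau = kp * (q_target - q) - kd * qd, clipped to the limit."""
    torques = []
    for qt, qi, qdi, kpi, kdi in zip(q_target, q, qd, kp, kd):
        tau = kpi * (qt - qi) - kdi * qdi
        # Clipping is an assumption: real actuators saturate at some torque.
        torques.append(max(-torque_limit, min(torque_limit, tau)))
    return torques
```

The proportional term pulls each joint toward the policy's target, while the derivative term damps joint velocity so the motion does not oscillate around it.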

Novel Architectural Elements

Bi-level optimization loop: An outer 'Meta-Policy' (SAC) dynamically outputs reward weights for the inner 'Control Policy' (PPO) training based on tracking error feedback.
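
The control flow of such a bi-level loop can be sketched structurally. In the sketch below the SAC meta-policy and the PPO inner training are replaced by injected callables, so this shows only how the two levels exchange information, not either algorithm. Summing the error terms into one scalar, and every name used, are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Weights = Dict[str, float]
Errors = Dict[str, float]


def bilevel_training(
    propose_weights: Callable[[Errors], Weights],   # stand-in for the meta-policy
    train_inner: Callable[[Weights], Errors],       # stand-in for a chunk of PPO training
    update_meta: Callable[[Errors, Weights, float, Errors], None],  # stand-in for the SAC update
    initial_errors: Errors,
    outer_steps: int,
) -> List[Tuple[Weights, float]]:
    """Alternate between choosing reward weights and training the control policy."""
    history = []
    errors = dict(initial_errors)
    for _ in range(outer_steps):
        weights = propose_weights(errors)        # outer action: reward weights
        new_errors = train_inner(weights)        # inner loop: train, then measure errors
        # Meta-reward: negative change in (summed) tracking error.
        meta_reward = sum(errors.values()) - sum(new_errors.values())
        update_meta(errors, weights, meta_reward, new_errors)
        history.append((weights, meta_reward))
        errors = new_errors
    return history
```

From the meta-policy's point of view, the whole inner training run is the environment: its observation is the current tracking error, its action is a weight vector, and its reward is how much the error dropped.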

Modeling

Base Model: Custom MLP policies for both PPO (inner) and SAC (outer)

Training Method: Nested RL: Inner loop PPO (Control) + Outer loop SAC (Reward Learning)

Objective Functions:

Purpose: Optimize control policy to maximize weighted sum of tracking/interaction rewards.

Formally: PPO objective maximizing E[sum(gamma^t * f_t(Theta))]

Purpose: Optimize meta-policy to find reward weights that minimize tracking error.

Formally: SAC objective maximizing E[G_t + alpha * H], where G_t is the negative change in tracking errors and H is the policy entropy
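
Written out in LaTeX, the two objectives might look as follows. The decomposition of f_t(Theta) into weights theta_i and individual reward terms r_{i,t}, the aggregate tracking error e_t, and the symbol mu for the meta-policy are notational assumptions introduced here; the paper's exact notation may differ.

```latex
% Inner loop (PPO): discounted sum of the weighted reward f_t(\Theta)
J_{\text{inner}}(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t} \gamma^{t} f_t(\Theta) \right],
\qquad f_t(\Theta) = \sum_{i} \theta_i \, r_{i,t}

% Outer loop (SAC): entropy-regularised objective on the change in tracking error
J_{\text{outer}}(\mu) = \mathbb{E}_{\mu}\!\left[ G_t + \alpha \, \mathcal{H}\big(\mu(\cdot \mid e_t)\big) \right],
\qquad G_t = -\left( e_{t+1} - e_t \right)
```

Under this reading, G_t is positive whenever the tracking error falls after the meta-policy's chosen weights are applied, and alpha trades that improvement off against keeping the weight distribution exploratory.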

Adaptation: Sim-to-real via Domain Randomization and Asymmetric Actor-Critic
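
The two sim-to-real ingredients can be illustrated together: physics and sensing parameters are resampled per episode, and the actor and critic receive different observations. The parameter names and ranges below are invented for illustration; the summary does not report which quantities the authors randomize or by how much.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class EpisodeParams:
    friction: float
    object_mass_scale: float
    obs_noise_std: float


def sample_episode_params(rng: random.Random) -> EpisodeParams:
    """Domain randomization: draw new parameters for each episode (ranges are made up)."""
    return EpisodeParams(
        friction=rng.uniform(0.5, 1.25),
        object_mass_scale=rng.uniform(0.8, 1.2),
        obs_noise_std=rng.uniform(0.0, 0.02),
    )


def split_observations(true_state: Sequence[float],
                       params: EpisodeParams,
                       rng: random.Random) -> Tuple[List[float], List[float]]:
    """Asymmetric actor-critic: noisy view for the actor, privileged view for the critic."""
    actor_obs = [x + rng.gauss(0.0, params.obs_noise_std) for x in true_state]
    # The critic is only used during training, so it may see simulator-only quantities.
    critic_obs = list(true_state) + [params.friction, params.object_mass_scale]
    return actor_obs, critic_obs
```

Because only the actor is deployed, it must work from observations that are available (and noisy) on hardware, whereas the critic can exploit clean simulator state to produce better value estimates during training.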

Training Data:

Retargeted SMPL motion data
Augmented via IK solver (Ipopt) with XY-plane object offsets
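
The augmentation idea (offset the object, then re-solve IK so the contact is preserved) can be shown on a toy problem. The sketch below swaps the paper's Ipopt-based whole-body IK for the closed-form solution of a planar two-link arm, purely so the example stays self-contained; link lengths, offset range, and function names are hypothetical.

```python
import math
import random
from typing import List, Optional, Tuple

Joints = Tuple[float, float]
Point2 = Tuple[float, float]


def two_link_ik(x: float, y: float, l1: float, l2: float) -> Optional[Joints]:
    """Closed-form IK for a planar 2-link arm; None if the target is out of reach."""
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(c2) > 1.0:
        return None
    q2 = math.acos(c2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2


def two_link_fk(q1: float, q2: float, l1: float, l2: float) -> Point2:
    """Forward kinematics, used to check that the hand still reaches the contact."""
    return (l1 * math.cos(q1) + l2 * math.cos(q1 + q2),
            l1 * math.sin(q1) + l2 * math.sin(q1 + q2))


def augment_reference(contact_xy: Point2, l1: float, l2: float,
                      max_offset: float, num_samples: int,
                      seed: int = 0) -> List[Tuple[Point2, Joints]]:
    """Shift the object's contact point in the XY plane and re-solve IK for each shift."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        dx = rng.uniform(-max_offset, max_offset)
        dy = rng.uniform(-max_offset, max_offset)
        target = (contact_xy[0] + dx, contact_xy[1] + dy)
        joints = two_link_ik(target[0], target[1], l1, l2)
        if joints is not None:       # keep only references that preserve contact
            samples.append((target, joints))
    return samples
```

Each returned pair is a perturbed object position together with a joint configuration that still reaches it, which is the kind of contact-preserving reference the policy is then trained to track.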

Key Hyperparameters:

meta_learning_action_entropy_alpha: 0.1
inner_loop_discount_factor_gamma: Not explicitly reported in the paper
outer_loop_algorithm: Soft Actor-Critic (SAC)

Compute: Not reported in the paper

Comparison to Prior Work

vs. InterMimic: InterReal targets real robot hardware (G1) with realistic physics/noise, whereas InterMimic focuses on simplified animation physics.
vs. PHC: InterReal adds explicit object interaction modeling and reward learning, whereas PHC focuses on body motion tracking.
vs. Teleoperation: InterReal is an autonomous policy learned from data, whereas teleoperation relies on human real-time control.

Limitations

Dependency on accurate object pose estimation (FoundationPose) in real world; failure in perception leads to task failure.
Requires high-quality motion capture data for retargeting; cannot learn skills from scratch without reference motions.
Computational cost of bi-level optimization (meta-learning loop) is higher than standard RL, though authors claim overhead is minimal.

Reproducibility

Code availability is not stated in the paper. Real-world experiments use the Unitree G1 robot, with FoundationPose for object tracking. Simulation uses IsaacGym, and retargeting uses a modified SMPL method.

📊 Experiments & Results

Evaluation Setup

Physics simulation (IsaacGym/MuJoCo) and real-world robot (Unitree G1)

Benchmarks:

Box-Picking (Human-Object Interaction) [New]
Box-Pushing (Human-Object Interaction) [New]

Metrics:

Tracking Error (Joints, Object, Links)
Task Success Rate
Grasp Success Rate (implied via object tracking)
Statistical methodology: Not explicitly reported in the paper
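
Since the summary does not give the exact metric definitions, the following is only one plausible way to compute a tracking error and a task success rate. The mean-absolute-error form and the distance threshold for success are assumptions, not the paper's definitions.

```python
from typing import Sequence


def mean_tracking_error(reference: Sequence[Sequence[float]],
                        actual: Sequence[Sequence[float]]) -> float:
    """Mean absolute error over all frames and dimensions of two trajectories."""
    total, count = 0.0, 0
    for ref_frame, act_frame in zip(reference, actual):
        for r, a in zip(ref_frame, act_frame):
            total += abs(r - a)
            count += 1
    return total / count if count else 0.0


def success_rate(final_object_errors: Sequence[float], threshold: float) -> float:
    """Fraction of episodes whose final object error is within the threshold."""
    if not final_object_errors:
        return 0.0
    hits = sum(1 for e in final_object_errors if e <= threshold)
    return hits / len(final_object_errors)
```

The same `mean_tracking_error` function could be applied separately to joint angles, object positions, and link positions to obtain the three tracking metrics listed above.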

Key Results

InterReal outperforms baselines on tracking accuracy and task success for interactive tasks.

Benchmark	Metric	Baseline	This Paper	Δ
Box-Picking (Simulation)	Task Success Rate	Not reported in the paper	Highest	Positive
Box-Pushing (Simulation)	Tracking Error	Not reported in the paper	Lowest	Positive

Experiment Figures

Schematic of the bi-level learning process.

Main Takeaways

InterReal achieves superior tracking accuracy and success rates compared to InterMimic and PHC baselines on HOI tasks.
Automatic reward learning (Meta-Policy) is more effective than fixed heuristic rewards, adapting to different task phases (e.g., prioritizing balance vs. interaction).
Motion augmentation significantly improves robustness against object position perturbations, which is critical for the sim-to-real transfer demonstrated on the Unitree G1.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, SAC)
Kinematics / Inverse Kinematics (IK)
Sim-to-Real transfer
Meta-learning concepts

Key Terms

HOI: Human-Object Interaction—tasks where a robot or human actively manipulates an object (e.g., carrying a box)

PPO: Proximal Policy Optimization—a popular reinforcement learning algorithm used here for the inner-loop control policy

SAC: Soft Actor-Critic—an entropy-regularized reinforcement learning algorithm used here for the outer-loop meta-policy to learn reward weights

IK: Inverse Kinematics—a mathematical process to calculate the joint angles needed to position a robot's end-effector (hand) at a specific target point

SMPL: Skinned Multi-Person Linear model—a standard parametric model for representing human body shape and pose

IsaacGym: A high-performance physics simulator from NVIDIA used for parallel reinforcement learning training

Unitree G1: A specific commercial humanoid robot hardware platform used for real-world validation

FoundationPose: A computer vision method for estimating the 6D pose (position and orientation) of objects from images/depth data

PD controller: Proportional-Derivative controller—a feedback loop mechanism widely used in control systems to minimize error

Domain Randomization: A technique to improve sim-to-real transfer by varying simulation parameters (friction, mass) during training

Interaction Graph: A feature representation that encodes the distances between key points on the robot and the object to guide contact learning

MDP: Markov Decision Process—a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker