Learning Generalizable Tool Use with Non-rigid Grasp-pose Registration

📝 Paper Summary

Robotic Manipulation, Imitation Learning from Single Demonstration, Tool Use, Grasp Generalization

This method enables robots to learn interactive tool-use policies from a single human demonstration by warping the demonstrated grasp to new tool shapes via non-rigid registration and using it to guide Reinforcement Learning.

Core Problem

Robotic tool use with multi-fingered hands has a high-dimensional action space and requires adapting grasps to diverse tool shapes, which typically demands prohibitive amounts of demonstration data.

Why it matters:

Standard Imitation Learning requires vast datasets covering every object variation, which is impractical for real-world deployment
Reinforcement Learning (RL) alone fails in high-dimensional dexterous manipulation tasks due to sample inefficiency and exploration difficulties
Rigidly transferring grasps (e.g., just copying wrist pose) often fails because new object shapes require different finger configurations to maintain functional contact

Concrete Example: When transferring a grasp from a standard hammer to a mallet with a thicker handle, simply copying the joint angles or wrist position causes the fingers to clip through the object or fail to make contact. This method warps the hand configuration so the fingers wrap correctly around the thicker handle.

Key Novelty

Latent Space Non-Rigid Registration for Grasp Transfer

Morphs a 'canonical' tool (with a known good grasp) to match the shape of a new, unseen tool using a deformation field learned from category-level variations
Transfers the grasping contact points (keypoints) along this deformation field, then solves an inverse kinematics problem to find a feasible hand pose that matches these new contact points
Uses this generalized grasp to initialize RL episodes (pre-grasp) and shape rewards, guiding the policy without forcing it to strictly mimic the demonstration
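
One way to picture the category-level deformation model in the first bullet is PCA over the deformation fields that map the canonical tool onto each training instance. The sketch below illustrates that reading only; it is not the authors' implementation. The function names (`learn_deformation_basis`, `encode`, `decode`) and the (instances, points, 3) array layout are assumptions made for the example.

```python
import numpy as np

def learn_deformation_basis(deformations, n_components):
    """PCA over per-instance deformation fields.

    deformations: (S, N, 3) array, one displacement per canonical point
    for each of S training instances (hypothetical layout).
    Returns the mean field (3N,), the basis (n_components, 3N) and the
    fraction of variance explained by each kept component.
    """
    n_instances = deformations.shape[0]
    flat = deformations.reshape(n_instances, -1)
    mean = flat.mean(axis=0)
    # SVD of the centered data gives the principal directions in Vt.
    _, s, vt = np.linalg.svd(flat - mean, full_matrices=False)
    basis = vt[:n_components]
    explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return mean, basis, explained

def encode(deformation, mean, basis):
    """Project one (N, 3) deformation field onto the latent shape parameters."""
    return basis @ (deformation.reshape(-1) - mean)

def decode(z, mean, basis):
    """Reconstruct an (N, 3) deformation field from latent shape parameters."""
    return (mean + z @ basis).reshape(-1, 3)
```

With 10 training instances per category, as listed under Training Data, at most 9 components carry variance after centering, so the latent vector is small compared with the raw point cloud.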

Architecture

Conceptual overview of the system: Demonstration -> Generalization -> RL Policy.

Evaluation Highlights

Achieved 96-97% success rate on 'Place mug' and 'Position drill' tasks using only one canonical demonstration per class
Zero-shot generalization to unseen tools (not in training set) achieved 67% success on 'Place mug' using partial point-cloud inputs
Proposed method reduces mean task-space error to ~0.7cm, significantly outperforming wrist-pose transfer (~2.5cm) and canonical grasp retention (~2.9cm)

Breakthrough Assessment

7/10

Strong contribution in combining geometric registration with RL for sample-efficient tool use. Successfully bridges the gap between rigid imitation and adaptive RL, though evaluation is limited to simulation.

⚙️ Technical Details

Problem Definition

Setting: Learning a continuous control policy π(s) to operate various instances of a tool category (Hammer, Drill, Mug) starting from a single demonstration on a canonical instance.

Inputs: Proprioceptive state (wrist pose, hand keypoints), low-dimensional tool observation (generalized demonstration pose, latent shape parameters), and task-specific goals.

Outputs: Action a_t representing desired changes to end-effector pose and joint positions of the robot hand (Schunk SIH).

Pipeline Flow

Perception (Point Cloud) → Latent Shape Optimization (CPD)
Grasp Transfer (Deformation Field) → Kinematic Regression
RL Policy Initialization (Pre-grasp) → Interactive Control

System Modules

Latent Shape Optimizer (Grasp Generalization)

Fit the canonical object shape to the observed target object point cloud

Model or implementation: Coherent Point Drift (CPD) restricted to a PCA-learned latent subspace
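
To make "CPD restricted to a latent subspace" concrete, the sketch below runs a simplified EM loop: the E-step computes soft correspondences between observed points and the warped canonical points, and the M-step solves a small regularized least-squares problem for the latent vector instead of a free-form deformation. This is a simplification under stated assumptions, not the paper's optimizer. It omits CPD's outlier term, and the names `fit_latent_shape`, `reg`, and `sigma2_init` are hypothetical.

```python
import numpy as np

def _sq_dists(a, b):
    """Pairwise squared distances between rows of a (M, 3) and b (N, 3)."""
    return np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)

def fit_latent_shape(canonical, basis, observed, n_iters=30, reg=1e-3,
                     sigma2_init=None):
    """Fit latent shape parameters z so canonical + basis^T z matches observed.

    canonical: (N, 3) canonical points; basis: (K, N, 3) deformation basis;
    observed: (M, 3) target point cloud with unknown correspondences.
    Returns (z, warped) where warped is the deformed canonical shape (N, 3).
    """
    n_latent = basis.shape[0]
    B = basis.reshape(n_latent, -1).T              # (3N, K)
    z = np.zeros(n_latent)
    warped = canonical.astype(float).copy()
    d2 = _sq_dists(observed, warped)
    sigma2 = d2.mean() / 3.0 if sigma2_init is None else float(sigma2_init)
    for _ in range(n_iters):
        # E-step: posterior that observed point m was generated by model point n.
        logp = -d2 / (2.0 * sigma2)
        logp -= logp.max(axis=1, keepdims=True)    # numerical stability
        P = np.exp(logp)
        P /= P.sum(axis=1, keepdims=True)
        # M-step: weighted, ridge-regularized least squares in the latent space.
        p_n = P.sum(axis=0)                        # total weight per model point
        rhs_points = P.T @ observed - p_n[:, None] * canonical
        A = B.T @ (np.repeat(p_n, 3)[:, None] * B) + reg * np.eye(n_latent)
        z = np.linalg.solve(A, B.T @ rhs_points.reshape(-1))
        warped = canonical + (B @ z).reshape(-1, 3)
        # Update the Gaussian variance from the weighted residuals.
        d2 = _sq_dists(observed, warped)
        sigma2 = max(np.sum(P * d2) / (3.0 * P.sum()), 1e-8)
    return z, warped
```

Because only the K latent coordinates are optimized, the fit cannot leave the span of shapes seen in training, which is the source of the robustness to partial point clouds and of the out-of-distribution limitation noted later.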

Grasp Transfer & Regression (Grasp Generalization)

Map the demonstration grasp keypoints to the new object and find a feasible hand configuration

Model or implementation: Optimization-based Inverse Kinematics
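
The two steps of this module can be sketched as follows: grasp keypoints are first moved along the deformation field, then joint angles are adjusted until the hand's keypoints reach the warped targets. This is an illustration under assumptions, not the paper's solver. `warp_keypoints` interpolates displacements with a Gaussian kernel, `solve_ik` is a generic damped Gauss-Newton loop with a finite-difference Jacobian, and `planar_finger_fk` is a toy two-link finger standing in for the Schunk SIH kinematics. All three names are hypothetical.

```python
import numpy as np

def warp_keypoints(keypoints, canonical, warped, bandwidth=0.02):
    """Move keypoints (K, 3) by a Gaussian-weighted average of the
    displacements of nearby canonical points (N, 3) -> warped (N, 3)."""
    displacement = warped - canonical
    d2 = np.sum((keypoints[:, None, :] - canonical[None, :, :]) ** 2, axis=-1)
    logw = -d2 / (2.0 * bandwidth ** 2)
    logw -= logw.max(axis=1, keepdims=True)        # numerical stability
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)
    return keypoints + w @ displacement

def solve_ik(fk, q0, targets, n_iters=100, damping=1e-3, eps=1e-6):
    """Damped Gauss-Newton IK: find q so that fk(q) matches targets.

    fk maps joint angles to an array of keypoint positions.
    """
    q = np.asarray(q0, dtype=float).copy()
    target = np.asarray(targets, dtype=float).reshape(-1)
    for _ in range(n_iters):
        residual = target - fk(q).reshape(-1)
        # Central finite-difference Jacobian of the keypoints w.r.t. joints.
        J = np.empty((residual.size, q.size))
        for j in range(q.size):
            dq = np.zeros_like(q)
            dq[j] = eps
            J[:, j] = (fk(q + dq).reshape(-1) - fk(q - dq).reshape(-1)) / (2 * eps)
        q += np.linalg.solve(J.T @ J + damping * np.eye(q.size), J.T @ residual)
    return q

def planar_finger_fk(q, lengths=(1.0, 0.8)):
    """Toy 2-link planar finger: returns the middle-joint and fingertip positions."""
    p1 = lengths[0] * np.array([np.cos(q[0]), np.sin(q[0])])
    p2 = p1 + lengths[1] * np.array([np.cos(q[0] + q[1]), np.sin(q[0] + q[1])])
    return np.stack([p1, p2])
```

Matching several keypoints per finger, rather than only the wrist pose, is what lets the fingers re-wrap around a thicker or thinner handle.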

Interactive Policy

Control the robot hand to perform the tool-use task

Model or implementation: MLP Policy trained via PPO

Novel Architectural Elements

Two-stage grasp generalization: First warping keypoints via latent non-rigid registration, then regressing kinematic configuration to match warped keypoints
Integration of generalized grasps as both episode initialization (pre-grasp) and dense reward shaping signal for RL

Modeling

Base Model: MLP Policy (sizes not specified)

Training Method: Reinforcement Learning with PPO (Proximal Policy Optimization)

Objective Functions:

Purpose: Encourage reaching the demonstrated grasp pose.
Formally: r_pose_matching ~ exp(-α ||pos_diff|| - β angle_diff)

Purpose: Task completion.
Formally: r_success = indicator(task_complete)

Purpose: Keypoint matching (for hammering).
Formally: r_kp ~ (ε + δ_k)^(-1)
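
The three reward terms above can be written out as a short sketch. The functional forms follow the formulas as summarized here, but the constants are not given, so `alpha`, `beta`, and `epsilon` are placeholder values. Summing the keypoint term over keypoints and the `w_kp` weight are also assumptions. Only the pose-matching (25.0) and success (100.0) weights come from the listed hyperparameters.

```python
import numpy as np

def rotation_angle(R_a, R_b):
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    cos_theta = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def pose_matching_reward(pos, R, pos_demo, R_demo, alpha=10.0, beta=1.0):
    """exp(-alpha * position error - beta * angle error); alpha, beta are placeholders."""
    pos_diff = np.linalg.norm(np.asarray(pos) - np.asarray(pos_demo))
    return float(np.exp(-alpha * pos_diff - beta * rotation_angle(R, R_demo)))

def success_reward(task_complete):
    """Indicator reward for task completion."""
    return 1.0 if task_complete else 0.0

def keypoint_reward(keypoints, keypoints_target, epsilon=0.1):
    """Sum over keypoints of 1 / (epsilon + delta_k); the sum is an assumption."""
    delta = np.linalg.norm(np.asarray(keypoints) - np.asarray(keypoints_target), axis=-1)
    return float(np.sum(1.0 / (epsilon + delta)))

def total_reward(r_pose, r_success, r_kp=0.0,
                 w_pose=25.0, w_success=100.0, w_kp=0.0):
    """Weighted sum; w_kp is hypothetical because its value is not listed."""
    return w_pose * r_pose + w_success * r_success + w_kp * r_kp
```

The pose term is bounded in (0, 1] and decays smoothly, so it gives a dense signal early in training, while the much larger success weight dominates once the task is actually completed.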

Training Data:

1 canonical instance per category
10 training instances per category
3 test instances per category
Synthetic point clouds for training, depth camera simulation for testing

Key Hyperparameters:

parallel_agents: 16,384
total_steps: 134 million
control_frequency: 30 Hz
reward_weights: pose_matching = 25.0, success = 100.0

Compute: 3 hours on single NVIDIA A6000 GPU (Isaac Gym simulation)
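
The listed hyperparameters imply a rough experience budget, worked out below. The calculation assumes that total_steps counts environment steps summed over all parallel agents and that each step is one 30 Hz control tick; neither is stated explicitly here, so the figures are estimates.

```python
# Values taken from the hyperparameter and compute listings above.
parallel_agents = 16_384
total_steps = 134e6          # assumed to be summed over all agents
control_hz = 30.0            # assumed: one environment step per control tick
wall_clock_hours = 3.0

steps_per_agent = total_steps / parallel_agents
sim_seconds_per_agent = steps_per_agent / control_hz
total_sim_hours = total_steps / control_hz / 3600.0
steps_per_wall_second = total_steps / (wall_clock_hours * 3600.0)

print(f"steps per agent:         {steps_per_agent:,.0f}")
print(f"sim time per agent (s):  {sim_seconds_per_agent:,.1f}")
print(f"total sim time (h):      {total_sim_hours:,.0f}")
print(f"throughput (steps/s):    {steps_per_wall_second:,.0f}")
```

Under these assumptions each agent sees roughly 8,200 steps (about 4.5 minutes of simulated time), and the run as a whole covers on the order of 1,200 simulated hours in 3 wall-clock hours.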

Comparison to Prior Work

vs. Rigid Registration: Can handle intra-class shape variations (e.g., curved vs straight handles) via non-rigid warping
vs. Demonstration-augmented RL (DAPG, Rajeswaran et al.): Uses a single demonstration plus grasp generalization rather than large demonstration datasets
vs. Rodriguez et al. [9]: Optimizes multi-fingered hand configuration in task-space (keypoints) rather than just warping the wrist trajectory/pose
vs. Neural Descriptor Fields [36] (not cited in paper): Uses explicit geometric registration (CPD) over latent shape space rather than learned neural field descriptors for correspondence

Limitations

Relies on category-level shape prior (PCA basis); may fail for objects significantly outside the training distribution
Requires sequential optimization (registration, then IK) at inference time, taking approximately 3 seconds, which may be too slow for real-time control loops
Tested only in simulation; real-world transfer challenges (perception noise, physics gaps) acknowledged but not addressed

Reproducibility

Project website: https://maltemosbach.github.io/generalizable_tool_use

Visualizations and videos available at project website. Code repository not linked in paper. Simulation uses NVIDIA Isaac Gym. Specific network architecture dimensions (MLP depth/width) not detailed.

📊 Experiments & Results

Evaluation Setup

Simulated tool use tasks in NVIDIA Isaac Gym

Benchmarks:

Place mug (Pick and place) [New]
Position drill (Dexterous manipulation/reorientation) [New]
Drive nail (Tool use with impact dynamics) [New]

Metrics:

Task-space distance (grasp quality metric)
Success rate (grasping)
Success rate (full task completion)
Statistical methodology: Means and standard deviations reported for distance metrics

Key Results

Grasp generation quality assessment comparing the proposed method against baselines in terms of task-space keypoint distance.

Benchmark	Metric	Baseline	This Paper	Δ
Mugs (Task-space distance)	Mean distance (cm)	2.00	0.68	-1.32
Hammers (Task-space distance)	Mean distance (cm)	2.64	0.78	-1.86

RL training performance evaluating success rates of the full policy on training instances.

Benchmark	Metric	Baseline	This Paper	Δ
Position drill	Success rate (full task)	0.66	0.96	+0.30
Drive nail	Success rate (full task)	0.00	0.65	+0.65

Zero-shot transfer results on unseen test objects using partial point clouds.

Benchmark	Metric	Baseline	This Paper	Δ
Place mug (Test set)	Success rate	Not reported in the paper	0.67	Not reported in the paper

Experiment Figures

Visualization of the registration and regression process. Fig 4 shows the canonical mesh warping to match the observation. Fig 5 shows the robot hand optimizing its configuration to match the warped keypoints.

Main Takeaways

Non-rigid registration significantly outperforms rigid baselines (Wrist Pose, Canonical Grasp) in preserving contact relationships across shape variations
Pre-grasp initialization is crucial; without it (w/o PG), complex tasks like nail hammering fail completely (0% success)
The method enables zero-shot transfer to unseen objects by fitting the latent shape model to partial point clouds, though performance drops compared to training instances (e.g., 97% -> 67% on mugs)

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO)
Forward/Inverse Kinematics
Point Set Registration (Coherent Point Drift)
Principal Component Analysis (PCA)

Key Terms

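CPD: Coherent Point Drift, an algorithm that aligns one point cloud with another (rigidly or non-rigidly) by treating the points of one cloud as the centroids of a Gaussian Mixture Model and fitting them to the other cloud, while constraining the centroids to move coherently as a group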

non-rigid registration: Aligning two 3D shapes by allowing one to deform/warp (stretch, bend) rather than just rotating and translating

task-space vectors: Vectors representing specific keypoints on the robot hand (fingertips, palm) used to define the contact relationship with the object

pre-grasp pose: A hand configuration positioned near the object but not yet touching it, used as a starting point to simplify the exploration problem for RL

Schunk SIH: A specific type of anthropomorphic (human-like) robotic hand with 5 fingers and 11 degrees of freedom

latent shape parameters: A low-dimensional vector produced by PCA that captures the primary ways an object's shape varies within its category (e.g., handle length, head size)

PPO: Proximal Policy Optimization, a reinforcement learning algorithm used here to train the interactive policy