Creative Robot Tool Use with Large Language Models

📝 Paper Summary

Multi-call tool use with flexible plan · Multi-task planning · Embodied AI / Robotics

RoboTool is a modular LLM-based system that enables robots to solve long-horizon tasks requiring creative tool use by identifying implicit physical constraints and generating executable Python code.

Core Problem

Robots struggle with tasks involving implicit physical constraints (e.g., reaching objects out of workspace, crossing wide gaps) that require creative tool use—improvising with available objects beyond their standard affordances.

Why it matters:

Traditional Task and Motion Planning (TAMP) relies on explicit optimization, which is computationally expensive and difficult to scale for complex, long-horizon tasks.
Existing LLM robotics methods often assume standard tool usage or static environments, failing when tasks require reasoning about physical properties like material, shape, or gap width to improvise solutions.
Creative tool use (using a surfboard as a bridge, or a hammer as a hook) is a hallmark of advanced intelligence lacking in standard robotic control systems.

Concrete Example: A quadrupedal robot needs to 'walk to the other sofa,' but a 0.4m gap exists between sofas, exceeding its 0.1m step limit. A standard planner fails because the gap constraint is implicit. RoboTool analyzes the scene, calculates the gap width, and decides to push a surfboard to bridge the gap.
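
The implicit constraint in this example reduces to comparing the gap against the robot's step limit. A minimal sketch using the numbers above; the function name `needs_bridge` is hypothetical and does not come from the paper:

```python
def needs_bridge(gap_width_m: float, step_limit_m: float) -> bool:
    """Return True when the gap is wider than the robot can step across."""
    return gap_width_m > step_limit_m


# Values from the example: a 0.4 m gap and a 0.1 m step limit.
print(needs_bridge(0.4, 0.1))
```

In RoboTool this comparison is not hard-coded; the LLM is expected to surface it from the scene description.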

Key Novelty

Modular LLM-based Creative Tool User (RoboTool)

Decomposes the planning process into four specialized LLM agents: Analyzer (identifies constraints), Planner (devises strategies), Calculator (computes parameters), and Coder (generates executable code).
Explicitly prompts an LLM to function as a 'Calculator' to derive numerical parameters (e.g., target coordinates for a push) based on object affordances, bridging high-level reasoning with low-level control.
Enables three distinct types of creativity: Tool Selection (choosing correct tools), Sequential Tool Use (multi-step plans), and Tool Manufacturing (assembling/modifying objects).

Architecture

The hierarchical architecture of RoboTool with its four key components.

Evaluation Highlights

Achieves a 100% success rate on the 'Sofa-Traversing' and 'Sofa-Climbing' tasks in simulation, compared to 0-10% for the 'Planner-Coder' baseline.
Outperforms the 'Coder' (Code-as-Policies style) baseline by a large margin across all 6 creative tasks; that baseline achieved near 0% success on most tasks.
Maintains high performance (0.7-0.9 success rates) in real-world experiments with a quadrupedal robot and robotic arm, despite perception noise.

Breakthrough Assessment

8/10

Significantly advances robotic reasoning by demonstrating zero-shot 'creative' behaviors (tool manufacturing/improvisation) using standard LLMs, solving problems traditional TAMP and direct coding methods fail at.

⚙️ Technical Details

Problem Definition

Setting: Hybrid discrete-continuous planning problem with environment and embodiment constraints

Inputs: Natural language description L = {Task L_T, Scene Description L_Q, Constraints L_C}

Outputs: Executable Python code τ((H, X), Π, L) that invokes parameterized skills Π with parameters X, where H is the plan skeleton

Pipeline Flow

Analyzer: Processes language input → Identifies key constraints/concepts
Planner: Input + Concepts → High-level plan skeleton
Calculator: Plan skeleton + Numerical data → Parameterized plan
Coder: Parameterized plan → Executable Python code
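
The four stages above can be sketched as a chain of prompts to a single model. This is an illustrative sketch only: the `LLM` callable, the prompt wording, and `echo_llm` are assumptions made here, not the prompts or API used in the paper.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: maps a prompt to a text completion


def run_pipeline(llm: LLM, description: str) -> str:
    """Chain the four stages in order, passing each stage's output to the next."""
    concepts = llm(f"[Analyzer] Identify key constraints.\n{description}")
    skeleton = llm(f"[Planner] Propose a plan skeleton.\n{description}\nConcepts: {concepts}")
    plan = llm(f"[Calculator] Compute skill parameters.\n{description}\nSkeleton: {skeleton}")
    return llm(f"[Coder] Write Python code.\nPlan: {plan}")


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model call: echoes the stage tag so the flow is visible."""
    return prompt.split("]")[0].lstrip("[") + " output"


print(run_pipeline(echo_llm, "Walk to the other sofa."))
```

Swapping `echo_llm` for a real model client would leave `run_pipeline` unchanged, which is the main appeal of keeping the stages as plain function calls.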

System Modules

Analyzer

Interpret natural language to discern key task-related concepts and physical constraints (e.g., calculating gap width)

Model or implementation: GPT-4

Planner

Generate comprehensive strategies based on language input and key concepts

Model or implementation: GPT-4

Calculator

Compute parameters (X) for each skill in the plan skeleton

Model or implementation: GPT-4
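
To make the Calculator's role concrete, here is the kind of arithmetic it might be asked to carry out for a bridging task. This is a one-dimensional sketch under assumed geometry; the function name, the coordinates, and the board length are all hypothetical.

```python
def bridge_target_x(edge_a_x: float, edge_b_x: float, board_length: float) -> float:
    """Return the x-coordinate at which to center a board so it spans the gap."""
    gap = abs(edge_b_x - edge_a_x)
    if board_length <= gap:
        raise ValueError("board is too short to span the gap")
    # Centering the board on the gap midpoint gives equal overlap on both sides.
    return (edge_a_x + edge_b_x) / 2.0


# Assumed layout: gap edges at x = 0.0 m and x = 0.4 m, board 1.0 m long.
print(bridge_target_x(0.0, 0.4, 1.0))
```

In the system described here the LLM performs this reasoning in text rather than calling a fixed function like this one.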

Coder

Translate the parameterized plan into executable Python code

Model or implementation: GPT-4
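
As a rough illustration of what Coder output could look like for the sofa example, the sketch below calls two stubbed skills. The skill names, signatures, and coordinates are hypothetical placeholders; the stubs only log their calls so the snippet runs without a robot.

```python
calls = []  # records skill invocations in order


def push_to_position(obj: str, target: tuple) -> None:
    """Hypothetical skill stub: push `obj` until it is centered at `target`."""
    calls.append(("push_to_position", obj, target))


def walk_to_position(target: tuple) -> None:
    """Hypothetical skill stub: walk the robot to `target`."""
    calls.append(("walk_to_position", target))


# Possible generated plan: bridge the gap first, then cross it.
push_to_position("surfboard", (0.2, 0.0))
walk_to_position((0.8, 0.0))
print(calls)
```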

Novel Architectural Elements

Separation of 'Analyzer' module specifically to identify implicit physical constraints before planning
Dedication of a 'Calculator' module solely for numerical reasoning/parameter estimation within the planning loop
Four-stage hierarchical pipeline (Analyze -> Plan -> Calculate -> Code) designed to bridge high-level semantic reasoning with low-level metric constraints

Modeling

Base Model: GPT-4

Compute: Not reported in the paper (Inference-only approach using API calls)

Comparison to Prior Work

vs. Code-as-Policies (CaP): RoboTool uses a modular pipeline (Analyzer/Calculator) to handle implicit constraints and numerical parameters, whereas CaP attempts direct code generation, which fails on creative tasks.
vs. Logic-Geometric Programming (TAMP): RoboTool does not require explicit optimization or formal logic definitions, offering more flexibility.
vs. VoxPoser: RoboTool relies on textual/numerical reasoning via LLM rather than learning visual value maps [not cited in paper as direct baseline, but mentioned in related work].

Limitations

Relies on existing perception APIs (e.g., OWL-ViT, AprilTags) to parse visual scenes into text descriptions; errors in perception propagate to the planner.
Open-loop code generation; while code can include checks, the planner itself doesn't actively replan based on real-time feedback during execution.
Performance depends heavily on the reasoning capabilities of the underlying LLM (GPT-4); weaker models may fail at the 'Analyzer' or 'Calculator' stages.

Reproducibility

Code: https://creative-robotool.github.io/

Project page provided (https://creative-robotool.github.io/). Prompts and prompt engineering details are described in Appendix D. Paper relies on GPT-4 API (closed source). Simulators used: Robosuite (robotic arm) and Unitree Go1 simulation.

📊 Experiments & Results

Evaluation Setup

6 custom tasks requiring creative tool use, evaluated in both Simulation and Real World.

Benchmarks:

Creative Tool Use Benchmark (Robotic manipulation and locomotion) [New]

Metrics:

Success Rate
Tool Use Error
Logical Error
Numerical Error
Statistical methodology: Averaged across 10 runs per task in simulation.
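
One way to tabulate these metrics from per-run outcomes is sketched below. It assumes each run receives exactly one label, which is a simplification made here and not necessarily how the paper assigns errors; the sample outcomes are invented for illustration.

```python
from collections import Counter

LABELS = ("success", "tool_use_error", "logical_error", "numerical_error")


def summarize(outcomes: list[str]) -> dict[str, float]:
    """Return the fraction of runs that fall under each outcome label."""
    counts = Counter(outcomes)
    n = len(outcomes)
    return {label: counts[label] / n for label in LABELS}


# Invented example: 10 runs of one task, 9 successes and 1 numerical error.
print(summarize(["success"] * 9 + ["numerical_error"]))
```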

Key Results

Simulation success rates show RoboTool consistently solving creative tasks where baselines fail.
Benchmark	Metric	Baseline	This Paper	Δ
Sofa-Traversing	Success Rate	0.1	1.0	+0.9
Milk-Reaching	Success Rate	0.0	0.9	+0.9
Can-Grasping	Success Rate	0.1	0.7	+0.6
Average (All 6 Tasks)	Success Rate	0.20	0.87	+0.67
Real-world evaluation validates sim-to-real transferability; here the comparison is RoboTool in simulation versus RoboTool in the real world.
Benchmark	Metric	Simulation	Real World	Δ
Average (All 6 Tasks)	Success Rate	0.87	0.77	-0.10
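
The Δ column is the second value minus the first. A short check of the simulation rows, with the values copied from the table above:

```python
# (task, baseline success rate, RoboTool success rate) from the simulation rows.
rows = [
    ("Sofa-Traversing", 0.1, 1.0),
    ("Milk-Reaching", 0.0, 0.9),
    ("Can-Grasping", 0.1, 0.7),
    ("Average (All 6 Tasks)", 0.20, 0.87),
]

# Round to two decimals to hide floating-point noise in the subtraction.
deltas = {name: round(ours - base, 2) for name, base, ours in rows}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```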

Experiment Figures

Visualization of the 6 benchmark tasks demonstrating different types of creative tool use.

Error breakdown and discriminative tool-use analysis.

Main Takeaways

RoboTool successfully exhibits three types of creative tool use: selection (choosing proper tools), sequential use (multi-step plans), and manufacturing (creating levers/hooks).
Ablation studies confirm 'Analyzer' is critical for identifying correct tools (reducing tool-use error) and 'Calculator' is critical for reducing numerical errors in physical interactions.
The 'Analyzer' module enables discriminative behavior: RoboTool avoids using tools when tasks are feasible without them (e.g., small gaps), whereas baselines over-use tools based on priors.

📚 Prerequisite Knowledge

Prerequisites

Task and Motion Planning (TAMP)
Large Language Models (LLMs)
Robotic Control Primitives (skills)
Affordance learning

Key Terms

TAMP: Task and Motion Planning—combining high-level discrete logic (what to do) with low-level continuous motion control (how to move)

Affordance: The qualities or properties of an object that define its possible uses or how it can be interacted with (e.g., a handle affords grasping)

Parameterized Skills: Pre-defined robotic functions that take arguments, such as 'move_to(position)' or 'close_gripper()'

Zero-shot: The ability of a model to perform a task without having seen explicit training examples for that specific task

Creative Tool Use: Using objects in unconventional ways (e.g., using a rock as a hammer) or modifying the environment to solve a problem

Code-as-Policies: A framework where LLMs generate executable code to control robots directly from natural language

Workspace: The physical 3D volume within which a robot arm can reach and manipulate objects