Evaluation Setup
Six custom tasks requiring creative tool use, evaluated in both simulation and the real world.
Benchmarks:
- Creative Tool Use Benchmark (Robotic manipulation and locomotion) [New]
Metrics:
- Success Rate
- Tool Use Error
- Logical Error
- Numerical Error
Statistical methodology: Results are averaged across 10 runs per task in simulation.
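The per-task averaging described above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it assumes each run ends in exactly one outcome (success or one of the three error categories), and that each metric is reported as a fraction of runs. The label strings, the `summarize_runs` helper, and the example data are all hypothetical.

```python
from collections import Counter

# Hypothetical outcome labels: "success" plus the three error categories
# listed under Metrics. Assumes each run receives exactly one label.
OUTCOMES = ("success", "tool_use_error", "logical_error", "numerical_error")


def summarize_runs(outcomes):
    """Return the fraction of runs that ended in each outcome."""
    if not outcomes:
        raise ValueError("need at least one run")
    unknown = set(outcomes) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    counts = Counter(outcomes)
    n = len(outcomes)
    return {label: counts[label] / n for label in OUTCOMES}


# Made-up example: 10 runs of a single task (not data from the paper).
example_runs = ["success"] * 7 + [
    "tool_use_error",
    "numerical_error",
    "numerical_error",
]
summary = summarize_runs(example_runs)
print(summary)
```

Under these assumptions the four fractions sum to 1 for each task, so the error categories account for every failed run.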
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Simulation success rates show RoboTool consistently solving creative tasks where baselines fail. | | | | |
| Sofa-Traversing | Success Rate | 0.1 | 1.0 | +0.9 |
| Milk-Reaching | Success Rate | 0.0 | 0.9 | +0.9 |
| Can-Grasping | Success Rate | 0.1 | 0.7 | +0.6 |
| Average (All 6 Tasks) | Success Rate | 0.20 | 0.87 | +0.67 |
| Real-world evaluation validates sim-to-real transferability. | | | | |
| Average (All 6 Tasks) | Success Rate | 0.87 | 0.77 | -0.10 |
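The Δ column is the absolute difference between the two success rates in each row. The short check below recomputes it from the values reported in the table. The `delta` helper is a hypothetical name, and rounding to two decimals is an assumption made here to suppress floating-point noise. The last row appears to compare the simulation average (0.87) with the real-world average (0.77), so its Δ is read here as a sim-to-real gap rather than a gain over a baseline.

```python
# (baseline, this paper) success rates copied from the table.
rows = {
    "Sofa-Traversing": (0.1, 1.0),
    "Milk-Reaching": (0.0, 0.9),
    "Can-Grasping": (0.1, 0.7),
    "Average (All 6 Tasks)": (0.20, 0.87),
}


def delta(before, after, ndigits=2):
    """Absolute change in success rate, rounded to avoid float noise."""
    return round(after - before, ndigits)


deltas = {name: delta(before, after) for name, (before, after) in rows.items()}

# Final table row: 0.87 versus 0.77 (interpreted as simulation vs. real world).
sim_to_real_gap = delta(0.87, 0.77)

for name, value in deltas.items():
    print(f"{name}: {value:+.2f}")
print(f"Sim-to-real gap: {sim_to_real_gap:+.2f}")
```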
Main Takeaways
- RoboTool successfully exhibits three types of creative tool use: selection (choosing proper tools), sequential use (multi-step plans), and manufacturing (creating levers/hooks).
- Ablation studies confirm that the 'Analyzer' is critical for identifying the correct tools (reducing tool-use errors) and that the 'Calculator' is critical for reducing numerical errors in physical interactions.
- The 'Analyzer' module enables discriminative behavior: RoboTool avoids using tools when tasks are feasible without them (e.g., small gaps), whereas baselines over-use tools based on priors.
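The discriminative behavior in the last takeaway can be illustrated with a toy feasibility check. This is only a sketch of the idea, not RoboTool's Analyzer: the `GapTask` type, its fields, and both policy functions are hypothetical, and the fixed-threshold rule is an assumption chosen to make the contrast concrete.

```python
from dataclasses import dataclass


@dataclass
class GapTask:
    gap_width_m: float   # hypothetical: width of the gap to cross, in meters
    max_stride_m: float  # hypothetical: widest gap the robot crosses unaided


def needs_tool(task: GapTask) -> bool:
    """Discriminative policy: use a tool only when the gap is too wide."""
    return task.gap_width_m > task.max_stride_m


def always_use_tool(task: GapTask) -> bool:
    """Caricature of a prior-driven policy that reaches for a tool regardless."""
    return True


small_gap = GapTask(gap_width_m=0.1, max_stride_m=0.3)
large_gap = GapTask(gap_width_m=0.6, max_stride_m=0.3)

for name, task in (("small gap", small_gap), ("large gap", large_gap)):
    print(name, "| discriminative:", needs_tool(task),
          "| prior-driven:", always_use_tool(task))
```

The two policies agree on the large gap and differ only on the small one, which is the case the takeaway highlights.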