Evaluation Setup
End-to-end task execution using a library of 14 tools across perception, operation, logic, and creativity.
Benchmarks:
- GTA (Multimodal Tool Use) [New]
Metrics:
- Task Completion Rate (Success Rate)
- Statistical methodology: Not explicitly reported in the paper
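Task completion rate is read here as the percentage of tasks an agent finishes successfully. The sketch below illustrates that reading only. The function name `task_completion_rate` and the boolean outcome list are hypothetical, and this is not the paper's scoring script.

```python
def task_completion_rate(outcomes):
    """Return the percentage of tasks marked successful.

    `outcomes` is a hypothetical list of booleans, one per task
    (True = task completed). This is an assumed definition, not the
    benchmark's official evaluation code.
    """
    if not outcomes:
        raise ValueError("outcomes must contain at least one task")
    # True counts as 1 and False as 0, so sum() gives the number of successes.
    return 100.0 * sum(outcomes) / len(outcomes)


# Illustrative data only: 2 of 4 tasks completed.
example_rate = task_completion_rate([True, False, True, False])
print(example_rate)
```

An empty outcome list raises an error rather than returning 0, since a rate over zero tasks is undefined.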
Key Results
Overall performance on the GTA benchmark highlights a significant gap between current SOTA models and the requirements of real-world tool agents.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GTA | Task Completion Rate | 100.0 | 50.0 | -50.0 |
| GTA | Task Completion Rate | 100.0 | 25.0 | -75.0 |
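The Δ column appears to follow the convention "This Paper minus Baseline". A small check under that assumption, using only the values listed in the table (the `delta` helper is hypothetical):

```python
# Rows copied from the table: (baseline, this_paper, reported_delta).
rows = [
    (100.0, 50.0, -50.0),
    (100.0, 25.0, -75.0),
]


def delta(baseline, this_paper):
    # Assumed convention: delta = this paper's score minus the baseline score.
    return this_paper - baseline


# Verify that each reported delta matches the assumed convention.
for baseline, this_paper, reported in rows:
    assert delta(baseline, this_paper) == reported
```

A negative Δ therefore means the evaluated score falls below the baseline column.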
Main Takeaways
- Real-world queries with implicit steps are significantly harder than AI-generated explicit queries found in previous benchmarks.
- Multimodal context is a bottleneck; models struggle to integrate visual information into tool planning.
- A large performance gap remains: even the best model (GPT-4) fails more than half the time, suggesting current agents are not yet 'general' tool users.