GTA: A Benchmark for General Tool Agents

📝 Paper Summary

Agentic AI, Tool Learning, Evaluation Benchmarks

GTA is a benchmark evaluating tool agents on human-written, multimodal tasks where tool-use steps are implicit, revealing that current LLMs struggle significantly with real-world planning.

Core Problem

Existing tool-use benchmarks rely on AI-generated queries with explicit step-by-step instructions, dummy tools, and text-only contexts, which fail to test an agent's ability to reason and plan in complex real-world scenarios.

Why it matters:

AI-generated queries often explicitly hint at tool usage (e.g., 'Use the search tool to find...'), bypassing the critical reasoning phase required in reality.
Real-world user interaction is multimodal (images, screenshots, spatial scenes), but most benchmarks are text-only.
Simulated tools in prior benchmarks only evaluate isolated steps, failing to test end-to-end execution reliability.

Concrete Example: A typical benchmark might ask 'Use Google Search to find the 2024 QS ranking of Tsinghua,' making the plan obvious. GTA asks 'What is the 2024 QS ranking of Tsinghua?' (implicit tool need) accompanied by a relevant screenshot, requiring the agent to deduce the need for search and visual interpretation.
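The explicit/implicit distinction in the example above can be illustrated with a simple heuristic. This is a minimal sketch, not something taken from the paper: the tool names and the `mentions_tool` helper are hypothetical, and the check only flags queries that name a tool outright.

```python
# Hypothetical tool names, for illustration only (not GTA's actual tool list).
TOOL_NAMES = ("google search", "ocr", "calculator")


def mentions_tool(query: str) -> bool:
    """Return True if the query names one of the known tools explicitly."""
    lowered = query.lower()
    return any(name in lowered for name in TOOL_NAMES)


explicit = "Use Google Search to find the 2024 QS ranking of Tsinghua."
implicit = "What is the 2024 QS ranking of Tsinghua?"

print(mentions_tool(explicit))  # True
print(mentions_tool(implicit))  # False
```

A query that passes this check hands the plan to the agent; a query that fails it leaves tool selection to the agent's own reasoning, which is the case GTA targets.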

Key Novelty

Realistic Implicit Tool-Use Evaluation (GTA)

Constructs queries using human design rather than AI generation to ensure goals are clear but execution steps are implicit, forcing agents to plan rather than just follow instructions.
Integrates 14 real deployed tools across perception, operation, logic, and creativity categories, executing actual code rather than simulating outputs.
Incorporates authentic multimodal inputs (images, code snippets, tables) as essential context for the queries.

Architecture

The dataset construction pipeline for GTA.

Evaluation Highlights

GPT-4 achieves a success rate of less than 50% on GTA tasks, highlighting the difficulty of real-world implicit planning.
Most mainstream LLMs achieve a success rate below 25%, indicating a significant gap between current capabilities and general agent requirements.

Breakthrough Assessment

8/10

Significantly raises the bar for agent evaluation by moving away from 'toy' synthetic tasks to implicit, executable, multimodal real-world scenarios that expose actual model failures.

⚙️ Technical Details

Problem Definition

Setting: Multimodal tool-use planning and execution

Inputs: A set of image files F and a natural language query Q (subjective, objective, or image generation)

Outputs: A tool invocation chain C and a final answer A
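The inputs and outputs above can be written down as plain data structures. The following is a sketch under assumed names: `Task`, `ToolCall`, and `AgentOutput` and their fields are hypothetical and do not reflect the schema used in the GTA repository.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToolCall:
    """One step of the tool invocation chain C."""
    tool: str
    arguments: Dict[str, Any]
    returned: str = ""


@dataclass
class Task:
    """Inputs: a set of image files F and a natural language query Q."""
    image_files: List[str]
    query: str
    query_type: str  # "subjective", "objective", or "image_generation"


@dataclass
class AgentOutput:
    """Outputs: a tool invocation chain C and a final answer A."""
    tool_chain: List[ToolCall] = field(default_factory=list)
    final_answer: str = ""


task = Task(["screenshot.png"], "What is the 2024 QS ranking of Tsinghua?", "objective")
output = AgentOutput()
output.tool_chain.append(ToolCall("OCR", {"image": task.image_files[0]}))
print(len(output.tool_chain))  # 1
```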

Pipeline Flow

Input Processing (User Query + Image Context)
Planning (LLM Controller decides tool chain)
Execution (Real Deployed Tools run steps)
Response Generation (Final Answer)
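The four stages above can be sketched as a small loop. This is an illustrative toy, not the paper's implementation: `fake_ocr`, `fake_calculator`, and the fixed `plan` function are hypothetical stand-ins for the deployed tools and the LLM controller.

```python
from typing import Callable, Dict, List, Tuple


def fake_ocr(image_path: str) -> str:
    # Stand-in for a real OCR tool; returns canned text.
    return "3 apples at $2 each"


def fake_calculator(expression: str) -> str:
    # Stand-in calculator that supports only "a*b" with integers.
    left, right = expression.split("*")
    return str(int(left) * int(right))


TOOLS: Dict[str, Callable[[str], str]] = {
    "OCR": fake_ocr,
    "Calculator": fake_calculator,
}


def plan(query: str, images: List[str]) -> List[Tuple[str, str]]:
    """Stand-in for the LLM controller: returns a fixed (tool, argument) chain."""
    return [("OCR", images[0]), ("Calculator", "3*2")]


def run_agent(query: str, images: List[str]):
    chain = []
    observation = ""
    for tool_name, argument in plan(query, images):   # Planning
        observation = TOOLS[tool_name](argument)      # Execution
        chain.append((tool_name, argument, observation))
    return chain, observation                         # Response generation


chain, answer = run_agent("How much do the apples cost in total?", ["receipt.png"])
print(answer)  # 6
```

A real controller would typically choose each step after seeing the previous observation rather than committing to the whole chain up front; the fixed plan here only keeps the example short.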

System Modules

LLM Controller

Central brain that reasons about the query, selects tools from the library, and plans the sequence of actions

Model or implementation: Various evaluated LLMs (e.g., GPT-4)

Tool Library

Executes the specific functions required by the plan

Model or implementation: 14 Real Deployed Tools
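One way to organize a tool library for the controller is a registry keyed by category. The sketch below assumes the four categories named in this summary (perception, operation, logic, creativity); the tool names, the `register` decorator, and the `lookup` helper are hypothetical and only stand in for the 14 deployed tools.

```python
from collections import defaultdict
from typing import Callable, Dict

VALID_CATEGORIES = {"perception", "operation", "logic", "creativity"}
REGISTRY: Dict[str, Dict[str, Callable[[str], str]]] = defaultdict(dict)


def register(category: str, name: str):
    """Decorator that files a tool function under a category and name."""
    if category not in VALID_CATEGORIES:
        raise ValueError(f"unknown category: {category}")

    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        REGISTRY[category][name] = fn
        return fn

    return decorator


@register("logic", "Calculator")
def calculator(expression: str) -> str:
    # Toy calculator: sums integers separated by "+".
    return str(sum(int(term) for term in expression.split("+")))


@register("perception", "ImageDescriber")
def describe_image(image_path: str) -> str:
    # Placeholder; a real tool would run a vision model.
    return f"(description of {image_path})"


def lookup(name: str) -> Callable[[str], str]:
    """Find a tool by name across all categories."""
    for tools in REGISTRY.values():
        if name in tools:
            return tools[name]
    raise KeyError(name)


print(lookup("Calculator")("1+2"))  # 3
```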

Novel Architectural Elements

Integration of real executable tools (not mocked) into the evaluation loop
Mandatory multimodal context processing for tool selection

Modeling

Base Model: Evaluates 16 mainstream LLMs (including GPT-4)

Comparison to Prior Work

vs. ToolBench: GTA uses human-designed queries with implicit steps instead of AI-generated explicit instructions
vs. GAIA: GTA focuses specifically on tool agents with executable tool chains and multimodal inputs, whereas GAIA is a broader evaluation of general AI assistants
vs. API-Bank: GTA includes real multimodal inputs (images) and deployed tools, whereas API-Bank is primarily text-based

Limitations

Evaluation is time-sensitive for search queries (answers may change over time), requiring constraints on query design.
Reliance on specific deployed tools limits the scope to the 14 provided tools.
Subjective queries require reference answers which may not cover all valid variations.
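The last limitation is easier to see with a scoring sketch. The code below is an assumption-laden illustration, not the paper's scoring method: it uses normalized exact match for objective answers and, as a stand-in similarity measure, `difflib.SequenceMatcher` against a list of reference answers for subjective ones. The function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def score_objective(prediction: str, gold: str) -> bool:
    """Objective queries have a unique answer, so compare after normalization."""
    return normalize(prediction) == normalize(gold)


def score_subjective(prediction: str, references: List[str]) -> float:
    """Best character-level similarity to any reference (stand-in metric)."""
    if not references:
        raise ValueError("at least one reference answer is required")
    pred = normalize(prediction)
    return max(SequenceMatcher(None, pred, normalize(ref)).ratio() for ref in references)


print(score_objective("  Top 20 ", "top 20"))  # True
print(score_subjective("a red car", ["a red car", "a crimson automobile"]))  # 1.0
```

A valid answer phrased differently from every reference would score low under any measure of this kind, which is the coverage problem the limitation describes.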

Reproducibility

Code: https://github.com/open-compass/GTA (publicly available)

The dataset contains 229 human-designed samples including images, queries, and executable ground-truth tool chains.
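Because each sample ships with a ground-truth tool chain, a predicted chain can be compared against it step by step. The sketch below is a hypothetical illustration: the dictionary layout and the `tool_match_rate` function are assumptions, not the dataset's schema or a metric defined in the paper.

```python
from typing import Dict, List


def tool_match_rate(predicted: List[Dict], reference: List[Dict]) -> float:
    """Fraction of reference steps whose tool name matches the prediction at the same position."""
    if not reference:
        raise ValueError("reference chain must not be empty")
    matches = sum(1 for p, r in zip(predicted, reference) if p["tool"] == r["tool"])
    return matches / len(reference)


# Hypothetical chains; tool names and fields are illustrative only.
reference = [
    {"tool": "OCR", "args": {"image": "menu.png"}, "returns": "Latte $4"},
    {"tool": "Calculator", "args": {"expression": "4*2"}, "returns": "8"},
]
predicted = [
    {"tool": "OCR", "args": {"image": "menu.png"}, "returns": "Latte $4"},
    {"tool": "GoogleSearch", "args": {"query": "latte price"}, "returns": ""},
]

print(tool_match_rate(predicted, reference))  # 0.5
```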

📊 Experiments & Results

Evaluation Setup

End-to-end task execution using a library of 14 tools across perception, operation, logic, and creativity.

Benchmarks:

GTA (Multimodal Tool Use) [New]

Metrics:

Task Completion Rate (Success Rate)
Statistical methodology: Not explicitly reported in the paper
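As a worked example of the metric, the task completion rate is the percentage of tasks that end in success. The helper name `success_rate` and the outcome list are illustrative assumptions.

```python
from typing import List


def success_rate(outcomes: List[bool]) -> float:
    """Percentage of tasks completed successfully."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


# Four hypothetical task outcomes: two successes, two failures.
outcomes = [True, False, False, True]
print(success_rate(outcomes))  # 50.0
```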

Key Results

Overall performance on the GTA benchmark highlights a significant gap between current SOTA models and the requirements of real-world tool agents.

Benchmark	Metric	Model	Reported Result
GTA	Task Completion Rate	GPT-4	< 50%
GTA	Task Completion Rate	Most mainstream LLMs	< 25%

Experiment Figures

Overview of the GTA benchmark content, displaying the tool categories and examples of multimodal queries.

Main Takeaways

Real-world queries with implicit steps are significantly harder than AI-generated explicit queries found in previous benchmarks.
Multimodal context is a bottleneck; models struggle to integrate visual information into tool planning.
There is a massive performance gap, with even the best model (GPT-4) failing more than half the time, suggesting current agents are not yet 'general' tool users.

📚 Prerequisite Knowledge

Prerequisites

Agentic AI frameworks (ReAct, AutoGPT)
Tool Learning / API integration
Multimodal Large Language Models (MLLMs)

Key Terms

Implicit tool-use: Scenarios where the user does not specify which tools to use, requiring the agent to reason and select appropriate tools autonomously.

Tool Chain: A sequence of steps where each step involves a tool name, arguments, and return values needed to solve a complex task.

ReAct: Reason+Act, a prompting paradigm in which LLMs interleave reasoning traces with actions.

Subjective query: Tasks where the answer is descriptive text (not unique but conceptually consistent), evaluated against reference answers.

Objective query: Tasks where the answer is a uniquely determined number or phrase.

OCR: Optical Character Recognition, the conversion of images of text into machine-encoded text.
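To make the ReAct term concrete, the sketch below parses one model turn written in the common "Thought / Action / Action Input" layout. That layout is a widely used convention and is assumed here; this summary does not specify the prompt format used in GTA, and `parse_react_step` is a hypothetical helper.

```python
import re

STEP_PATTERN = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*"
    r"Action:\s*(?P<action>.*?)\s*"
    r"Action Input:\s*(?P<input>.*)",
    re.DOTALL,
)


def parse_react_step(text: str) -> dict:
    """Split one ReAct-style turn into its thought, action, and action input."""
    match = STEP_PATTERN.search(text)
    if match is None:
        raise ValueError("text is not in Thought/Action/Action Input form")
    return {key: value.strip() for key, value in match.groupdict().items()}


turn = (
    "Thought: I need to read the text in the image.\n"
    "Action: OCR\n"
    "Action Input: receipt.png"
)
step = parse_react_step(turn)
print(step["action"])  # OCR
```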