Tool Learning with Large Language Models: A Survey

📝 Paper Summary

Tool Learning · Agentic AI · Multi-call tool use with flexible plan

This survey systematizes tool learning with LLMs into a four-stage workflow (planning, selection, calling, generation) and categorizes existing methods by whether they require model tuning or operate tuning-free.

Core Problem

Despite rapid advances, the literature on LLM tool learning is fragmented and lacks a unified taxonomy for how models plan, select, and execute external tools.

Why it matters:

LLMs suffer from hallucinations and outdated knowledge; integrating tools is essential for reliability but implementation varies widely across papers
Newcomers face barriers to entry due to inconsistent terminology (e.g., distinguishing tools vs. APIs) and scattered evaluation benchmarks
Existing surveys often treat tool use as a sub-feature of agents or reasoning rather than a dedicated paradigm with its own distinct workflow stages

Concrete Example: An LLM asked to 'calculate 13^4' might hallucinate a plausible-looking but wrong number (the correct value is 28,561). A tool-augmented LLM must recognize the need for a calculator, generate the API call, execute it, and integrate the exact result. This multi-stage process fails if any single component (planning, selection, or execution) is flawed.

Key Novelty

Systematic Taxonomy of the Tool Learning Workflow

Decomposes the tool learning process into four distinct stages: Task Planning (intent detection/decomposition), Tool Selection (finding the right API), Tool Calling (generating parameters), and Response Generation (integrating results)
Categorizes methods within these stages into 'tuning-free' (prompt engineering/ICL) vs. 'tuning-based' (fine-tuning/RL) approaches, providing a clear structural framework for the field

Evaluation Highlights

Compiles over 30 benchmarks, categorizing them into general tool use (e.g., ToolBench, APIBench) and domain-specific tasks (e.g., ToolQA, ToolSandbox)
Identifies that while general benchmarks like ToolBench cover broad API landscapes, newer benchmarks focus on safety (ToolSword) and robustness (RoTBench)
Highlights that pass rate and win rate are dominant metrics, but response generation is often evaluated with standard NLP metrics like BLEU and ROUGE-L

Breakthrough Assessment

8/10

A comprehensive foundational survey that organizes a chaotic field. While it doesn't propose a new model, its taxonomy is likely to become the standard reference for future tool learning research.

⚙️ Technical Details

Problem Definition

Setting: Augmenting LLMs with external interfaces (tools/APIs) to solve tasks requiring dynamic interaction, current knowledge, or specialized computation

Inputs: User query q and a set of available tools T

Outputs: Final response r generated after executing a sequence of tool calls

Pipeline Flow

Task Planning (Decompose user query)
Tool Selection (Retrieve/Choose relevant tools)
Tool Calling (Generate API parameters and execute)
Response Generation (Integrate tool outputs into final answer)
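The four stages above can be sketched end to end on the survey's 'calculate 13^4' example. This is a minimal illustration, not the survey's method: simple rules stand in for the LLM at every stage, and all names (`plan`, `select_tool`, `call_tool`, `generate_response`, `TOOLS`) are hypothetical.

```python
import ast
import operator
import re

# Arithmetic operators the toy calculator tool accepts.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow}

def calculator(expression):
    """Evaluate a small arithmetic expression without using eval()."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)

# Hypothetical tool registry: tool name -> callable.
TOOLS = {"calculator": calculator}

def plan(query):
    # Stage 1, task planning: a regex stands in for LLM intent detection.
    match = re.search(r"\d+(?:\s*[-+*/^]\s*\d+)+", query)
    return [{"kind": "arithmetic", "expression": match.group(0)}] if match else []

def select_tool(subtask):
    # Stage 2, tool selection: a lookup table stands in for retrieval.
    return {"arithmetic": "calculator"}[subtask["kind"]]

def call_tool(tool_name, subtask):
    # Stage 3, tool calling: build the arguments, then execute the tool.
    arguments = {"expression": subtask["expression"].replace("^", "**")}
    return TOOLS[tool_name](**arguments)

def generate_response(query, subtasks, results):
    # Stage 4, response generation: fold tool outputs into the answer.
    if not results:
        return "No tool was needed for: " + query
    return "; ".join(f"{s['expression']} = {r}" for s, r in zip(subtasks, results))

def answer(query):
    subtasks = plan(query)
    results = [call_tool(select_tool(s), s) for s in subtasks]
    return generate_response(query, subtasks, results)

print(answer("calculate 13^4"))  # prints: 13^4 = 28561
```

Keeping each stage behind its own function mirrors the survey's decomposition: any single stage can be swapped for a prompted or fine-tuned model without touching the others.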

System Modules

Task Planning

Analyze user intent and decompose complex queries into solvable subtasks

Model or implementation: Various (CoT, ReAct, Plan-and-Solve)

Tool Selection

Select appropriate tools from a pool to fulfill the planned subtasks

Model or implementation: Dense Retrievers (DPR, Contriever) or LLM-based selection
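Retrieval-based selection can be sketched as ranking tool descriptions by similarity to the subtask. The example below is an assumption-heavy stand-in: a bag-of-words vector with cosine similarity replaces a trained dense retriever such as DPR or Contriever, and the tool names and descriptions are invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical tool pool: name -> natural-language description.
TOOL_DESCRIPTIONS = {
    "weather_lookup": "get the current weather forecast for a city",
    "calculator": "evaluate an arithmetic expression with numbers",
    "web_search": "search the web for recent news and pages",
}

def embed(text):
    # Toy "embedding": word counts. A real system would call an encoder model.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_tools(subtask, k=1):
    """Return the k tool names whose descriptions best match the subtask."""
    query_vec = embed(subtask)
    ranked = sorted(
        TOOL_DESCRIPTIONS,
        key=lambda name: cosine(query_vec, embed(TOOL_DESCRIPTIONS[name])),
        reverse=True,
    )
    return ranked[:k]

print(select_tools("what is the weather forecast in Paris"))
```

LLM-based selection would instead place the candidate descriptions in the prompt and ask the model to choose; retrieval is typically used first when the pool is too large to fit in context.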

Tool Calling

Generate valid arguments for the selected tool and execute it

Model or implementation: LLM (often fine-tuned like Gorilla or prompted like GPT-4)
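Because the model emits the call as text, a common safeguard is to validate it against the tool's schema before execution. The sketch below assumes a JSON call format with `name` and `arguments` fields and a hand-written schema (`WEATHER_SCHEMA`); both are hypothetical and not taken from any specific system in the survey.

```python
import json

# Hypothetical schema for one tool: parameter names, types, required fields.
WEATHER_SCHEMA = {
    "name": "weather_lookup",
    "parameters": {"city": str, "days": int},
    "required": ["city"],
}

def validate_call(raw_call, schema):
    """Parse a model-generated call and return (arguments, list of errors)."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError as exc:
        return {}, [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return {}, ["call must be a JSON object"]

    errors = []
    if call.get("name") != schema["name"]:
        errors.append(f"unexpected tool name: {call.get('name')}")
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return {}, errors + ["arguments must be a JSON object"]

    for key in schema["required"]:
        if key not in args:
            errors.append(f"missing required argument: {key}")
    for key, value in args.items():
        expected = schema["parameters"].get(key)
        if expected is None:
            # The model invented a parameter the tool does not accept.
            errors.append(f"unknown argument: {key}")
        elif not isinstance(value, expected):
            errors.append(f"argument {key} should be {expected.__name__}")
    return args, errors

good = '{"name": "weather_lookup", "arguments": {"city": "Paris", "days": 3}}'
bad = '{"name": "weather_lookup", "arguments": {"town": "Paris"}}'
print(validate_call(good, WEATHER_SCHEMA))
print(validate_call(bad, WEATHER_SCHEMA))
```

Returning the error list rather than raising makes it easy to feed the messages back to the model for a corrected call.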

Response Generation

Synthesize the tool execution results with the original context to answer the user

Model or implementation: LLM

Novel Architectural Elements

Taxonomy distinguishes between 'Single Invocation' (one-off tool use) and 'Iterative Invocation' (dynamic multi-turn interaction) paradigms
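The difference between the two paradigms can be shown with a toy control loop. In this sketch a scripted function stands in for the LLM, and the tools and their data (`latest_order`, `order_status`) are invented for illustration.

```python
# Hypothetical tools backed by toy data.
TOOLS = {
    "latest_order": lambda user: {"alice": "order-17"}.get(user, "none"),
    "order_status": lambda order_id: {"order-17": "shipped"}.get(order_id, "unknown"),
}

def single_invocation(calls, tools):
    """Every call is fixed up front, before any tool result is seen."""
    return [tools[name](arg) for name, arg in calls]

def iterative_invocation(policy, tools, max_steps=5):
    """The policy sees earlier observations before choosing the next call."""
    observations = []
    for _ in range(max_steps):
        action = policy(observations)
        if action is None:  # the policy decides it has enough information
            break
        name, arg = action
        observations.append(tools[name](arg))
    return observations

def scripted_policy(observations):
    # Stand-in for an LLM: the second call depends on the first result.
    if not observations:
        return ("latest_order", "alice")
    if len(observations) == 1:
        return ("order_status", observations[0])
    return None

print(single_invocation([("latest_order", "alice")], TOOLS))
print(iterative_invocation(scripted_policy, TOOLS))
```

The second call's argument only exists after the first call returns, which is the case a single up-front plan cannot express. The `max_steps` cap is a design choice to bound latency and cost.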

Comparison to Prior Work

vs. Mialon et al. (2023): This survey provides a more granular breakdown of the workflow stages (planning vs. selection vs. calling) rather than just categorizing tool types
vs. Qin et al. (2023): This survey focuses more extensively on implementation methods ('how') and benefits ('why'), in addition to evaluation
vs. Wang et al. (2024): Structures the review specifically around the four-stage execution pipeline rather than usage scenarios

Limitations

High latency in tool learning workflows due to multiple model calls and API round-trips
Limited real-world benchmarks; most are static or simulated environments rather than live API interactions
Safety concerns regarding autonomous agents executing harmful API calls (e.g., deleting files, spending money)

Reproducibility

Code: https://github.com/quchangle1/LLM-Tool-Survey

Survey paper; compiles existing works. The authors provide a GitHub repository tracking the cited papers and benchmarks.

📊 Experiments & Results

Evaluation Setup

Meta-analysis of evaluation methods in the field of Tool Learning

Benchmarks:

ToolBench (General instruction following with ~16k real APIs)
APIBench (Code generation and API calling correctness)
ToolQA (Question answering requiring external tool use)
ToolAlpaca (Simulated tool use environment)

Metrics:

Pass Rate (success of task completion)
Win Rate (preference vs. baseline, often judged by ChatGPT)
Tool Usage Awareness (accuracy of deciding when to use a tool)
Argument Hallucination Rate
Statistical methodology: Not explicitly reported in the paper
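As a rough illustration, the four metrics above can be computed from per-episode records. The field names and exact definitions below are assumptions made for this sketch; individual benchmarks define these metrics in their own ways.

```python
from dataclasses import dataclass

@dataclass
class EpisodeRecord:
    passed: bool            # task completed successfully
    beat_baseline: bool     # judge preferred this output over the baseline
    tool_needed: bool       # gold label: does the query require a tool?
    tool_used: bool         # did the model actually call a tool?
    hallucinated_args: int  # generated arguments not supported by schema or query
    total_args: int         # all generated arguments

def rate(numerator, denominator):
    return numerator / denominator if denominator else 0.0

def summarize(records):
    n = len(records)
    return {
        "pass_rate": rate(sum(r.passed for r in records), n),
        "win_rate": rate(sum(r.beat_baseline for r in records), n),
        # Awareness: the use/no-use decision matches the gold label.
        "tool_usage_awareness": rate(
            sum(r.tool_needed == r.tool_used for r in records), n),
        "argument_hallucination_rate": rate(
            sum(r.hallucinated_args for r in records),
            sum(r.total_args for r in records)),
    }

records = [
    EpisodeRecord(True, True, True, True, 0, 2),
    EpisodeRecord(True, False, False, False, 0, 0),
    EpisodeRecord(False, False, True, False, 0, 0),
    EpisodeRecord(True, True, True, True, 1, 2),
]
print(summarize(records))
```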

Main Takeaways

Evaluation is shifting from simple text metrics (ROUGE/BLEU) to execution-based metrics (Pass Rate) and model-based evaluation (LLM-as-a-Judge)
A major challenge is the 'Knowledge Acquisition' gap: LLMs must determine when to use a tool and when to rely on their own parametric knowledge
Tuning-free methods (prompting) are easy to adopt but often handle complex API schemas less reliably than tuning-based methods (fine-tuning)
Future directions include unified frameworks, addressing high latency, and developing safer, more robust agents

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs)
Familiarity with In-Context Learning (ICL) and Chain-of-Thought (CoT)
Concept of APIs and function calling

Key Terms

Tool Learning: The paradigm where LLMs interact with external tools (APIs, interpreters) to extend their capabilities beyond parametric knowledge

Tuning-free Methods: Approaches that enable tool use via prompt engineering or in-context learning without updating model weights (e.g., ReAct, CoT)

Tuning-based Methods: Approaches that fine-tune the LLM specifically for tool usage, often using specialized datasets (e.g., Toolformer, Gorilla)

Task Planning: The stage where the LLM decomposes a user query into subtasks or a sequence of necessary actions

Tool Selection: The process of identifying the most appropriate tool from a candidate set to address a specific subtask

DFSDT: Depth-First Search-based Decision Tree, a search strategy used by ToolLLaMA (ToolLLM) to explore multiple reasoning paths and backtrack from failed ones
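The core idea behind DFSDT, depth-first exploration with backtracking, can be sketched as follows. This is a simplified illustration and not the ToolLLM implementation: in the real method an LLM proposes the next actions and decides when to abandon a branch, whereas here a hand-written toy tree (`TOY_TREE`) and the callbacks `expand` and `is_solution` are hypothetical stand-ins.

```python
def dfs_decision_tree(state, expand, is_solution, max_depth=4):
    """Return the first action path that reaches a solution, or None."""
    def visit(current, path):
        if is_solution(current):
            return path
        if len(path) >= max_depth:
            return None  # give up on this branch and backtrack
        for action, next_state in expand(current):
            found = visit(next_state, path + [action])
            if found is not None:
                return found
        return None  # every child failed, so the caller tries a sibling
    return visit(state, [])

# Toy search space: state -> list of (action, resulting state).
TOY_TREE = {
    "start": [("search_flights", "no_results"), ("search_trains", "train_found")],
    "no_results": [],
    "train_found": [("book_ticket", "booked")],
    "booked": [],
}

path = dfs_decision_tree(
    "start",
    expand=lambda s: TOY_TREE[s],
    is_solution=lambda s: s == "booked",
)
print(path)  # prints: ['search_trains', 'book_ticket']
```

The first branch (`search_flights`) dead-ends, so the search backtracks to `start` and tries the sibling action, which a single linear reasoning chain would not do.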

API: Application Programming Interface, a structured way for the LLM to interact with external software

Hallucination: When an LLM generates plausible but factually incorrect information; tool learning aims to mitigate this