
Tool Learning with Large Language Models: A Survey

Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, Ji-Rong Wen
Renmin University of China
arXiv (2024)
Agent Benchmark Reasoning RAG MM

📝 Paper Summary

Tool Learning, Agentic AI, Multi-call tool use with flexible plan
This survey systematizes tool learning with LLMs into a four-stage workflow (planning, selection, calling, generation) and categorizes existing methods by whether they require model tuning or operate tuning-free.
Core Problem
Despite rapid progress, the literature on LLM tool learning is fragmented and lacks a unified taxonomy describing how models plan for, select, and call external tools.
Why it matters:
  • LLMs hallucinate and rely on outdated knowledge; integrating tools can improve reliability, but implementations vary widely across papers
  • Newcomers face barriers to entry due to inconsistent terminology (e.g., distinguishing tools vs. APIs) and scattered evaluation benchmarks
  • Existing surveys often treat tool use as a sub-feature of agents or reasoning rather than a dedicated paradigm with its own distinct workflow stages
Concrete Example: An LLM asked to 'calculate 13^4' might hallucinate a plausible-looking but wrong number. A tool-augmented LLM must recognize the need for a calculator, generate the API call, execute it, and integrate the exact result (13^4 = 28,561). This multi-stage process fails if any single stage (planning, selection, calling, or response generation) is flawed.
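The calculator example can be sketched as a toy pipeline. This is a minimal illustration under stated assumptions, not the survey's implementation: simple rules stand in for the LLM's decisions at each stage, and every name here (`TOOLS`, `needs_tool`, `select_tool`, `build_arguments`, `answer`) is hypothetical.

```python
import ast
import operator

# Arithmetic operators the toy calculator accepts (an illustrative subset).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


def calculator(expression):
    """Evaluate a small arithmetic expression by walking its AST (no eval)."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)


# Hypothetical tool registry; a real system would hold many tools.
TOOLS = {"calculator": calculator}


def needs_tool(query):
    """Stage 1, task planning: a toy rule standing in for the LLM's decision."""
    return any(ch.isdigit() for ch in query)


def select_tool(query):
    """Stage 2, tool selection: trivial here because only one tool exists."""
    return "calculator"


def build_arguments(query):
    """Stage 3, tool calling: extract the parameters for the chosen tool."""
    expression = query.lower().replace("calculate", "").strip()
    return {"expression": expression.replace("^", "**")}


def answer(query):
    """Run all four stages and return the final response."""
    if not needs_tool(query):
        return "No tool call needed."
    tool_name = select_tool(query)
    arguments = build_arguments(query)
    result = TOOLS[tool_name](**arguments)
    # Stage 4, response generation: fold the exact result into the reply.
    return f"The result is {result}."


print(answer("calculate 13^4"))  # The result is 28561.
```

Each stage is a separate function so that a failure can be traced to one step: a wrong `needs_tool` decision skips the call entirely, while a wrong `build_arguments` output produces a well-formed call with the wrong expression.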
Key Novelty
Systematic Taxonomy of the Tool Learning Workflow
  • Decomposes the tool learning process into four distinct stages: Task Planning (intent detection/decomposition), Tool Selection (finding the right API), Tool Calling (generating parameters), and Response Generation (integrating results)
  • Categorizes methods within these stages into 'tuning-free' (prompt engineering, in-context learning) vs. 'tuning-based' (fine-tuning, reinforcement learning) approaches, providing a clear structural framework for the field
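As a rough illustration of a tuning-free approach to the Tool Selection stage, the sketch below ranks tools by word overlap between the user query and each tool description, then places the selected descriptions in a prompt. The Jaccard scoring, the tool catalogue, and the names `select_tools` and `build_prompt` are assumptions made for this example; the survey covers a range of selection methods, and a practical system would more likely use a learned retriever or the LLM itself.

```python
def tokenize(text):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical tool catalogue: tool name -> natural-language description.
TOOL_DESCRIPTIONS = {
    "calculator": "evaluate an arithmetic expression and return the number",
    "weather": "get the current weather forecast for a city",
    "search": "search the web for recent news and facts",
}


def select_tools(query, descriptions, top_k=1):
    """Return the top_k tool names whose descriptions best match the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        descriptions,
        key=lambda name: jaccard(query_tokens, tokenize(descriptions[name])),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, descriptions, selected):
    """Tuning-free use: expose the selected tools to the model in the prompt."""
    lines = ["You can call the following tools:"]
    for name in selected:
        lines.append(f"- {name}: {descriptions[name]}")
    lines.append(f"User request: {query}")
    return "\n".join(lines)


query = "what is the weather forecast for Paris"
chosen = select_tools(query, TOOL_DESCRIPTIONS)
print(build_prompt(query, TOOL_DESCRIPTIONS, chosen))
```

No model weights change anywhere in this sketch, which is what places it on the tuning-free side of the taxonomy; a tuning-based counterpart would instead train the retriever or the LLM on tool-use data.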
Evaluation Highlights
  • Compiles over 30 benchmarks, categorizing them into general tool use (e.g., ToolBench, APIBench) and domain-specific tasks (e.g., ToolQA, ToolSandbox)
  • Identifies that while general benchmarks like ToolBench cover broad API landscapes, newer benchmarks focus on safety (ToolSword) and robustness (RoTBench)
  • Highlights that pass rate and win rate are the dominant metrics for overall tool use, while response generation is often evaluated with standard NLP metrics such as BLEU and ROUGE-L
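The metrics named above can be approximated in a few lines. This is a simplified sketch, not any benchmark's official scorer: it assumes pass rate is the fraction of tasks completed, win rate is the fraction of pairwise judgments won, and ROUGE-L is a balanced F1 over the longest common subsequence of whitespace tokens. All function names are hypothetical.

```python
def pass_rate(outcomes):
    """Fraction of tasks completed successfully (outcomes is a list of bools)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def win_rate(judgments):
    """Fraction of pairwise comparisons labeled 'win' for the candidate."""
    if not judgments:
        return 0.0
    return sum(1 for j in judgments if j == "win") / len(judgments)


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as balanced F1 over whitespace tokens (a simplification)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(pass_rate([True, False, True, True]))
print(win_rate(["win", "loss", "win", "tie"]))
print(rouge_l_f1("the answer is 28561", "the result is 28561"))
```

Pass rate and win rate score the whole tool-use trajectory, whereas ROUGE-L only compares the final text against a reference, so a response can score well on one and poorly on the other.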
Breakthrough Assessment
8/10
A comprehensive foundational survey that organizes a chaotic field. While it doesn't propose a new model, its taxonomy is likely to become the standard reference for future tool learning research.