Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models

📝 Paper Summary

Multi-call tool use with fixed plan
Invoking internalized APIs

LLMs can effectively use new tools zero-shot by reading documentation (manuals) rather than relying on few-shot demonstrations, enabling scalable use of hundreds of unseen APIs.

Core Problem

Current LLM tool-use relies on few-shot demonstrations, which are hard to acquire, difficult to select without bias, and combinatorially intractable as the number of available tools scales up.

Why it matters:

Selecting the 'right' few-shot demonstrations is difficult, and a biased selection can degrade performance
Providing demonstrations for hundreds of tools exceeds context windows and requires immense manual curation effort
Real-world APIs change frequently; maintaining up-to-date demonstrations for every version is impractical compared to using existing documentation

Concrete Example: When using a new cloud CLI tool, an LLM relying on few-shot demos might hallucinate a '-P' flag for port specification based on familiar Linux commands (scp), whereas an LLM reading the documentation correctly identifies the specific '--port' flag required by the new tool.

Key Novelty

Documentation-based Zero-Shot Tool Use

Replace few-shot input-output examples (demonstrations) with textual descriptions of tool functionality and usage (documentation) in the prompt
Enable 'plug-and-play' usage of completely new tools (e.g., GroundingDINO, Track Anything) by simply pasting their README/docs into the context, without curating specific demos
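
The contrast between the two prompting styles can be sketched as follows. This is a minimal illustration, not the paper's actual prompt templates: the `Tool` class, both builder functions, the prompt layout, and the two toy tool descriptions are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    doc: str  # free-text description, e.g. an excerpt of a README (hypothetical)


def build_demo_prompt(instruction: str, demos: list[tuple[str, str]]) -> str:
    """Few-shot style: the prompt shows (instruction, plan) example pairs."""
    shots = "\n\n".join(f"Instruction: {q}\nPlan: {p}" for q, p in demos)
    return f"{shots}\n\nInstruction: {instruction}\nPlan:"


def build_doc_prompt(instruction: str, tools: list[Tool]) -> str:
    """Zero-shot style: the prompt shows tool documentation and no examples."""
    docs = "\n\n".join(f"Tool: {t.name}\nDocumentation: {t.doc}" for t in tools)
    return f"{docs}\n\nInstruction: {instruction}\nPlan:"


# Toy stand-ins; these are not the real GroundingDINO or SAM interfaces.
tools = [
    Tool("detect", "detect(image, text) returns boxes for objects matching the text."),
    Tool("segment", "segment(image, boxes) returns one mask per input box."),
]
prompt = build_doc_prompt("Segment every dog in the photo.", tools)
print(prompt)
```

Adding a new tool under this layout only means appending one more `Tool` entry with its documentation text; no example plans need to be written.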

Architecture

Contrast between Demonstration-based prompting (Left) and Documentation-based prompting (Right) for tool use.

Evaluation Highlights

Zero-shot usage with documentation achieves comparable or better performance than few-shot usage on ScienceQA (79.91 vs 78.54) and TabMWP (92.69 vs 89.28)
On a new dataset of 200 unseen Google Cloud CLI tools, documentation-based prompting outperforms few-shot demonstrations by ~2.4x (F1 score 0.45 vs 0.19)
Successfully 're-invents' state-of-the-art pipelines like Grounded-SAM and Track Anything zero-shot by combining documentation from constituent tools (SAM, GroundingDINO, XMem)

Breakthrough Assessment

7/10

Strong empirical evidence that documentation is a more scalable alternative to demonstrations for tool use. While the method is simple prompt engineering, the finding significantly lowers the barrier for deploying LLMs with massive, unseen toolsets.

⚙️ Technical Details

Problem Definition

Setting: Tool-use planning where an LLM generates a program/sequence of tool calls given an instruction and a set of available tools.

Inputs: Natural language instruction x, Tool set T (containing documentation D but no demonstrations)

Outputs: Executable plan/program p involving calls to tools in T

Pipeline Flow

Input Instruction Processing
Tool Retrieval (for large toolsets)
Prompt Construction (Instruction + Tool Documentation)
LLM Planning
Program Execution
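
The five stages above can be wired together roughly as in the sketch below. Everything here is an assumption for illustration: the function name `run_pipeline`, the callable-based interfaces, the prompt layout, and the stub retriever, model, and executor. A real system would substitute an actual LLM call and real tools.

```python
from typing import Callable


def run_pipeline(
    instruction: str,
    tool_docs: dict[str, str],
    retrieve: Callable[[str, dict[str, str]], list[str]],
    llm: Callable[[str], str],
    execute: Callable[[str], str],
) -> str:
    # 1. Input instruction processing (here: whitespace normalization only).
    instruction = " ".join(instruction.split())
    # 2. Tool retrieval: pick the tool names whose docs go into the prompt.
    selected = retrieve(instruction, tool_docs)
    # 3. Prompt construction: instruction plus the selected documentation.
    docs = "\n\n".join(f"{name}:\n{tool_docs[name]}" for name in selected)
    prompt = f"{docs}\n\nInstruction: {instruction}\nPlan:"
    # 4. LLM planning.
    plan = llm(prompt)
    # 5. Program execution.
    return execute(plan)


seen_prompts: list[str] = []


def keep_all(instruction: str, tool_docs: dict[str, str]) -> list[str]:
    """Stub retriever: small toolsets fit in context, so keep every tool."""
    return sorted(tool_docs)


def fake_llm(prompt: str) -> str:
    """Stub model: records the prompt and returns a fixed plan."""
    seen_prompts.append(prompt)
    return "echo hello"


def fake_execute(plan: str) -> str:
    """Stub executor: reports the plan instead of running it."""
    return f"ran: {plan}"


tool_docs = {"echo": "echo TEXT prints TEXT to standard output."}
result = run_pipeline("  print   a greeting ", tool_docs, keep_all, fake_llm, fake_execute)
print(result)
```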

System Modules

Tool Retriever

Select relevant tools from a large library when they don't all fit in context (used for Cloud CLI task)

Model or implementation: TF-IDF search
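
The summary names TF-IDF search but not its exact form, so the following is only one plausible sketch: smoothed IDF weights with cosine similarity, written in plain Python. The function names, the tokenizer, and the three toy tool descriptions are hypothetical and are not taken from the Google Cloud CLI documentation.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_retrieve(query: str, docs: dict[str, str], k: int = 2) -> list[str]:
    """Return the names of the k docs most similar to the query (cosine over TF-IDF)."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n = len(docs)
    # Document frequency: in how many docs each term appears.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    # Smoothed inverse document frequency (an assumed weighting choice).
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

    def vectorize(toks: list[str]) -> dict[str, float]:
        tf = Counter(t for t in toks if t in idf)  # drop terms unseen in the docs
        return {t: c * idf[t] for t, c in tf.items()}

    def cosine(a: dict[str, float], b: dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vectorize(tokenize(query))
    ranked = sorted(docs, key=lambda name: cosine(q, vectorize(tokens[name])), reverse=True)
    return ranked[:k]


docs = {
    "storage-copy": "Copy files between local disk and a storage bucket.",
    "vm-create": "Create a virtual machine instance in a zone.",
    "vm-ssh": "Open an ssh session to a virtual machine instance.",
}
print(tfidf_retrieve("create a new virtual machine", docs, k=2))
```

Only the documentation of the returned tools would then be placed in the prompt, which keeps the context size bounded as the library grows.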

LLM Planner

Generate a sequence of tool calls based on the instruction and provided documentation

Model or implementation: gpt-3.5-turbo (ChatGPT) or text-davinci-002

Program Executor

Execute the generated plan using the actual tools/APIs

Model or implementation: Various underlying tools (e.g., Python interpreter, GroundingDINO, Google Cloud CLI)
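
As a rough sketch of what executing a generated plan can look like, the example below dispatches each step to a registered Python callable. The plan representation (`Step` tuples), the `execute_plan` function, and the toy `detect` and `segment` callables are assumptions for illustration; they are not the real GroundingDINO or SAM interfaces, and the source does not specify a plan format.

```python
from typing import Any, Callable

# Hypothetical plan step: (output variable, tool name, argument variable names).
Step = tuple[str, str, list[str]]


def execute_plan(
    plan: list[Step],
    tools: dict[str, Callable[..., Any]],
    inputs: dict[str, Any],
) -> dict[str, Any]:
    """Run each step in order, storing results so later steps can reuse them."""
    env = dict(inputs)
    for out_var, tool_name, arg_names in plan:
        if tool_name not in tools:
            raise KeyError(f"plan references unknown tool: {tool_name}")
        env[out_var] = tools[tool_name](*(env[a] for a in arg_names))
    return env


# Toy stand-ins that only build strings, so the data flow is easy to follow.
tools = {
    "detect": lambda image, text: [f"box({text})"],
    "segment": lambda image, boxes: [f"mask:{b}" for b in boxes],
}
plan = [
    ("boxes", "detect", ["image", "query"]),
    ("masks", "segment", ["image", "boxes"]),
]
env = execute_plan(plan, tools, {"image": "img.png", "query": "dog"})
print(env["masks"])
```

Chaining a detector into a segmenter in this way mirrors, in miniature, the kind of composition the summary describes for Grounded-SAM.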

Novel Architectural Elements

Documentation-only prompting strategy: Constructing prompts exclusively from tool definitions/manuals without input-output exemplars for planning

Modeling

Base Model: gpt-3.5-turbo (ChatGPT) and text-davinci-002

Compute: Not reported in the paper (Inference-only approach)

📊 Experiments & Results

Evaluation Setup

Zero-shot and few-shot tool-use planning across text, vision, and API tasks

Benchmarks:

ScienceQA (Multi-modal Science Question Answering)
TabMWP (Tabular Math Word Problems)
NLVRv2 (Visual Reasoning)
LLM Cloud CLI (API Usage / Command Line Generation) [New]

Metrics:

Accuracy (ScienceQA, TabMWP, NLVRv2)
Command-line level F1 score (LLM Cloud CLI)
Statistical methodology: Not explicitly reported in the paper
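
The summary does not define how the command-line level F1 is computed. As one possible reading, labeled here as an assumption, the sketch below scores a predicted command against a reference by token overlap. The function name `token_f1` and the `tool copy` commands are hypothetical and do not correspond to a real CLI.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Token-overlap F1 between two whitespace-tokenized command lines."""
    pred, ref = predicted.split(), reference.split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both commands.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# A wrong flag ('-P' instead of '--port') costs one of six tokens on each side.
score = token_f1("tool copy -P 2222 src dst", "tool copy --port 2222 src dst")
print(round(score, 4))
```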

Key Results

Comparative analysis on standard benchmarks: zero-shot with documentation matches or beats the few-shot baseline on ScienceQA and TabMWP, but trails it on NLVRv2 in the figures listed here.
Benchmark	Metric	Baseline	This Paper	Δ
ScienceQA	Accuracy	78.54	79.91	+1.37
TabMWP	Accuracy	89.28	92.69	+3.41
NLVRv2	Accuracy	76.30	63.40	-12.90
Results on the newly collected LLM Cloud CLI benchmark, demonstrating scalability to unseen tools.
Benchmark	Metric	Baseline	This Paper	Δ
LLM Cloud CLI	F1 Score	0.19	0.45	+0.26
LLM Cloud CLI	F1 Score	0.05	0.37	+0.32
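
The Δ column and the relative gains can be rechecked with a few lines of arithmetic. The helper name `delta_and_ratio` and the row labels are only for illustration; the numbers are the ones listed in the table.

```python
def delta_and_ratio(baseline: float, ours: float) -> tuple[float, float]:
    """Return (absolute difference, ratio), each rounded to two decimals."""
    return round(ours - baseline, 2), round(ours / baseline, 2)


rows = [
    ("ScienceQA", 78.54, 79.91),
    ("TabMWP", 89.28, 92.69),
    ("NLVRv2", 76.30, 63.40),
    ("LLM Cloud CLI (row 1)", 0.19, 0.45),
    ("LLM Cloud CLI (row 2)", 0.05, 0.37),
]
for name, baseline, ours in rows:
    delta, ratio = delta_and_ratio(baseline, ours)
    print(f"{name}: delta={delta:+.2f}, ratio={ratio:.2f}x")
```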

Experiment Figures

Performance curves (Accuracy) vs. Number of Demonstrations for three datasets (ScienceQA, TabMWP, NLVRv2) with and without documentation.

F1 Score on LLM Cloud CLI vs. Documentation Length (number of words) for zero-shot models.

Main Takeaways

Zero-shot prompts with tool documentation perform on par with or better than few-shot prompts on ScienceQA and TabMWP; the NLVRv2 accuracy listed in Key Results is lower (63.40 vs 76.30).
Documentation is far more effective than demonstrations for scaling to large toolsets (e.g., the 200 Cloud CLI tools), where demonstrations suffer from poor coverage.
LLMs can 're-invent' complex pipelines (like Grounded-SAM) simply by reading the documentation of component tools, without needing explicit workflow demonstrations.
Performance improves with documentation length up to ~600 words, after which it degrades, suggesting a sweet spot for tool description detail.
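
One simple way to act on the documentation-length finding is to cap each tool's documentation at a word budget before building the prompt. This is a sketch under assumptions: the helper name `truncate_words`, the default of 600 words, and keeping the first words rather than summarizing are illustrative choices, not something the summary states.

```python
def truncate_words(doc: str, max_words: int = 600) -> str:
    """Keep at most the first max_words whitespace-separated words.

    Note: joining on single spaces collapses newlines and repeated spaces.
    """
    words = doc.split()
    return " ".join(words[:max_words])


long_doc = "word " * 1000
short = truncate_words(long_doc)
print(len(short.split()))
```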

📚 Prerequisite Knowledge

Prerequisites

Familiarity with In-Context Learning (ICL) and Few-Shot Prompting
Understanding of LLM tool-use paradigms (e.g., ReAct, Toolformer)
Basic knowledge of vision-language tasks (VQA, Visual Grounding)

Key Terms

Demonstrations (demos): Few-shot examples of <input, tool-plan> pairs provided in the prompt to teach the model how to use tools

Documentation (docs): Textual descriptions of a tool's functionality, inputs, and parameters (similar to a README or man page) provided in the prompt

VisProg: Visual Programming, a framework where LLMs generate Python-like modular programs to solve visual tasks

GroundingDINO: An open-set object detector that can detect objects based on arbitrary text descriptions

SAM: Segment Anything Model—a promptable image segmentation model capable of generating masks for any object

XMem: A video object segmentation model used for tracking objects across video frames

TF-IDF: Term Frequency-Inverse Document Frequency—a statistical method used here to retrieve relevant tool documentation based on the input query

Zero-shot: Evaluating the model's ability to solve a task without seeing any specific examples (demonstrations) of that task in the prompt