Tool Learning in the Wild: Empowering Language Models as Automatic Tool Agents

📝 Paper Summary

Tool-use post-training | Multi-call tool use with flexible plan

AutoTools enables LLMs to automatically convert raw tool documentation into verified, executable Python functions and solve tasks by generating programs, without requiring manual prompt engineering.

Core Problem

Existing tool-use methods rely on manual parsing of documentation and rigid, pre-defined templates (like JSON), which scale poorly to large toolsets and limit flexibility.

Why it matters:

Manually crafting demonstrations for thousands of APIs requires intense domain expertise and effort, creating a bottleneck for scaling tool agents
Fixed templates (e.g., JSON schemas) struggle to handle complex dependencies where the output of one tool must be processed before becoming the input for another
LLMs often fail when in-context examples are missing or incomplete, limiting their ability to use new tools 'in the wild'

Concrete Example: A task might require retrieving a movie's credits using a unique ID found via a search tool. Current methods struggle to pass the search output to the credit tool without explicit manual examples. AutoTools handles this by writing a Python script that stores the search result in a variable and passes it to the next function.
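A minimal sketch of the chained call described in the example above. The function names `search_movie` and `get_movie_credits`, and the in-memory data behind them, are hypothetical stand-ins for encapsulated tools; real wrappers would issue HTTP requests.

```python
# Hypothetical in-memory data standing in for a remote movie API.
_MOVIE_IDS = {"Example Movie": 101}
_CREDITS = {101: ["Actor A", "Actor B"]}


def search_movie(title: str) -> dict:
    """Hypothetical search tool: return the first hit for a title."""
    return {"id": _MOVIE_IDS[title], "title": title}


def get_movie_credits(movie_id: int) -> list:
    """Hypothetical credits tool: return the cast list for a movie ID."""
    return _CREDITS[movie_id]


# The kind of program an agent might write: store the search result
# in a variable, then pass the extracted ID to the next tool.
hit = search_movie("Example Movie")
movie_id = hit["id"]
cast = get_movie_credits(movie_id)
print(cast)
```

The intermediate variable `movie_id` is what a fixed JSON template has no natural way to express: the second call depends on a value that only exists after the first call returns.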

Key Novelty

AutoTools Framework

Self-Encapsulation: Instead of human-written wrappers, the LLM reads raw documentation and writes its own Python function wrappers, including docstrings
Integration Verification: The model generates its own test cases to verify these functions work, checking for runtime errors and input-output dependencies
Tool Programming: The agent solves tasks by writing executable Python code that calls these self-generated functions, allowing for logic like loops and variable storage

Architecture

The complete AutoTools framework, split into the Tool Encapsulation stage (converting docs to functions) and the Tool Programming stage (solving queries).

Evaluation Highlights

AutoTools with GPT-4 achieves a 64.1 Pass Rate on the ToolBench benchmark, outperforming ToolLLM (Llama-2-7B), which scored 56.8
On the new AutoTools-Eval benchmark, the proposed AutoTools-L-13B model achieves a 57.6 Pass Rate, surpassing GPT-3.5-Turbo (51.8)
The method is highly efficient, using significantly fewer tokens than ReAct or ToolLLM baselines while maintaining higher accuracy

Breakthrough Assessment

8/10

Significant shift from manual tool definition to fully automated encapsulation and verification. The move to programmatic interaction (Python) over JSON parsing for tool use addresses key flexibility bottlenecks.

⚙️ Technical Details

Problem Definition

Setting: Given a user query q and a set of raw tool documentation D, generate an executable solution s that invokes tools to answer q.

Inputs: Natural language query q, raw tool documentation D (e.g., from RapidAPI)

Outputs: Executable Python program s, final answer r

Pipeline Flow

Documentation Input → Tool Encapsulation (LLM) → Syntax Check (AST) → Integration Verification (Runtime Test) → Verified Function Library
User Query + Function Library → Tool Programming (LLM generates Python code) → Execution Environment → Final Answer

System Modules

Tool Encapsulator (Tool Encapsulation)

Converts raw documentation into Python functions

Model or implementation: LLM (e.g., GPT-4 or AutoTools-L-13B)
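To illustrate what the encapsulator's output might look like, here is a sketch of a documentation entry and the wrapper an LLM could write for it. The documentation fields, the endpoint `https://api.example.com/search/movie`, and the function name are all hypothetical. To stay runnable offline, the wrapper returns a description of the request instead of sending it.

```python
from urllib.parse import urlencode

# Hypothetical raw documentation entry; field names are illustrative only.
RAW_DOC = {
    "name": "search_movie",
    "method": "GET",
    "url": "https://api.example.com/search/movie",
    "params": {"query": "string, required. Title to search for."},
}


def search_movie(query: str) -> dict:
    """Search for movies by title.

    Args:
        query: Title to search for (required).

    Returns:
        A dict describing the HTTP request (method and full URL).
        A real wrapper would send the request and return the response.
    """
    if not query:
        raise ValueError("query is required")
    query_string = urlencode({"query": query})
    return {"method": RAW_DOC["method"], "url": RAW_DOC["url"] + "?" + query_string}


request = search_movie("Example Movie")
print(request["url"])
```

The docstring matters as much as the body here: in the programming stage it is the only description of the tool the model sees.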

Syntax Checker (Tool Encapsulation)

Validates code structure

Model or implementation: Python AST compiler
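A syntax check of this kind can be sketched with Python's standard `ast` module. Using `ast.parse` and requiring at least one top-level function definition are assumptions made for illustration; the summary only states that an AST-based check is used.

```python
import ast


def is_valid_function(source: str) -> bool:
    """Return True if `source` parses and defines a top-level function."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    return any(isinstance(node, ast.FunctionDef) for node in tree.body)


good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b)\n    return a + b\n"  # missing colon after the signature

print(is_valid_function(good), is_valid_function(bad))
```

Parsing never executes the candidate code, so this check is cheap and free of side effects; runtime behavior is left to the verification step.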

Integration Verifier (Tool Encapsulation)

Checks runtime correctness and dependencies

Model or implementation: LLM + Python Executor
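The runtime check can be sketched as a small harness that calls a candidate wrapper on test inputs and records any exception or unexpected output. In AutoTools the test cases are generated by the LLM; here they are hand-written, and the helper `verify` and both example tools are hypothetical.

```python
def verify(func, test_cases):
    """Run `func` on each (kwargs, check) pair; return (passed, errors)."""
    errors = []
    for kwargs, check in test_cases:
        try:
            result = func(**kwargs)
        except Exception as exc:  # any runtime error fails the case
            errors.append(f"{kwargs}: raised {type(exc).__name__}: {exc}")
            continue
        if not check(result):
            errors.append(f"{kwargs}: unexpected output {result!r}")
    return not errors, errors


def search_movie(query):
    """Hypothetical working wrapper."""
    return {"id": 101, "title": query}


def broken_search(query):
    """Hypothetical faulty wrapper: indexes a string as if it were a dict."""
    return {"title": query["name"]}


# Dependency check: a downstream tool needs an integer "id" in the output.
cases = [({"query": "Example Movie"}, lambda out: isinstance(out.get("id"), int))]

ok, ok_errors = verify(search_movie, cases)
bad, bad_errors = verify(broken_search, cases)
print(ok, bad, bad_errors)
```

Only wrappers that pass would be added to the function library; the collected error strings could be fed back to the model for a repair attempt, although that feedback loop is an assumption here.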

Program Generator

Solves user query using verified functions

Model or implementation: LLM
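A sketch of how a generated program might be executed against the verified function library. The program text below is hand-written to stand in for LLM output, the tools and their data are hypothetical, and reading the result from a `final_answer` variable is an assumed convention. Note that plain `exec` provides no sandboxing.

```python
# Hypothetical verified tools backed by in-memory data.
_IDS = {"Movie A": 1, "Movie B": 2}
_CREDITS = {1: ["Actor A", "Actor B"], 2: ["Actor C"]}


def search_movie(title):
    return {"id": _IDS[title], "title": title}


def get_movie_credits(movie_id):
    return _CREDITS[movie_id]


# Stand-in for a program the LLM would generate for the query
# "How many cast members does each of these movies have?"
GENERATED_PROGRAM = """
counts = {}
for title in ["Movie A", "Movie B"]:
    movie_id = search_movie(title)["id"]
    counts[title] = len(get_movie_credits(movie_id))
final_answer = counts
"""


def run_program(program, library):
    """Execute `program` with the library functions in scope."""
    namespace = dict(library)  # copy so the program cannot alter the library
    exec(program, namespace)   # NOT sandboxed; illustration only
    return namespace.get("final_answer")


library = {"search_movie": search_movie, "get_movie_credits": get_movie_credits}
answer = run_program(GENERATED_PROGRAM, library)
print(answer)
```

The loop and the `counts` dictionary show the kind of control flow and state that a single JSON tool call cannot carry.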

Novel Architectural Elements

Two-stage pipeline separating tool definition (Encapsulation) from tool usage (Programming)
Self-verification loop (Integration Verification) where the model generates its own integration tests to validate tool wrappers before deployment

Modeling

Base Model: Llama-2-13B (for AutoTools-L-13B variant)

Training Method: Multi-task Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Minimize negative log-likelihood of the target tokens.

Formally: Standard language modeling loss $\mathcal{L} = -\sum_{t} \log P(y_t \mid y_{<t}, x)$
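As a worked instance of the loss above, the sketch below sums the negative log-probabilities that a model assigns to each target token. The probabilities are made-up illustrative values, not numbers from the paper.

```python
import math


def sequence_nll(token_probs):
    """Negative log-likelihood: -sum_t log P(y_t | y_<t, x)."""
    return -sum(math.log(p) for p in token_probs)


# Made-up probabilities of the correct token at three target positions.
probs = [0.5, 0.25, 0.125]
loss = sequence_nll(probs)

# -(ln 1/2 + ln 1/4 + ln 1/8) = ln 64 = 6 * ln 2, roughly 4.159
print(round(loss, 3))
```

In practice, frameworks usually average this sum over the target tokens and compute it only on the response portion of each training instance; whether AutoTools masks the prompt tokens is not stated in this summary.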

Adaptation: Full fine-tuning

Trainable Parameters: All parameters (13B)

Training Data:

34k instances total synthesized from RestBench and ToolBench
Task 1: Tool Understanding (Documentation -> Function)
Task 2: Relevance Learning (Query -> Relevant Tools)
Task 3: Function Learning (Query + Functions -> Python Program)

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 128
epochs: 3
max_seq_length: 4096
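The hyperparameters above can be collected into a small config object to estimate the number of optimizer steps. The class name `TrainingConfig` is hypothetical, and the estimate assumes exactly 34,000 training instances, that 128 is the global batch size, and that the final partial batch is kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical container for the reported hyperparameters."""
    learning_rate: float = 2e-5
    batch_size: int = 128  # assumed to be the global batch size
    epochs: int = 3
    max_seq_length: int = 4096


def total_steps(num_examples: int, cfg: TrainingConfig) -> int:
    """Optimizer steps over all epochs, keeping the last partial batch."""
    steps_per_epoch = math.ceil(num_examples / cfg.batch_size)
    return steps_per_epoch * cfg.epochs


cfg = TrainingConfig()
steps = total_steps(34_000, cfg)  # ceil(34000 / 128) = 266 per epoch
print(steps)
```

Under these assumptions the run takes 798 optimizer steps, which makes it a comparatively short fine-tuning job.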

Compute: Training performed on 8 NVIDIA A800 GPUs

Comparison to Prior Work

vs. ToolLLM: AutoTools uses programmatic (Python) interaction instead of JSON/DFS, and automates tool wrapping rather than using pre-processed APIs
vs. RestGPT: AutoTools pre-compiles tools into functions with verification, whereas RestGPT parses raw documentation on-the-fly during inference (slower, less reliable)
vs. Gorilla [not cited in paper]: Gorilla fine-tunes for API calls but relies on static retrieval; AutoTools dynamically encapsulates new tools from documentation 'in the wild'

Limitations

Dependency on the quality of raw documentation; extremely poor or missing documentation may fail the encapsulation stage
Runtime verification requires actually executing the tools, which can cause side effects (e.g., deleting data) if the execution is not properly sandboxed; the paper does not explicitly discuss this risk
Latency of the encapsulation stage could be high for very large initial toolsets, though it is a one-time cost

Reproducibility

Code: https://github.com/Ren-Research/AutoTools

The authors also released the 34k synthetic training instances. Specific prompt templates are provided in the paper's Appendix.

📊 Experiments & Results

Evaluation Setup

Tool-use capabilities evaluated on existing benchmarks and a new, harder benchmark focusing on complex dependencies.

Benchmarks:

RestBench (Real-world RESTful API usage)
ToolBench (Large-scale instruction tuning benchmark for tools)
AutoTools-Eval (Complex tool usage with strong input-output dependencies) [New]

Metrics:

Pass Rate (Success Rate)
Win Rate (vs. ChatGPT)
Statistical methodology: Not explicitly reported in the paper
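The Pass Rate metric listed above reduces to a simple percentage once each query has been judged solved or unsolved. The sketch below assumes boolean per-query outcomes; how each benchmark decides success is not specified in this summary.

```python
def pass_rate(outcomes):
    """Percentage of queries judged as successfully solved."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


# Illustrative outcomes for four queries (three solved, one failed).
outcomes = [True, True, False, True]
rate = pass_rate(outcomes)
print(rate)
```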

Key Results

Performance on standard benchmarks (ToolBench & RestBench) showing superiority over baselines:

Benchmark	Metric	Baseline	This Paper	Δ
ToolBench	Pass Rate	56.8	64.1	+7.3
RestBench	Pass Rate	59.6	82.5	+22.9

Performance on the newly constructed, more challenging AutoTools-Eval benchmark:

Benchmark	Metric	Baseline	This Paper	Δ
AutoTools-Eval	Pass Rate	51.8	57.6	+5.8
AutoTools-Eval	Pass Rate	44.6	57.6	+13.0
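The Δ column in the results above is the absolute difference in Pass Rate points between this paper and the baseline. A short check of that arithmetic, using only the values reported in the table:

```python
# (benchmark, baseline pass rate, this paper's pass rate), from the table.
rows = [
    ("ToolBench", 56.8, 64.1),
    ("RestBench", 59.6, 82.5),
    ("AutoTools-Eval", 51.8, 57.6),
    ("AutoTools-Eval", 44.6, 57.6),
]

# Round to one decimal to avoid floating-point noise in the subtraction.
deltas = [round(ours - base, 1) for _, base, ours in rows]
print(deltas)
```

These are differences in percentage points, not relative improvements.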

Experiment Figures

Pass rates of different models on the AutoTools-Eval benchmark.

Main Takeaways

Programmatic interaction (generating Python code) significantly outperforms JSON/text-based tool use, especially for tasks requiring parameter reuse or logic flow.
The automatic encapsulation stage allows models to verify tools *before* using them, filtering out hallucinated or incorrect tool usages that plague other methods.
Fine-tuning (AutoTools-Learning) effectively distills the capabilities of larger models into smaller open-source models (13B), enabling them to outperform larger baselines like GPT-3.5-Turbo on complex tasks.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and In-Context Learning
Familiarity with API documentation structures (endpoints, arguments)
Basic knowledge of Python programming and Abstract Syntax Trees (AST)

Key Terms

Tool Encapsulation: The process of converting raw, text-based API documentation into a structured, callable code function (e.g., a Python function def)

Tool Programming: A method where the LLM solves tasks by writing executable code (programs) rather than just generating text or JSON, allowing for loops and variable manipulation

AST: Abstract Syntax Tree—a tree representation of the syntactic structure of source code, used here to check if generated functions are syntactically valid

ReAct: Reason+Act—a prompting technique where models alternate between reasoning traces and action generation (tool calls)

Pass Rate: The percentage of test cases (queries) that are successfully solved by the model

Integration Verification: A proposed validation step where the LLM generates test inputs (potentially using other tools) to run a newly created function and ensure it works before adding it to the library

SFT: Supervised Fine-Tuning—training a model on a specific dataset to improve its performance on a target task

JSON: JavaScript Object Notation—a standard text-based format for representing structured data, commonly used in prior work for formatting tool calls