ToolBench: On the Tool Manipulation Capability of Open-source LLMs

📝 Paper Summary

Multi-call tool use with fixed plan
Tool-use post-training

The paper boosts open-source LLMs' tool use capabilities to rival GPT-4 by aligning them with programmatically generated API data, using retrieval-based demonstrations, and enforcing code-only generation.

Core Problem

Open-source LLMs severely lag behind closed APIs like GPT-4 in tool manipulation, often failing to select correct APIs, populating wrong arguments, or generating non-executable text.

Why it matters:

Relying on closed APIs (GPT-4) poses security and privacy risks for enterprise adoption, as internal workflows must be exposed to external services
The performance gap is large: LLaMA fails completely (0% success) on the Home Search tool, where GPT-4 achieves 77%, which hinders open-source adoption
Open-source models struggle with API hallucinations and argument formatting without specific alignment, unlike closed models that seemingly internalize this during training

Concrete Example: When asked 'how to move a robot to (20, 30)?', an open-source model might hallucinate 'robot.raise_arm(20)' (wrong API) or 'robot.move_to(30, 20)' (wrong arguments), whereas GPT-4 correctly generates 'robot.move_to(20, 30)'.
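The three failure modes named above (wrong API, wrong arguments, non-executable text) can be told apart mechanically. The following is a minimal sketch, not the paper's evaluation code: it assumes each generated action is a single Python-style call, and `parse_call` and `classify` are hypothetical helper names introduced for illustration.

```python
import ast


def parse_call(code: str):
    """Return (dotted function name, dumped positional args) for one call expression."""
    node = ast.parse(code.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single call expression")
    name = ast.unparse(node.func)  # requires Python 3.9+
    args = [ast.dump(arg) for arg in node.args]
    return name, args


def classify(generated: str, reference: str) -> str:
    """Label a generated action relative to a reference action (hypothetical labels)."""
    try:
        gen_name, gen_args = parse_call(generated)
    except (SyntaxError, ValueError):
        return "non_executable"  # e.g. free-form text instead of code
    ref_name, ref_args = parse_call(reference)
    if gen_name != ref_name:
        return "wrong_api"
    if gen_args != ref_args:
        return "wrong_arguments"
    return "correct"


reference = "robot.move_to(20, 30)"
for output in ["robot.raise_arm(20)", "robot.move_to(30, 20)",
               "Sure! You can call robot.move_to(20, 30)", "robot.move_to(20, 30)"]:
    print(output, "->", classify(output, reference))
```

Comparing parsed calls rather than raw strings keeps whitespace differences from being counted as errors; a real harness would instead execute the call against the tool.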

Key Novelty

Programmatic Alignment & Retrieval for Tool Use (ToolBench Recipe)

Bootstrap training data programmatically by creating a few templates per tool and filling them with random values, requiring minimal human effort (approx. 1 day per tool)
Align open-source models on this synthetic data to internalize API signatures and usage patterns
Augment inference with a 'demonstration retriever' that fetches relevant few-shot examples based on goal similarity, handling unseen API combinations

Architecture

The tool manipulation setup illustrating both single-step and multi-step scenarios.

Evaluation Highlights

Boosted open-source LLMs (LLaMA-30b, StarCoder) achieve competitive or better success rates than GPT-4 in 4 out of 8 ToolBench tasks
Enhanced LLaMA-30b improves from 0% to 87% success rate on the Home Search task, narrowing the gap with enhanced GPT-4 (98%)
Programmatic alignment and retrieval techniques boost open-source success rates by up to 90 percentage points compared to out-of-the-box zero-shot performance

Breakthrough Assessment

7/10

Significant for demonstrating that open-source models can bridge the gap to GPT-4 in tool use with low-cost synthetic data, though they still struggle on reasoning-heavy tasks.

⚙️ Technical Details

Problem Definition

Setting: Map a natural language goal 'g' to a sequence of API calls 'Cg' using an action generator 'A' augmented with API documentation 'D'

Inputs: Natural language goal g, API documentation D, optional information O

Outputs: Executable sequence of API calls Cg (e.g., 'robot.move_to(20, 30)')

Pipeline Flow

Input Goal Processing
Demonstration Retrieval
Action Generation

System Modules

Demonstration Retriever

Retrieve relevant API usage examples based on the input goal

Model or implementation: BM25 (off-the-shelf retriever implementation)
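To make the retrieval step concrete, here is a minimal from-scratch sketch of Okapi BM25 ranking over demonstration goals. It is an illustration under assumptions, not the paper's implementation: the `BM25Retriever` class, the `k1` and `b` defaults, the tokenizer, and the demonstration pool are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Retriever:
    """Rank (goal, api_calls) demonstrations by BM25 score against a query goal."""

    def __init__(self, demos, k1=1.5, b=0.75):
        self.demos = demos
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(goal)) for goal, _ in demos]
        self.lengths = [sum(doc.values()) for doc in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        n = len(self.docs)
        # Document frequency: number of demonstration goals containing each term.
        df = Counter(term for doc in self.docs for term in doc)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query_terms, i):
        doc, length = self.docs[i], self.lengths[i]
        total = 0.0
        for term in query_terms:
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def retrieve(self, goal, top_k=3):
        query = tokenize(goal)
        ranked = sorted(range(len(self.demos)),
                        key=lambda i: self.score(query, i), reverse=True)
        return [self.demos[i] for i in ranked[:top_k]]


# Hypothetical demonstration pool; the API names are made up for illustration.
demos = [
    ("what is the weather in Paris today", "weather.current(city='Paris')"),
    ("book a flight from Boston to Denver", "trip.book_flight('Boston', 'Denver')"),
    ("find houses in Seattle under 500000", "home.search(city='Seattle', max_price=500000)"),
]
retriever = BM25Retriever(demos)
print(retriever.retrieve("what's the weather forecast in Tokyo", top_k=1))
```

Because BM25 matches on shared terms, it favors demonstrations whose goals are worded like the query, which is cheap but does not capture paraphrases.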

Action Generator

Generate the specific API calls to fulfill the goal

Model or implementation: Fine-tuned Open-Source LLM (LLaMA-30b, StarCoder, or CodeGen)
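The action generator's input combines a system prompt, the API documentation, retrieved demonstrations, and the goal. A minimal sketch of such prompt assembly follows; the exact prompt wording and layout used in the paper are not given in this summary, so `SYSTEM_PROMPT`, `build_prompt`, and the `Task:`/`Action:` markers are assumptions.

```python
# Hypothetical wording; it only illustrates the "code-only generation" idea.
SYSTEM_PROMPT = "Respond with executable API calls only. Do not add explanations."


def build_prompt(goal, api_docs, demos, system_prompt=SYSTEM_PROMPT):
    """Assemble system prompt, API docs, few-shot demonstrations, and the goal."""
    parts = [system_prompt, "", "API documentation:", api_docs.strip(), ""]
    for demo_goal, demo_calls in demos:
        parts += [f"Task: {demo_goal}", f"Action: {demo_calls}", ""]
    # The prompt ends on an open "Action:" slot for the model to complete.
    parts += [f"Task: {goal}", "Action:"]
    return "\n".join(parts)


prompt = build_prompt(
    goal="move the robot to (20, 30)",
    api_docs="robot.move_to(x, y): move the robot to coordinates (x, y)",
    demos=[("move the robot to (5, 7)", "robot.move_to(5, 7)")],
)
print(prompt)
```

Ending the prompt on an open `Action:` line nudges the model to continue with code rather than prose, complementing the system prompt.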

Novel Architectural Elements

Integration of a lightweight demonstration retriever that needs only O(n) examples (linear in the number of APIs) to generalize to an exponential number of API combinations
Programmatic data generation pipeline using templates to fine-tune models specifically for tool usage without massive manual labeling
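The contrast between O(n) demonstrations and exponentially many API combinations can be made concrete with a small count. This sketch assumes, for illustration only, that a goal uses an unordered set of distinct APIs; `num_api_combinations` is a hypothetical helper.

```python
from math import comb


def num_api_combinations(n_apis, max_calls):
    """Number of non-empty sets of at most max_calls distinct APIs out of n_apis."""
    return sum(comb(n_apis, k) for k in range(1, max_calls + 1))


n = 10
print("demonstrations needed (one per API):", n)
print("combinations of up to 3 APIs:", num_api_combinations(n, 3))   # 10 + 45 + 120 = 175
print("combinations of any size:", num_api_combinations(n, n))       # 2**10 - 1 = 1023
```

Covering every combination with its own demonstration would therefore scale poorly, which is why retrieving a few per-API examples and letting the model compose them is attractive.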

Modeling

Base Model: LLaMA-30b, StarCoder, CodeGen-16B-mono

Training Method: Instruction Tuning (Supervised Fine-Tuning)

Adaptation: Full fine-tuning (implied, as specific adapter methods aren't mentioned)

Training Data:

Programmatically generated using templates
Requires ~100 templates per tool
Templates contain placeholders filled by random value pools
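The template-filling idea above can be sketched in a few lines. This is an illustration under assumptions rather than the released pipeline: the two templates, the value pools, and `generate_examples` are hypothetical, and the robot API is borrowed from the summary's running example.

```python
import random

# Each template pairs a goal pattern with the API call pattern it should map to.
TEMPLATES = [
    ("Move the robot to ({x}, {y}).", "robot.move_to({x}, {y})"),
    ("Please drive the robot to x={x} and y={y}.", "robot.move_to({x}, {y})"),
]
# Pools of values from which placeholders are filled at random.
VALUE_POOLS = {"x": range(0, 100), "y": range(0, 100)}


def generate_examples(templates, pools, n, seed=0):
    """Sample n (goal, calls) training pairs by filling templates with pool values."""
    rng = random.Random(seed)  # seeded for reproducible datasets
    examples = []
    for _ in range(n):
        goal_template, call_template = rng.choice(templates)
        values = {name: rng.choice(list(pool)) for name, pool in pools.items()}
        examples.append({
            "goal": goal_template.format(**values),
            "calls": call_template.format(**values),
        })
    return examples


for example in generate_examples(TEMPLATES, VALUE_POOLS, n=3):
    print(example)
```

Because the goal and the call are filled from the same sampled values, every generated pair is correct by construction, so no manual labeling is needed beyond writing the templates.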

Key Hyperparameters:

Computational requirements: not reported in the paper

Comparison to Prior Work

vs. Toolformer: Toolformer relies on massive self-play/filtering; ToolBench uses explicit programmatic supervision and retrieval
vs. Auto-GPT: Auto-GPT uses chaining on closed APIs; ToolBench focuses on enabling open-source models to perform single/multi-step actions
vs. OpenAI GPT-4: Comparison target; ToolBench aims to match its zero-shot/few-shot performance using fine-tuned open models

Limitations

Open-source models still struggle significantly with 'advanced reasoning' tasks (e.g., Google Sheets, Tabletop) compared to GPT-4
Requires per-tool engineering (templates, value pools) for the data generation process (approx. 1 day per tool)
Evaluation is limited to the 8 tools in the ToolBench suite
The retrieval mechanism is simple (BM25, which matches on shared terms) and may not handle complex semantic matching for very large API pools

Reproducibility

Code: https://github.com/sambanova/toolbench

Code and benchmark are available at the repository above. The paper describes the data generation process (templates + random values) but does not provide the exact training hyperparameters (learning rate, batch size, etc.) or the final model weights.

📊 Experiments & Results

Evaluation Setup

Tool manipulation using API calls, evaluated via execution success rate

Benchmarks:

ToolBench (tool manipulation via API call generation) [New]

Metrics:

Success Rate (Execution-based)
Reward (for WebShop)
Executability
Longest Common Subsequence (LCS)
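For the LCS metric, a minimal sketch of comparing a predicted API-call sequence with a reference sequence is shown below. The dynamic program is the standard one; normalizing by the reference length in `lcs_score` is an assumption, since this summary does not state how the paper normalizes the value.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_score(predicted_calls, reference_calls):
    """LCS length divided by reference length (assumed normalization)."""
    if not reference_calls:
        return 1.0 if not predicted_calls else 0.0
    return lcs_length(predicted_calls, reference_calls) / len(reference_calls)


predicted = ["sheet.open('a')", "sheet.sort('B')", "sheet.save()"]
reference = ["sheet.open('a')", "sheet.filter('B')", "sheet.save()"]
print(lcs_score(predicted, reference))  # 2 of 3 reference calls matched in order
```

Unlike execution-based success rate, this gives partial credit when some calls in a multi-step action are right and in the right order.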

Key Results

Enhanced open-source models (tuned + retriever + system prompt) show massive gains over zero-shot baselines and become competitive with GPT-4 on simpler tasks.

Benchmark	Metric	Baseline	This Paper	Δ
Home Search	Success Rate	0.0	87.0	+87.0
Open Weather	Success Rate	39.0	100.0	+61.0
Trip Booking	Success Rate	0.0	85.8	+85.8
WebShop	Reward	22.0	31.0	+9.0
Google Sheets	Success Rate	5.9	21.2	+15.3

Ablation studies reveal that model alignment (fine-tuning) is the most critical component for performance.

Benchmark	Metric	Baseline	This Paper	Δ
Average across tasks (Task Count)	Tasks Improved	0	-5	-5

Experiment Figures

Comparison of API selection accuracy between GPT-4 and open-source models without documentation (left) and impact of one-shot demonstration on OpenWeather (right).

The programmatic training data generation process.

Main Takeaways

Open-source LLMs can be boosted to match GPT-4 on specific tool use tasks using a combination of synthetic data alignment, retrieval, and system prompts.
Model alignment (fine-tuning) is the most impactful factor, addressing API selection and argument population failures.
In-context demonstration retrieval is essential for generalizing to unseen API combinations, and it requires only O(n) examples (linear in the number of APIs).
A significant gap remains on complex reasoning tasks (Google Sheets, Tabletop) where open-source models still lag behind GPT-4 even with enhancements.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs) and instruction tuning
Familiarity with API (Application Programming Interface) structures
Knowledge of few-shot in-context learning

Key Terms

ToolBench: The benchmark suite proposed in this paper containing 8 diverse software tools for evaluating tool manipulation capabilities

Action Generator: The LLM component responsible for translating natural language goals into executable API code

Demonstration Retriever: A module that selects relevant in-context examples from a pool based on similarity to the current goal (BM25 in this paper)

API Complexity: A metric defined in the paper quantifying the difficulty of generalizing to unseen API combinations based on distance from demonstration examples

System Prompt: A fixed instruction prepended to the model input to regulate generation style (e.g., forcing code-only output)

Programmatic Data Generation: Creating training data by defining templates with placeholders and filling them with random values from a pool

LCS: Longest Common Subsequence, a metric used to evaluate sequence generation similarity