API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

📝 Paper Summary

Benchmark datasets Tool-use post-training Multi-call tool use with flexible plan

API-Bank is a benchmark for evaluating tool-augmented LLMs that measures planning, retrieving, and calling abilities, accompanied by a multi-agent synthetic data generation method to train a capable model called Lynx.

Core Problem

Existing LLM benchmarks lack a comprehensive evaluation of tool usage capabilities (planning, retrieving, calling) and often rely on limited or unrealistic API sets.

Why it matters:

Current LLMs (like GPT-3) struggle with API usage without specific instruction tuning, limiting their ability to interact with the real world.
Manual annotation of diverse tool-use dialogues is prohibitively expensive ($8 per dialogue), hindering the creation of large-scale training sets.
There is a lack of standardized metrics to measure the gap between open-source models and state-of-the-art closed models (like GPT-4) in complex tool-use scenarios.

Concrete Example: When a user asks a complex question requiring multiple steps (e.g., 'Check my calendar and book a flight if free'), a standard LLM might hallucinate a response or fail to sequence the APIs correctly. In API-Bank, the model must first retrieve the 'Calendar' API, check availability, then retrieve the 'Flight' API and book it, managing dependencies between these calls.

Key Novelty

API-Bank Benchmark & Lynx Model

Defines a three-level evaluation grading system for tool use: Call (standard slot filling), Retrieval+Call (finding the right tool), and Plan+Retrieval+Call (multi-step reasoning).
Introduces a 'Multi-agent' data generation pipeline where five LLM agents collaborate to synthesize diverse domains, APIs, and dialogues, reducing annotation costs by 98%.
Implements an executable evaluation system with 73 real APIs and databases to measure correctness based on actual execution results rather than just text matching.

Architecture

The Multi-agent data generation pipeline used to create the training set.

Evaluation Highlights

Lynx (initialized from Alpaca-7B) achieves 49.87% accuracy on API calls, surpassing Alpaca-7B by ~26 percentage points and approaching GPT-3.5-turbo (59.40%).
GPT-4 significantly outperforms GPT-3.5 on the hardest 'Plan+Retrieve+Call' task (70.00% vs 22.00% accuracy), showing superior reasoning capabilities.
The Multi-agent data generation method reduces cost to $0.10 per dialogue (vs. $8 for human annotation) while maintaining a 94% data availability rate.

Breakthrough Assessment

8/10

A comprehensive benchmark that addresses the critical gap in evaluating tool-augmented LLMs. The executable environment and multi-level ability grading set a new standard, though the primary model (Lynx) is a 7B fine-tune rather than a new architecture.

⚙️ Technical Details

Problem Definition

Setting: Tool-augmented dialogue generation where an LLM must interact with an external API executor to fulfill user queries.

Inputs: User query history and a pool of potential API definitions (or access to an API search tool).

Outputs: Correct API call sequences (including parameters) and a final natural language response.

Pipeline Flow

User Query
LLM Planning/Retrieval (decides if API is needed)
API Search (if API not in context)
API Call Generation (LLM generates JSON)
Execution (System runs API against database/mock)
Response Generation (LLM incorporates API output)

System Modules

Evaluation System

Manages 73 implemented APIs and databases; executes calls and returns results.

Model or implementation: Python-based Execution Engine

Lynx

The tool-augmented LLM being evaluated.

Model or implementation: Fine-tuned Alpaca-7B

Novel Architectural Elements

Multi-agent data synthesis pipeline: 5 agents (Domain Generator, API Generator, Query Generator, Dialogue Generator, Tester) working sequentially to create training data.

Modeling

Base Model: Alpaca-7B (based on LLaMA-7B)

Training Method: Supervised Fine-Tuning (SFT)

Training Data:

1,888 tool-use dialogues generated by the Multi-agent pipeline.
Spans 1,000 distinct domains and 2,138 APIs.

Key Hyperparameters:

epochs: 3
batch_size: 256
learning_rate: 2e-5

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolAlpaca: API-Bank has significantly more domains (1,000 vs 50) and a runnable evaluation system (ToolAlpaca relies mostly on simulated text)
vs. APIBench: API-Bank includes multi-turn dialogues and state-tracking databases, whereas APIBench focuses largely on single-turn API retrieval/calling correctness

Limitations

Benchmark focuses solely on English; other languages are not supported.
Evaluation is limited to 7B parameter models for fine-tuning (Lynx), larger scale fine-tuning not explored.
API Search retrieval mechanism relies on keyword similarity, which may be simplistic compared to dense retrieval.
Requires ground truth annotation for 'correctness', which is expensive to scale for the test set.

Reproducibility

Code: https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/api-bank

publicly available (https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/api-bank). Includes the API-Bank data (training/test), the Lynx model weights (delta), and evaluation scripts. 73 APIs are implemented.

📊 Experiments & Results

Evaluation Setup

Interact with a Python-based system implementing 73 APIs. Models must generate correct API calls that successfully execute against a database.

Benchmarks:

API-Bank Evaluation System (Tool-use Dialogue) [New]

Metrics:

Accuracy (Correctness of API calls)
ROUGE-L (Quality of final response)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main results comparing Lynx to baselines on API-Bank across three difficulty levels (Call, Retrieve+Call, Plan+Retrieve+Call).
API-Bank (Total)	Accuracy	15.19%	39.58%	+24.39%
API-Bank (Total)	Accuracy	47.16%	39.58%	-7.58%
API-Bank (Plan+Retrieve+Call)	Accuracy	22.00%	20.00%	-2.00%
API-Bank (Plan+Retrieve+Call)	Accuracy	70.00%	20.00%	-50.00%
ToolAlpaca Evaluation	Accuracy (Call)	53.88%	54.64%	+0.76%

Main Takeaways

Basic LLMs (Alpaca, ChatGLM) have rudimentary tool-use ability (~20%) but fail at planning/retrieval.
GPT-3 Davinci fails almost completely (0.57% accuracy), suggesting instruction tuning is crucial for tool use.
GPT-4 excels at planning and multi-step reasoning (Plan+Retrieve+Call), significantly outperforming GPT-3.5.
Synthetic data generation via Multi-agent collaboration is highly effective, allowing a 7B model (Lynx) to approach GPT-3.5 performance.
Major error sources include 'No API Call' (for base models) and 'API Hallucination' (for fine-tuned models like Lynx).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and instruction tuning
Familiarity with API structures (JSON, input/output parameters)
Basic concepts of dialogue systems and slot filling

Key Terms

API-Bank: A benchmark for tool-augmented LLMs containing an evaluation system with 73 APIs and a training set of 1,888 dialogues.

Lynx: A tool-augmented LLM fine-tuned from Alpaca-7B using the API-Bank training dataset.

Multi-agent: A data generation method proposed in this paper where 5 specialized LLM agents generate domains, APIs, queries, and dialogues.

Plan+Retrieve+Call: The most complex evaluation setting where the model must plan a sequence of steps, search for unknown APIs, and execute them.

ROUGE-L: A metric used to evaluate the quality of the final natural language response by comparing it to a reference.

Alpaca: An instruction-tuned version of the LLaMA-7B model.

Hallucination: In this context, when the model generates an API call for a tool that does not exist or was not retrieved.