(APIBench) Gorilla: Large Language Model connected with Massive APIs

📝 Paper Summary

Multi-call tool use with flexible plan
Tool profiling

Gorilla is a LLaMA-based model fine-tuned on a massive dataset of API documentation (APIBench) to generate accurate, constraint-aware API calls while reducing hallucination.

Core Problem

LLMs struggle to effectively use tools via API calls, often hallucinating non-existent libraries or generating incorrect input arguments when faced with massive, changing API sets.

Why it matters:

Existing tool-use approaches rely on small, hand-coded sets of APIs in prompts, which cannot scale to the millions of changing cloud APIs needed for real-world tasks
Hallucinating API calls leads to runtime errors and unreliable system behavior, preventing LLMs from acting as the primary interface to computing infrastructure
LLMs must adapt to frequent updates in API documentation without requiring full retraining

Concrete Example: When asked to 'Help me find an API to convert the spoken language... using Torch Hub', GPT-4 hallucinates a non-existent model (torch.hub.load('snakers4/silero-models', 'asr')), and Claude selects the wrong library (Torchaudio), whereas Gorilla correctly identifies 'silero_sst'.

Key Novelty

Retriever-Aware Fine-Tuning for Massive API Usage

Constructs APIBench, a comprehensive dataset of ML APIs (TorchHub, TensorHub, HuggingFace) with synthetic instructions generated via Self-Instruct
Fine-tunes LLaMA-7B on instruction-API pairs augmented with retrieved API documentation, teaching the model to parse documentation to answer user queries
Evaluates using AST (Abstract Syntax Tree) sub-tree matching to verify functional correctness rather than just string matching

Architecture

The end-to-end pipeline for Gorilla, covering dataset collection, Self-Instruct fine-tuning, and inference modes (zero-shot vs. retrieval-augmented).

Evaluation Highlights

Gorilla (0-shot) achieves 83.79% accuracy on TensorHub, significantly outperforming GPT-4 (18.20%) and Claude (9.19%)
Reduces hallucination errors to near zero (0% on TorchHub with Oracle retriever) compared to GPT-4's 36.55% hallucination rate in zero-shot settings
Adapts to test-time API changes (e.g., version updates) via retrieval-aware training, unlike static models that fail when documentation evolves

Breakthrough Assessment

9/10

Pioneering work in scaling LLM tool use to massive numbers of APIs via fine-tuning rather than just prompting. Establishes a major benchmark (APIBench) and demonstrates superior performance over GPT-4.

⚙️ Technical Details

Problem Definition

Setting: Given a natural language query and a large set of potential APIs, generate the correct executable API call (including arguments) that satisfies the query constraints.

Inputs: Natural language user prompt + (Optional) Retrieved API documentation JSON

Outputs: Executable Python code snippet invoking the correct API

Pipeline Flow

User Query → Retriever (BM25/GPT-Index) → Top-1 API Documentation
Concatenate: User Query + 'Use this API documentation...' + Retrieved Doc
Gorilla LLM (Fine-tuned LLaMA-7B) → Generated API Call
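
The flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the two documentation entries, the helper names `tokenize`, `bm25_scores`, and `build_prompt`, and the BM25 parameters `k1` and `b` are all hypothetical, and the instruction string is kept exactly as abbreviated in this summary.

```python
import json
import math
import re
from collections import Counter

# Hypothetical two-entry documentation store, for illustration only.
# The real APIBench entries are JSON documents with more fields.
API_DOCS = [
    {"api_call": "torch.hub.load('pytorch/vision', 'resnet50', pretrained=True)",
     "description": "image classification model trained on ImageNet"},
    {"api_call": "torch.hub.load('snakers4/silero-models', 'silero_sst')",
     "description": "speech to text model that transcribes spoken language"},
]


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document string against the query with Okapi BM25."""
    doc_tokens = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in doc_tokens) / len(doc_tokens)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))
    scores = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def build_prompt(query, api_docs=API_DOCS):
    """Retrieve the top-1 document and concatenate it onto the user query."""
    texts = [d["description"] + " " + d["api_call"] for d in api_docs]
    scores = bm25_scores(query, texts)
    top1 = api_docs[scores.index(max(scores))]
    # The exact instruction wording is elided in this summary, so it is
    # reproduced here as written rather than completed.
    return f"{query}\nUse this API documentation...\n{json.dumps(top1)}"


print(build_prompt("I want to convert spoken language in a recording to text"))
```

The resulting string would be passed to the fine-tuned model. In zero-shot mode the retrieval step is skipped and only the user query is sent.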

System Modules

Retriever

Fetch the most relevant API documentation from the database based on the user query

Model or implementation: BM25 or GPT-Index (text-davinci-003)

Gorilla

Generate the specific Python code to invoke the API using the retrieved documentation

Model or implementation: LLaMA-7B (Fine-tuned)

Novel Architectural Elements

Integration of retriever output directly into the instruction-tuning prompt structure to enable adaptation to documentation updates without retraining

Modeling

Base Model: LLaMA-7B

Training Method: Standard instruction fine-tuning

Objective Functions:

Purpose: Minimize the difference between generated API calls and ground truth code.

Formally: Standard Cross-Entropy Loss on tokens.
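
Written out, a token-level cross-entropy objective of this kind usually takes the form below. The notation is an assumption introduced here for illustration: $x$ is the prompt (user query plus any retrieved documentation), $y_1, \dots, y_T$ are the tokens of the ground-truth API call, and $\theta$ are the model parameters. Whether the loss is also computed over prompt tokens is not stated in this summary.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid x,\, y_{<t}\right)
```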

Adaptation: Full fine-tuning

Trainable Parameters: All parameters of LLaMA-7B

Training Data:

1,645 API calls from TorchHub, TensorHub, HuggingFace
Synthetic instructions generated via Self-Instruct (GPT-4)
10 instruction-API pairs per API -> ~16,450 total data points
Split: 90/10 train/eval for HuggingFace, 80/20 for others

Key Hyperparameters:

epochs: 5
learning_rate: 2e-5
scheduler: cosine decay
batch_size: Not reported in the paper
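
For reference, a cosine-decay schedule starting from the listed learning rate can be sketched as follows. The function name `cosine_lr`, the absence of warmup, the decay floor `min_lr=0.0`, and the total step count are assumptions made for illustration; the summary gives only the peak learning rate and the scheduler type.

```python
import math


def cosine_lr(step, total_steps, base_lr=2e-5, min_lr=0.0):
    """Learning rate at `step` under cosine decay from base_lr to min_lr."""
    # Clamp progress to [0, 1] so steps outside the range stay well-defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run of 1,000 optimizer steps.
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps=1000))
```

The rate starts at `base_lr`, passes through half of it at the midpoint, and reaches `min_lr` at the final step.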

Compute: Fine-tuned on 8x A100 GPUs (40GB memory each)

Comparison to Prior Work

vs. Toolformer: Focuses on massive, overlapping ML APIs (1600+) rather than a small set of generic tools (calculator, wiki)
vs. GPT-4: Fine-tunes explicitly for API generation with retrieval, significantly reducing hallucination compared to zero-shot prompting
vs. TaskMatrix.ai [not cited in paper]: Focuses on ML model APIs specifically using AST matching for verification, rather than general task completion

Limitations

Dataset focused exclusively on Machine Learning APIs (TorchHub, TensorHub, HuggingFace), potentially biasing results towards this domain
Performance heavily dependent on retriever quality; poor retrieval (BM25) can degrade performance below zero-shot
AST matching metric checks functional equivalence but does not execute code to verify runtime success

Reproducibility

Code: https://gorilla.cs.berkeley.edu

Code, model weights, and the APIBench dataset are publicly available at https://gorilla.cs.berkeley.edu. The paper details the dataset collection and filtering process (e.g., top 20 models per domain for HuggingFace).

📊 Experiments & Results

Evaluation Setup

Generate API calls for natural language queries across three domains: TorchHub, TensorHub, and HuggingFace.

Benchmarks:

APIBench (API Call Generation) [New]

Metrics:

AST Accuracy (Functional Correctness)
Hallucination Rate (Invoking non-existent tools)
Statistical methodology: Not explicitly reported in the paper

Key Results

Zero-shot performance (without retriever) demonstrates Gorilla's superior ability to internalize API knowledge compared to general-purpose LLMs.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
TensorFlow Hub (0-shot)	AST Accuracy	18.20	83.79	+65.59
TorchHub (0-shot)	AST Accuracy	38.70	59.13	+20.43
HuggingFace (0-shot)	Hallucination Rate	19.80	10.95	-8.85

Retrieval-augmented performance shows that while retrievers help, Gorilla adapts best to the retrieved context.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
TorchHub (Oracle Retriever)	AST Accuracy	66.12	67.20	+1.08
TorchHub (BM25 retriever, vs. Gorilla 0-shot)	AST Accuracy	59.13	40.32	-18.81

Experiment Figures

Scatter plot comparing Accuracy vs. Hallucination for various models and retrieval settings.

Visual explanation of the AST Sub-Tree Matching evaluation metric.

Main Takeaways

Gorilla outperforms GPT-4 in generating accurate API calls, particularly for TensorHub and TorchHub, significantly reducing hallucination.
Retriever-aware training allows the model to adapt to API changes (version updates) at test time without retraining, a critical capability for maintaining relevance.
While retrieval generally helps, suboptimal retrievers (like BM25) can confuse the model and degrade performance compared to zero-shot inference, highlighting the importance of retriever quality.
The model demonstrates an ability to reason about constraints (e.g., parameter size, accuracy) when selecting APIs, though performance drops as constraints become more complex.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs) and fine-tuning
Familiarity with API documentation structures (JSON, arguments)
Knowledge of Information Retrieval (IR) concepts (BM25, dense retrieval)
Understanding of Abstract Syntax Trees (AST) for code analysis

Key Terms

AST sub-tree matching: An evaluation method that parses generated code into a tree structure and checks if the relevant function call and arguments exist as a sub-tree of the reference solution, ignoring irrelevant formatting.
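
A simplified version of this check can be written with Python's standard `ast` module (Python 3.9+ for `ast.unparse`). The sketch below is an assumption-laden illustration rather than the paper's evaluation code: the helper names are hypothetical, only the first call in each snippet is inspected, and a match is declared when the reference call's function name, positional arguments, and keyword arguments all appear in the generated call. Extra optional arguments in the generated call are ignored.

```python
import ast


def parse_call(code):
    """Return the first function call found in `code` as an ast.Call node."""
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            return node
    raise ValueError("no function call found")


def call_signature(call):
    """Flatten a call into (dotted function name, positional args, keyword args)."""
    name = ast.unparse(call.func)
    # ast.unparse normalizes quoting and whitespace, so formatting is ignored.
    args = [ast.unparse(a) for a in call.args]
    kwargs = {kw.arg: ast.unparse(kw.value) for kw in call.keywords if kw.arg}
    return name, args, kwargs


def matches_reference(generated_code, reference_code):
    """True if every component of the reference call appears in the generated call."""
    gen_name, gen_args, gen_kwargs = call_signature(parse_call(generated_code))
    ref_name, ref_args, ref_kwargs = call_signature(parse_call(reference_code))
    if gen_name != ref_name:
        return False
    if gen_args[:len(ref_args)] != ref_args:
        return False
    return all(gen_kwargs.get(k) == v for k, v in ref_kwargs.items())


reference = "torch.hub.load('pytorch/vision', 'densenet121')"
generated = 'model = torch.hub.load("pytorch/vision",   "densenet121", pretrained=True)'
print(matches_reference(generated, reference))
```

Because both snippets are only parsed and never executed, the check runs without the referenced library installed, which is also why it cannot confirm runtime success.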

Self-Instruct: A framework for generating synthetic instruction-following data by prompting a strong LLM (like GPT-4) with a few seed examples to create diverse task instances.

Hallucination (API): When an LLM generates an API call that does not exist in the database or uses a non-existent library, distinct from using an existing API incorrectly.
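
The distinction between a hallucination and an ordinary error can be made concrete with a small lookup against an API database. Everything in this sketch is hypothetical: the two-entry database, the `(function, repository, model)` key format, and the helper names `call_key` and `classify`. It assumes the generated snippet is syntactically valid Python.

```python
import ast

# Hypothetical API database keyed by (function, repository, model name).
API_DATABASE = {
    ("torch.hub.load", "pytorch/vision", "densenet121"),
    ("torch.hub.load", "pytorch/vision", "resnet50"),
}


def call_key(code):
    """Extract (dotted function name, first literal arg, second literal arg)."""
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            literals = [a.value for a in node.args if isinstance(a, ast.Constant)]
            return (ast.unparse(node.func), *literals[:2])
    return None


def classify(generated_code, expected_key, database=API_DATABASE):
    """Label a generated call as correct, an error, or a hallucination."""
    key = call_key(generated_code)
    if key == expected_key:
        return "correct"
    if key in database:
        # A real API, but not the one the query called for.
        return "error"
    # Not present in the database at all.
    return "hallucination"


expected = ("torch.hub.load", "pytorch/vision", "densenet121")
print(classify("torch.hub.load('pytorch/vision', 'resnet50')", expected))
print(classify("torch.hub.load('pytorch/vision', 'supernet9000')", expected))
```

Under these assumptions, picking the wrong but existing model counts as an error, while inventing a model name that appears nowhere in the database counts as a hallucination.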

APIBench: A dataset constructed in this paper containing over 11,000 {instruction, API} pairs derived from TorchHub, TensorHub, and HuggingFace model cards.

Retriever-Aware training: Training the LLM with the retrieved API documentation explicitly included in the prompt, teaching it to rely on the provided context rather than memorized knowledge.

Zero-shot (in this context): Providing the LLM with only the user prompt and no retrieved documentation or in-context examples during inference.