Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models

📝 Paper Summary

Multi-call tool use with flexible plan | Benchmark datasets | Metrics and evaluation

The ToolRet benchmark reveals that conventional information retrieval models struggle with selecting correct tools for LLM agents, prompting the release of a large-scale training dataset that significantly improves retrieval performance.

Core Problem

Conventional Information Retrieval (IR) models perform poorly on tool retrieval tasks because of the domain shift from document search and the low term overlap between user queries and tool documentation.

Why it matters:

Current benchmarks simplify tool use by pre-selecting small sets (10-20 tools), failing to simulate real-world scenarios with massive tool libraries (e.g., 50k+ APIs)
Retrieval quality directly bottlenecks LLM agents; if the initial retrieval step fails to find the right tool, the agent cannot solve the task regardless of its reasoning capability
Existing semantic retrievers are often ad-hoc or trained on specific datasets, lacking systematic evaluation across diverse tool types

Concrete Example: In a pilot experiment on ToolBench, replacing officially annotated toolsets with those retrieved by ColBERTv2 caused a substantial drop in agent pass rates, showing that even strong retrievers fail to find the correct tools from a large corpus.

Key Novelty

ToolRet Benchmark & Training Set

Constructs the first large-scale, heterogeneous tool retrieval benchmark (ToolRet) by aggregating and standardizing diverse tool-use datasets into a unified retrieval format with generated instructions
Provides a massive training dataset (ToolRet-train) with over 200k instances, pairing tasks with hard negatives and instructions to specifically optimize IR models for the nuances of tool selection

Evaluation Highlights

State-of-the-art retrieval model NV-embed-v1 achieves only 33.83 nDCG@10 on ToolRet, significantly lower than its performance on standard IR benchmarks
Fine-tuning models on the proposed ToolRet-train dataset yields substantial gains; e.g., BGE-base improves from 25.84 to 68.60 nDCG@10 (illustrative figures, attributed to Table 3 of the paper)
End-to-end evaluation shows that improved retrieval directly increases LLM agent task pass rates compared to using off-the-shelf retrievers

Breakthrough Assessment

8/10

Addresses a critical, overlooked bottleneck in agentic AI (retrieval) with a comprehensive benchmark and a high-impact training resource that enables immediate improvements for the community.

⚙️ Technical Details

Problem Definition

Setting: Retrieval of relevant tools from a large corpus given a natural language user query

Inputs: User query q (optionally paired with an instruction)

Outputs: Ranked list of tools T from the corpus
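
The setting above amounts to scoring every tool against the query and sorting by score. A minimal sketch, assuming a toy bag-of-words embedding in place of a real encoder; `embed`, `cosine`, `rank_tools`, and the three sample tools are hypothetical and not taken from the paper:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real dense encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_tools(query: str, corpus: dict[str, str], k: int = 10) -> list[str]:
    """Return the ids of the k tools whose documentation best matches the query."""
    q = embed(query)
    scored = [(cosine(q, embed(doc)), tool_id) for tool_id, doc in corpus.items()]
    # Highest score first; ties broken by tool id for a deterministic order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [tool_id for _, tool_id in scored[:k]]


# Hypothetical three-tool corpus (the real corpus has about 43k tools).
corpus = {
    "weather_api": "get the current weather forecast for a city",
    "image_resize": "resize an image to a given width and height",
    "currency_convert": "convert an amount between two currencies",
}
ranking = rank_tools("what is the weather forecast in Paris", corpus, k=2)
```

Because this toy scorer depends entirely on shared words, it also illustrates why low term overlap between queries and tool documentation hurts lexical matching.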

Pipeline Flow

Data Collection (Standardizing 30+ datasets)
Data Sampling (Clustering tasks & merging toolsets)
Instruction Generation (Target-aware LLM generation)
Evaluation / Training (Benchmarking & Fine-tuning)

System Modules

Instruction Generator

Generate relevance instructions for each query to support instructional retrieval

Model or implementation: GPT-4o
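
This summary does not specify how a generated instruction is combined with the query. One common convention among instruction-tuned embedding models is to prepend the instruction to the query text. The sketch below assumes an `Instruct: ... / Query: ...` template; both the template and the helper `format_query` are illustrative assumptions, not the paper's format:

```python
from typing import Optional


def format_query(query: str, instruction: Optional[str] = None) -> str:
    """Build the text passed to the retriever's query encoder.

    The template is an assumption for illustration only.
    """
    if not instruction:
        # Plain retrieval: the query is used unchanged.
        return query
    return f"Instruct: {instruction.strip()}\nQuery: {query.strip()}"


plain = format_query("crop this photo to a square")
guided = format_query(
    "crop this photo to a square",
    "Retrieve tools that modify images",
)
```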

Retriever

Select top-k relevant tools from the 43k tool corpus

Model or implementation: Various (BM25, Contriever, BGE, E5, NV-embed-v1)

Novel Architectural Elements

Target-aware instruction generation pipeline for creating instructional retrieval benchmarks in the tool domain

Modeling

Base Model: Evaluates multiple backends: BM25, Contriever, BGE (base/large), E5, NV-embed-v1

Training Method: Contrastive Learning (standard dense retrieval fine-tuning)

Objective Functions:

Purpose: Maximize similarity between query and positive tool while minimizing similarity to negatives.

Formally: InfoNCE or similar contrastive loss (implied by standard dense retrieval training)
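
The summary does not give the exact loss. For reference, a common form of InfoNCE for dense retrieval is shown below. The symbols $s(\cdot,\cdot)$ (similarity score), $\tau$ (temperature), $t^{+}$ (positive tool), and $\mathcal{N}$ (set of negative tools) are notation introduced here, not taken from the paper:

```latex
\mathcal{L}(q) = -\log
  \frac{\exp\big(s(q, t^{+})/\tau\big)}
       {\exp\big(s(q, t^{+})/\tau\big) + \sum_{t^{-} \in \mathcal{N}} \exp\big(s(q, t^{-})/\tau\big)}
```

Minimizing this loss raises the score of the positive tool relative to every negative, which matches the stated purpose of the objective.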

Training Data:

Sources: ToolACE, APIGen, ToolBench training sets
200k+ instances
Each instance: Query, Generated Instruction, Target Tools, 10 Hard Negatives (mined via NV-embed-v1)
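
A minimal sketch of how one training instance could be assembled and scored, assuming precomputed retriever scores. Hard negatives are taken as the highest-scoring tools that are not targets, and the loss is a standard InfoNCE over one positive and its negatives. The function names, the temperature of 0.05, and the toy scores are assumptions; the paper's mining and training code may differ:

```python
import math


def mine_hard_negatives(scores: dict[str, float], targets: set[str], n: int = 10) -> list[str]:
    """Pick the n highest-scoring tools that are NOT target tools."""
    candidates = [(s, t) for t, s in scores.items() if t not in targets]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in candidates[:n]]


def info_nce(pos_score: float, neg_scores: list[float], temperature: float = 0.05) -> float:
    """InfoNCE loss for one positive against a list of negatives."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy retriever scores for one query (hypothetical values).
scores = {"resize_image": 0.82, "crop_image": 0.79, "rotate_image": 0.75, "get_weather": 0.10}
targets = {"resize_image"}

negatives = mine_hard_negatives(scores, targets, n=2)
hard_loss = info_nce(scores["resize_image"], [scores[t] for t in negatives])
easy_loss = info_nce(scores["resize_image"], [scores["get_weather"]])
```

In this toy case the hard negatives produce a much larger loss than the easy one, which is the usual motivation for mining them: they supply a stronger training signal than random irrelevant tools.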

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolBench: ToolRet provides a dedicated, heterogeneous retrieval benchmark with 43k tools vs. ad-hoc retrieval setups
vs. MTEB: ToolRet focuses specifically on tool documentation and API queries, which have different linguistic properties (low overlap, functional intent) compared to general text
vs. Standard IR (BM25/Contriever): Shows these general models fail on tools; proposes domain-specific fine-tuning data

Limitations

Benchmark is built from a snapshot covering Aug 2023 to Dec 2024; tool APIs change rapidly, so it may become dated
Evaluation relies primarily on offline IR metrics (nDCG, Recall), though some end-to-end analysis is provided
Does not explicitly model multi-turn retrieval or clarifying questions from the agent

Reproducibility

Code: https://github.com/Tool-Retrieval-Benchmark/ToolRet

Publicly available: the ToolRet benchmark and the ToolRet-train dataset, on HuggingFace and GitHub. Missing: fine-tuning hyperparameters (learning rate, batch size) and the compute resources used for the fine-tuning experiments are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Retrieve relevant tools from a corpus of 43,215 tools given a user query.

Benchmarks:

ToolRet (Tool Retrieval) [New]

Metrics:

nDCG@10
Recall@10
Pass Rate (End-to-End)
Statistical methodology: Not explicitly reported in the paper
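
The two offline metrics can be sketched as follows, assuming binary relevance (a tool is either a target or not) and a ranked list without duplicates. The helper names and sample data are hypothetical, and the paper's evaluation code may differ in details such as graded relevance:

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of relevant tools that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for tool in ranked[:k] if tool in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """nDCG@k with binary relevance (1 if the tool is a target, else 0)."""
    # Discounted gain of the actual ranking: position i (0-based) is discounted by log2(i + 2).
    dcg = sum(1.0 / math.log2(i + 2) for i, tool in enumerate(ranked[:k]) if tool in relevant)
    # Ideal ranking places all relevant tools first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


ranked = ["crop_image", "resize_image", "get_weather"]
relevant = {"resize_image"}

recall = recall_at_k(ranked, relevant, k=10)
ndcg = ndcg_at_k(ranked, relevant, k=10)
```

Here the single target sits at rank 2, so recall is perfect while nDCG is penalized for the position. These functions return values in the range 0 to 1; the scores quoted in this summary (e.g., 33.83) appear to be the same quantities on a 0 to 100 scale.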

Key Results

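Benchmark	Metric	Baseline	This Paper	Δ
ToolRet	nDCG@10	33.83	33.83	0.00
ToolRet	nDCG@10	26.96	26.96	0.00
ToolRet	nDCG@10	25.84	68.60	+42.76
ToolRet	Recall@5	34.50	80.62	+46.12
Rows with a Δ of 0.00 appear to report a single zero-shot score for an off-the-shelf retriever, with no fine-tuned counterpart (33.83 is the NV-embed-v1 result). The remaining rows show that fine-tuning on ToolRet-train significantly improves the performance of smaller models over their zero-shot baselines (25.84 to 68.60 is the BGE-base result).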

Experiment Figures

Pilot experiment results on ToolBench showing the impact of retrieval on agent performance

Main Takeaways

Conventional IR models, even powerful ones like NV-embed-v1, are not 'tool-savvy' zero-shot, struggling with the specific semantics of tool retrieval (low term overlap).
Fine-tuning on the proposed ToolRet-train dataset yields large improvements (e.g., a gain of roughly 43 nDCG@10 points for BGE-base, from 25.84 to 68.60), demonstrating the value of domain-specific training data.
Low retrieval quality is a confirmed bottleneck: pilot experiments showed agent success rates drop significantly when using retrieved tools vs. ground truth tools.
The benchmark is heterogeneous, covering web APIs, code functions, and customized apps, ensuring models are tested across diverse tool types.

📚 Prerequisite Knowledge

Prerequisites

Information Retrieval (IR) metrics (nDCG, Recall)
Dense Retrieval vs. Sparse Retrieval
Tool Learning / Agentic AI concepts

Key Terms

nDCG@10: Normalized Discounted Cumulative Gain at rank 10, a measure of ranking quality that considers the position of relevant items

hard negatives: Items that are irrelevant but look similar to the query/target, used in training to force the model to learn finer distinctions

instructional retrieval: Retrieval tasks where the query is accompanied by an explicit instruction describing the relevance criteria (e.g., 'Retrieve tools that modify images...')

ToolRet: The proposed benchmark containing 7.6k retrieval tasks and 43k tools

ToolRet-train: The proposed training dataset with >200k instances derived from ToolACE, APIGen, and ToolBench

NV-embed-v1: A state-of-the-art embedding model used as a strong baseline and for mining hard negatives in this paper