
Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models

Zhengliang Shi, Yuhan Wang, Lingyong Yan, Pengjie Ren, Shuaiqiang Wang, Dawei Yin, Zhaochun Ren
Shandong University, Baidu Inc., Leiden University
arXiv (2025)
Agent Benchmark

📝 Paper Summary

Multi-call tool use with flexible plan | Benchmark datasets | Metrics and evaluation
The ToolRet benchmark reveals that conventional information retrieval models struggle with selecting correct tools for LLM agents, prompting the release of a large-scale training dataset that significantly improves retrieval performance.
Core Problem
Conventional Information Retrieval (IR) models perform poorly on tool retrieval tasks because of the domain shift from document search and the low term overlap between user queries and tool documentation.
Why it matters:
  • Current benchmarks simplify tool use by pre-selecting small sets (10-20 tools), failing to simulate real-world scenarios with massive tool libraries (e.g., 50k+ APIs)
  • Retrieval quality directly bottlenecks LLM agents; if the initial retrieval step fails to find the right tool, the agent cannot solve the task regardless of its reasoning capability
  • Existing semantic retrievers are often ad-hoc or trained on specific datasets, lacking systematic evaluation across diverse tool types
Concrete Example: In a pilot experiment on ToolBench, replacing officially annotated toolsets with those retrieved by ColBERTv2 caused a substantial drop in agent pass rates, showing that even strong retrievers fail to find the correct tools from a large corpus.
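The low term overlap described under Core Problem can be made concrete with a small sketch. The query and tool description below are hypothetical, not drawn from ToolRet, and Jaccard overlap on lowercase word tokens is only one simple way to measure lexical overlap; the paper may quantify it differently.

```python
import re

def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard_overlap(query, doc):
    # Share of distinct tokens that appear in both the query and the document.
    q, d = tokenize(query), tokenize(doc)
    if not q and not d:
        return 0.0
    return len(q & d) / len(q | d)

# Hypothetical user request and tool documentation (assumptions for illustration).
query = "How warm will it be in Paris tomorrow?"
tool_doc = "get_forecast: returns temperature and precipitation for a city and date"

print(jaccard_overlap(query, tool_doc))  # 0.0: the two texts share no tokens
```

The tool is clearly the right one for the request, yet a purely lexical matcher sees nothing in common, which is the kind of gap a tool retriever has to bridge semantically.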
Key Novelty
ToolRet Benchmark & Training Set
  • Constructs the first large-scale, heterogeneous tool retrieval benchmark (ToolRet) by aggregating and standardizing diverse tool-use datasets into a unified retrieval format with generated instructions
  • Provides a massive training dataset (ToolRet-train) with over 200k instances, pairing tasks with hard negatives and instructions to specifically optimize IR models for the nuances of tool selection
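To show why pairing tasks with hard negatives matters for training, here is a minimal sketch of a contrastive (InfoNCE-style) loss over one positive tool and several negatives. This summary does not state which objective the authors use, so the loss, the temperature value, and the toy vectors are all assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce_loss(query_vec, pos_vec, neg_vecs, temperature=0.05):
    # Similarity of the query to the positive tool (index 0) and to each negative.
    scores = [dot(query_vec, pos_vec)] + [dot(query_vec, n) for n in neg_vecs]
    logits = [s / temperature for s in scores]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability assigned to the positive tool.
    return log_denominator - logits[0]

# Toy 2-d embeddings (assumed, for illustration only).
q = [1.0, 0.0]
positive = [1.0, 0.0]
easy_negative = [0.0, 1.0]   # unrelated tool, far from the query
hard_negative = [0.9, 0.4]   # similar-looking but wrong tool

print(info_nce_loss(q, positive, [easy_negative]))
print(info_nce_loss(q, positive, [hard_negative]))
```

Under these assumptions the hard negative produces a noticeably larger loss than the easy one, so it contributes a stronger training signal for distinguishing near-duplicate tools.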
Evaluation Highlights
  • State-of-the-art retrieval model NV-embed-v1 achieves only 33.83 nDCG@10 on ToolRet, significantly lower than its performance on standard IR benchmarks
  • Fine-tuning models on the proposed ToolRet-train dataset yields substantial gains; e.g., BGE-base improves from 25.84 to 68.60 nDCG@10 (illustrative figures, referenced from Table 3 of the paper)
  • End-to-end evaluation shows that improved retrieval directly increases LLM agent task pass rates compared to using off-the-shelf retrievers
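For readers unfamiliar with the metric quoted above, the following is a minimal sketch of nDCG@10 with binary relevance, using the standard definition DCG@k = Σ rel_i / log2(i + 1) normalized by the ideal DCG. The tool identifiers are hypothetical, and the benchmark's own evaluation code may differ in details such as graded relevance. Scores like 33.83 are presumably this value multiplied by 100.

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: later ranks contribute less.
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    # Binary gain: 1 if the retrieved tool is a correct one, else 0.
    gains = [1.0 if tool_id in relevant_ids else 0.0 for tool_id in ranked_ids]
    # Ideal ranking puts every relevant tool at the top.
    ideal = [1.0] * min(len(relevant_ids), k)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# Hypothetical retrieval result: the single correct tool is ranked second.
ranked = ["currency_converter", "weather_forecast", "news_search"]
relevant = {"weather_forecast"}

print(round(ndcg_at_k(ranked, relevant), 4))  # 0.6309
```

Because the discount is logarithmic, moving the correct tool from rank 1 to rank 2 already costs more than a third of the score, which is why the metric is sensitive to retrievers that find the right tool but rank it below distractors.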
Breakthrough Assessment
8/10
Addresses a critical, overlooked bottleneck in agentic AI (retrieval) with a comprehensive benchmark and a high-impact training resource that enables immediate improvements for the community.