Toolshed: Scale Tool-Equipped Agents with Advanced RAG-Tool Fusion and Tool Knowledge Bases

📝 Paper Summary

Tool Retrieval, Agentic RAG pipeline

Toolshed adapts advanced document retrieval techniques—such as query decomposition, synthetic question augmentation, and reranking—to enable LLM agents to accurately select tools from libraries of thousands without fine-tuning.

Core Problem

LLM agents struggle to select the correct tools from large libraries (e.g., >1000 tools) because simple semantic matching fails on complex queries and context-window limits prevent loading every tool definition into the prompt.

Why it matters:

Scaling agents to enterprise tasks (e.g., secure database interactions) requires access to thousands of specialized tools, exceeding the typical 128-tool API limit of providers like OpenAI
Current retrievers rely on basic tool names/descriptions, which lack the semantic depth to match vague user intents or multi-step reasoning needs
Fine-tuning models for tool selection is expensive and brittle; inference-time solutions are needed for adaptability

Concrete Example: A user asks 'What is a neural network?' A simple retriever might miss relevant tools because the query is abstract. The proposed system expands the query into diverse intents (research, web search, educational course) and retrieves tools for each one.

Key Novelty

Advanced RAG-Tool Fusion

Treats tool selection as an Advanced RAG (Retrieval-Augmented Generation) problem rather than a classification task
Enhances tool indexing by appending synthetic questions and argument schemas to vector embeddings (Pre-retrieval)
Transforms user queries via decomposition and multi-query expansion to cast a wider semantic net before filtering results with an LLM reranker (Intra/Post-retrieval)
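
To make the pre-retrieval step concrete, here is a minimal sketch of how an 'enhanced' tool document could be assembled before embedding. The field labels, the ordering, and the `get_weather` tool are assumptions made for illustration; the paper's exact document layout is not given in this summary.

```python
import json

def build_tool_document(name, description, parameters, synthetic_questions):
    """Concatenate a tool's fields into one text document for embedding.

    The labels and ordering below are assumptions made for this sketch.
    """
    lines = [
        f"Tool name: {name}",
        f"Description: {description}",
        # Serialize the argument schema so parameter names and types are embedded too.
        f"Arguments: {json.dumps(parameters, sort_keys=True)}",
        "Example questions this tool can answer:",
    ]
    lines.extend(f"- {q}" for q in synthetic_questions)
    return "\n".join(lines)

# Hypothetical tool, used only to show the shape of the resulting document.
doc = build_tool_document(
    name="get_weather",
    description="Return the current weather for a city.",
    parameters={"city": {"type": "string"}, "units": {"type": "string"}},
    synthetic_questions=[
        "What is the temperature in Paris right now?",
        "Do I need an umbrella in Tokyo today?",
    ],
)
print(doc)
```

In a real system the synthetic questions would be generated by an LLM and the resulting string passed to an embedding model. Both steps are left out here so the sketch stays self-contained.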

Architecture

The three-phase Advanced RAG-Tool Fusion pipeline: Pre-retrieval (indexing), Intra-retrieval (query processing), and Post-retrieval (reranking).

Evaluation Highlights

Achieves 98.67% Recall@5 on the Seal-Tools benchmark, outperforming the previous state-of-the-art Seal-Tools retriever (57.19%) by over 41 percentage points
Outperforms Re-Invoke by 9.09 percentage points on the ToolE Multi-tool benchmark (92.51% vs 83.42%), demonstrating superior handling of multi-step tasks
Maintains near-100% retrieval accuracy even when scaling the tool library size (tool-M) from 100 to 4,000, whereas baseline performance degrades significantly

Breakthrough Assessment

7/10

Strong empirical results on scaling tool retrieval without fine-tuning. While it aggregates existing RAG techniques, applying them rigorously to the tool-selection domain addresses a critical bottleneck for agent deployment.

⚙️ Technical Details

Problem Definition

Setting: Given a user query Q, retrieve from a large tool library (size M) the top-k subset of tools needed to solve Q, so that the agent can successfully execute the task.

Inputs: Natural language user query Q and a large set of available tool definitions

Outputs: A filtered list of k tool definitions (JSON schemas) relevant to Q

Pipeline Flow

Indexing Group: Tool Definition -> Augmentation (Schema/Synthetic Qs) -> Vector Store
Retrieval Group: User Query -> Decomposition & Expansion -> Vector Search -> Reranking
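
The retrieval group can be sketched as one vector search per sub-query, with the results merged. This is a rough illustration, not the paper's implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, the sub-queries are written by hand where the paper uses an LLM, and the tool names and descriptions are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(sub_queries, tool_docs, per_query_k=2):
    """Search once per sub-query and keep each tool's best score."""
    index = {name: embed(doc) for name, doc in tool_docs.items()}
    best = {}
    for q in sub_queries:
        q_vec = embed(q)
        scores = {name: cosine(q_vec, vec) for name, vec in index.items()}
        for name in sorted(scores, key=scores.get, reverse=True)[:per_query_k]:
            best[name] = max(best.get(name, 0.0), scores[name])
    # Merged candidate list, highest score first.
    return sorted(best, key=best.get, reverse=True)

# Hypothetical tool library.
tool_docs = {
    "get_weather": "get weather forecast temperature for a city",
    "book_flight": "book a flight ticket between two cities",
    "convert_currency": "convert an amount of money between currencies",
    "send_email": "send an email message to a recipient",
}
# Hand-written decomposition of "Book a flight to Rome and tell me the weather there".
sub_queries = ["book a flight to Rome", "weather forecast for Rome"]
candidates = retrieve(sub_queries, tool_docs, per_query_k=1)
print(candidates)
```

Searching per sub-query means each step of a multi-step request can surface its own tool, where a single search on the full query might favor only one of them. The merged candidates would then go to the reranking stage.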

System Modules

Tool Indexer

Enhance tool representations before storage

Model or implementation: Azure OpenAI text-embedding-3-large

Query Planner (Intra-retrieval)

Break complex queries into logical sub-steps

Model or implementation: Azure OpenAI GPT-4o

Query Expander (Intra-retrieval)

Generate variations of queries to capture different intents

Model or implementation: Azure OpenAI GPT-4o

Reranker

Filter and rank the retrieved tools to select the final top-k

Model or implementation: Azure OpenAI GPT-4o (LLM-based reranker)
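
As a rough sketch of the reranking step, the function below orders candidate tools with a pluggable scoring callable and keeps the top-k. The scorer `overlap_score` is a hypothetical placeholder based on word overlap; in the paper this role is played by an LLM (GPT-4o), whose prompt and output format are not described in this summary. The candidate tools are also hypothetical.

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Order candidate tools by score_fn(query, tool_doc) and keep the top_k.

    score_fn stands in for the LLM reranker; any callable returning a number works.
    """
    scored = [(score_fn(query, doc), name) for name, doc in candidates.items()]
    # Sort by descending score; ties break alphabetically so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]

def overlap_score(query, doc):
    """Hypothetical placeholder scorer: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

# Hypothetical candidates returned by the vector search stage.
candidates = {
    "search_papers": "search academic papers about a topic",
    "web_search": "search the web for a topic",
    "send_email": "send an email to a recipient",
}
top = rerank("search papers about neural networks", candidates, overlap_score, top_k=2)
print(top)
```

Keeping the scorer as a parameter makes it easy to swap the placeholder for an LLM call without changing the surrounding pipeline.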

Novel Architectural Elements

Application of 'Reverse HyDE' (synthetic questions) specifically to Tool Definitions in vector stores
Ensemble of query decomposition AND multi-query expansion applied prior to tool retrieval

Modeling

Base Model: Azure OpenAI GPT-4o (2024-05-13)

📊 Experiments & Results

Evaluation Setup

Tool retrieval accuracy is measured on two tool-use benchmarks, Seal-Tools and ToolE.

Benchmarks:

Seal-Tools: large-scale tool retrieval (~4,000 tools)
ToolE: single- and multi-hop tool retrieval (~200 tools)

Metrics:

Recall@k (k=1, 5, 10)
Tool Calling Accuracy (Name, Parameter Key, Parameter Value)
Statistical methodology: Not explicitly reported in the paper
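
Recall@k, the main retrieval metric above, can be computed as follows. This is a minimal sketch under one assumption: the per-query scores are averaged with equal weight, which is the usual convention but is not confirmed by this summary. The tool names are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant tools that appear in the first k retrieved tools."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("relevant must contain at least one tool")
    return len(set(retrieved[:k]) & relevant_set) / len(relevant_set)

def mean_recall_at_k(runs, k):
    """Average Recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)

# Hypothetical ranked output for one query that needs two tools.
retrieved = ["web_search", "get_weather", "send_email", "book_flight", "translate", "book_hotel"]
relevant = ["get_weather", "book_hotel"]

r1 = recall_at_k(retrieved, relevant, 1)    # neither relevant tool is ranked first
r5 = recall_at_k(retrieved, relevant, 5)    # only get_weather is in the top 5
r10 = recall_at_k(retrieved, relevant, 10)  # both relevant tools are retrieved
print(r1, r5, r10)
```

For multi-tool queries, Recall@k only reaches 1.0 when every required tool appears in the top k, which is why it is a demanding measure for multi-step tasks.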

Key Results

Retrieval performance comparisons against baselines on standard tool benchmarks (Δ in percentage points).
Benchmark	Metric	Baseline	This Paper	Δ
Seal-Tools	Recall@5	57.19 (Seal-Tools retriever)	98.67	+41.48
ToolE (Multi-Tool)	Recall@5	83.42 (Re-Invoke)	92.51	+9.09
ToolE (Single-Tool)	Recall@5	89.10	94.02	+4.92
ToolE (Single-Tool)	Recall@5	48.12	94.02	+45.90
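
The Δ column is an absolute difference in percentage points, which is not the same as a relative gain over the baseline. The short check below recomputes both from the values in the table, and uses no other data.

```python
# (benchmark, baseline Recall@5, this paper's Recall@5), copied from the table.
rows = [
    ("Seal-Tools", 57.19, 98.67),
    ("ToolE (Multi-Tool)", 83.42, 92.51),
    ("ToolE (Single-Tool)", 89.10, 94.02),
    ("ToolE (Single-Tool)", 48.12, 94.02),
]

def point_gain(baseline, ours):
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(ours - baseline, 2)

def relative_gain(baseline, ours):
    """Gain relative to the baseline, in percent."""
    return 100.0 * (ours - baseline) / baseline

deltas = [point_gain(b, o) for _, b, o in rows]
for (name, b, o), d in zip(rows, deltas):
    print(f"{name}: +{d:.2f} points, {relative_gain(b, o):.1f}% relative")
```

The recomputed point gains match the Δ column. The relative gains are larger wherever the baseline is low, so the two kinds of figure should not be quoted interchangeably.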

Experiment Figures

Retrieval accuracy as the total number of tools (tool-M) increases, for Seal-Tools DPR (Fig. 4) and Advanced RAG-Tool Fusion (Fig. 5).

Main Takeaways

Advanced RAG-Tool Fusion significantly outperforms BM25 and specialized dense retrievers (Seal-Tools, Re-Invoke) across all benchmarks.
The system is highly robust to scaling: as the total number of tools (tool-M) increases to 4,000, retrieval accuracy remains >95%, whereas baselines drop significantly.
Increasing the tool selection threshold (top-k) improves retrieval recall, but the method achieves high accuracy even at low k values (e.g., k=5), saving context window space.
The ensemble approach (decomposition + expansion + reranking) is critical for multi-hop queries where simple similarity search fails.

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG (Retrieval-Augmented Generation) pipelines
Familiarity with Vector Databases and Embeddings
Knowledge of LLM Function Calling / Tool Use APIs (e.g., OpenAI Tools)

Key Terms

Advanced RAG-Tool Fusion: The paper's proposed ensemble method applying pre-, intra-, and post-retrieval optimization techniques to tool selection

Toolshed Knowledge Base: A vector database storing 'enhanced' tool documents (concatenating name, description, schema, synthetic questions) for retrieval

tool-M: The total number of tools available in the knowledge base (the search space size)

top-k: The number of tools retrieved and presented to the LLM agent's context window

Recall@k: A metric measuring the proportion of relevant tools found within the top-k retrieved results

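HyDE: Hypothetical Document Embeddings, a RAG technique in which the model generates a hypothetical 'ideal' answer document for a query and embeds it in place of the raw query. The 'Reverse HyDE' variant used here works in the other direction: it generates synthetic questions for each tool and embeds them with the tool document.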

BM25: Best Matching 25, a standard probabilistic ranking function that scores documents by term frequency and inverse document frequency, with normalization for document length

Zero-shot: Using a model to perform a task without any specific training examples or fine-tuning for that task