
Toolshed: Scale Tool-Equipped Agents with Advanced RAG-Tool Fusion and Tool Knowledge Bases

Elias Lumer, Vamse Kumar Subbiah, James A. Burke, Pradeep Honaganahalli Basavaraju, Austin Huber
Innovation Hub, PricewaterhouseCoopers
arXiv (2024)
RAG Agent Benchmark

📝 Paper Summary

Tool Retrieval Agentic RAG pipeline
Toolshed adapts advanced document retrieval techniques—such as query decomposition, synthetic question augmentation, and reranking—to enable LLM agents to accurately select tools from libraries of thousands without fine-tuning.
Core Problem
LLM agents struggle to select the correct tools from large libraries (e.g., >1000 tools) because simple semantic matching fails on complex queries, and model context limits prevent loading all tool definitions.
Why it matters:
  • Scaling agents to enterprise tasks (e.g., secure database interactions) requires access to thousands of specialized tools, far more than the 128-tool-per-request limit of provider APIs such as OpenAI's
  • Current retrievers rely on basic tool names and descriptions, which lack the semantic depth to match vague user intents or multi-step reasoning needs
  • Fine-tuning models for tool selection is expensive and brittle; inference-time solutions are needed for adaptability
Concrete Example: A user asks, 'What is a neural network?' A simple retriever might miss relevant tools because the query is abstract. The proposed system expands it into diverse intents (research, web search, educational courses) and retrieves tools for each angle.
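The expansion idea in this example can be illustrated with a toy sketch. Everything here is a hypothetical stand-in, not the paper's implementation: the tool catalog is invented, `expand_query` returns hand-written intents where the real system would call an LLM, and a bag-of-words cosine similarity replaces the dense embedding retriever. With this lexical scorer the abstract question shares no words with any tool description, so it retrieves nothing on its own, while the expanded intents each surface a different tool.

```python
import math
from collections import Counter

# Hypothetical tool catalog: name -> description.
TOOLS = {
    "arxiv_search": "search research papers and academic articles",
    "web_search": "search the web for pages and definitions",
    "course_finder": "find online courses to learn any subject",
    "weather_lookup": "get the current weather forecast for any city",
}

def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector (toy stand-in for an embedding)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return up to k tool names most similar to the query, dropping zero scores."""
    q = bow(query)
    scored = sorted(((cosine(q, bow(desc)), name) for name, desc in TOOLS.items()), reverse=True)
    return [name for score, name in scored[:k] if score > 0]

def expand_query(query: str) -> list[str]:
    # Hypothetical: a real system would ask an LLM to rewrite the query into diverse intents.
    return [
        query,
        "find research papers about neural networks",
        "search the web for a definition of neural network",
        "find online courses to learn neural networks",
    ]

def multi_query_retrieve(query: str, k: int = 1) -> list[str]:
    """Retrieve for every expanded query and merge results, keeping first-seen order."""
    merged: list[str] = []
    for q in expand_query(query):
        for name in retrieve(q, k):
            if name not in merged:
                merged.append(name)
    return merged

print(retrieve("What is a neural network?"))              # []
print(multi_query_retrieve("What is a neural network?"))  # ['arxiv_search', 'web_search', 'course_finder']
```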
Key Novelty
Advanced RAG-Tool Fusion
  • Treats tool selection as an Advanced RAG (Retrieval-Augmented Generation) problem rather than a classification task
  • Enhances tool indexing by appending synthetic questions and argument schemas to each tool's document before it is embedded (Pre-retrieval)
  • Transforms user queries via decomposition and multi-query expansion to cast a wider semantic net before filtering results with an LLM reranker (Intra/Post-retrieval)
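The pre-retrieval enrichment in the list above can be sketched as building one text document per tool, which is then passed to an embedding model (not shown). This is a minimal sketch under assumptions: the `ToolDoc` class, the field labels, and the line layout are hypothetical; the summary states only that synthetic questions and argument schemas are appended to the tool's indexed text.

```python
from dataclasses import dataclass, field

@dataclass
class ToolDoc:
    """Hypothetical container for the fields that describe one tool."""
    name: str
    description: str
    arguments: dict[str, str] = field(default_factory=dict)  # argument name -> description
    synthetic_questions: list[str] = field(default_factory=list)  # e.g. LLM-generated user questions

def build_index_text(tool: ToolDoc) -> str:
    """Concatenate name, description, argument schema, and synthetic questions into one string to embed."""
    lines = [f"Tool name: {tool.name}", f"Description: {tool.description}"]
    if tool.arguments:
        lines.append("Arguments: " + "; ".join(f"{k}: {v}" for k, v in tool.arguments.items()))
    if tool.synthetic_questions:
        lines.append("Example questions: " + " | ".join(tool.synthetic_questions))
    return "\n".join(lines)

tool = ToolDoc(
    name="get_weather",
    description="Return the forecast for a city.",
    arguments={"city": "city name, e.g. 'Paris'", "days": "number of forecast days"},
    synthetic_questions=["Will it rain in Paris tomorrow?", "What's the weather this weekend in Oslo?"],
)
print(build_index_text(tool))
```

The synthetic questions put user-style phrasings into the indexed text, so a vaguely worded query has more chances to land near the right tool in embedding space than it would against a terse description alone.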
Architecture
Figure 2. The three-phase Advanced RAG-Tool Fusion pipeline: Pre-retrieval (indexing), Intra-retrieval (query processing), and Post-retrieval (reranking).
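The post-retrieval phase can be sketched as deduplicating the candidate pool gathered from all rewritten queries, scoring each candidate against the original query, and keeping the top k. In the described system the scorer is an LLM reranker; here `score_fn` is a pluggable callable, and `overlap_score` is a hypothetical word-overlap stand-in used only to make the example runnable.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_k: int) -> list[str]:
    """Drop duplicate candidates, score each against the original query, keep the best top_k."""
    unique = list(dict.fromkeys(candidates))  # dedupe while keeping first-seen order
    ranked = sorted(unique, key=lambda tool: score_fn(query, tool), reverse=True)  # stable on ties
    return ranked[:top_k]

def overlap_score(query: str, tool: str) -> float:
    # Hypothetical stand-in for an LLM relevance judgment: words shared by query and tool name.
    return float(len(set(query.lower().split()) & set(tool.lower().split("_"))))

# Candidates pooled from several sub-queries, including a duplicate.
pool = ["web_search", "course_finder", "web_search", "weather_lookup", "paper_search"]
print(rerank("search for a paper on neural networks", pool, overlap_score, top_k=2))
# ['paper_search', 'web_search']
```

Keeping the reranker behind a callable lets the cheap, wide intra-retrieval step stay independent of the more expensive relevance judgment applied to the merged pool.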
Evaluation Highlights
  • Achieves 98.67% Recall@5 on the Seal-Tools benchmark, outperforming the previous state-of-the-art Seal-Tools retriever (57.19%) by over 41 percentage points
  • Outperforms Re-Invoke by 9.09 percentage points on the ToolE Multi-tool benchmark (92.51% vs 83.42%), demonstrating stronger handling of multi-step tasks
  • Maintains near-100% retrieval accuracy even when scaling the tool library size (tool-M) from 100 to 4,000, whereas baseline performance degrades significantly
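Recall@5 figures like those above are per-query recall averaged over an evaluation set. The sketch below uses one common definition (the share of a query's ground-truth tools found in its top-k retrieved list); the exact evaluation scripts for Seal-Tools and ToolE may differ in details such as how multi-tool queries are scored, and the tool IDs are made up.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the ground-truth tools that appear among the top-k retrieved tools."""
    if not relevant:
        return 0.0  # convention chosen here for queries without labels
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average per-query Recall@k over (retrieved, relevant) pairs."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)

# Hypothetical evaluation set: a two-tool query (one tool missed beyond rank 5) and a single-tool query.
runs = [
    (["a", "b", "c", "d", "e", "f"], {"a", "f"}),  # 'f' is at rank 6 -> recall 0.5
    (["x", "y"], {"x"}),                            # recall 1.0
]
print(mean_recall_at_k(runs, k=5))  # 0.75
```

Differences between such averages are naturally reported in percentage points, e.g. 98.67% minus 57.19% is 41.48 points.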
Breakthrough Assessment
7/10
Strong empirical results on scaling tool retrieval without fine-tuning. Although the method combines existing RAG techniques rather than introducing new ones, applying them rigorously to tool selection addresses a critical bottleneck for agent deployment.