Tool-to-Agent Retrieval: Bridging Tools and Agents for Scalable LLM Multi-Agent Systems

📝 Paper Summary

Multi-Agent Systems, Tool Retrieval, Agentic RAG pipeline

Tool-to-Agent Retrieval embeds tools and their parent agents in a shared vector space, allowing a single retrieval step to identify the best agent bundle even when the query matches only specific tool functionality.

Core Problem

Existing retrieval methods for multi-agent systems either match queries against coarse agent descriptions (missing fine-grained capabilities) or against individual tools (losing necessary agent context and coordination).

Why it matters:

Agent-first pipelines hide relevant tools if the parent agent description is too generic or brief.
Tool-only retrieval forgoes the benefit of loading a cohesive 'bundle' of tools (an agent), which multi-step workflows often need.
Sending all tool definitions to an LLM is cost-prohibitive (e.g., one server with 26 tools consumes >4,600 tokens).

Concrete Example: A user asks for 'code analysis'. An 'agent-first' retriever might miss a 'Python Utility Agent' if its description is vague, while a 'tool-only' retriever might find a 'syntax checker' tool but fail to load the necessary authentication or companion debugging tools provided by the parent agent.

Key Novelty

Unified Tool-to-Agent Indexing & Traversal

Embeds both individual tools and parent agents in the same vector database, rather than keeping them in separate hierarchical indices.
Uses metadata links to map retrieved tools back to their parent agents, allowing the system to surface the correct agent bundle even if the match was found at the fine-grained tool level.
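
The unified catalog and its metadata links can be pictured as a flat list of entries, each carrying a pointer to its parent agent. The sketch below is illustrative only: the `CatalogEntry` fields, the `build_catalog` and `resolve_agent` helpers, and the example agent are hypothetical names chosen here, not structures taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CatalogEntry:
    """One row of the unified index: either a tool or an agent (hypothetical schema)."""
    entry_id: str
    kind: str                     # "tool" or "agent"
    text: str                     # description that would be embedded
    parent_agent: Optional[str]   # metadata link; None for agent entries


def build_catalog(agents: dict) -> list[CatalogEntry]:
    """Flatten {agent: {"description": ..., "tools": {name: desc}}} into one list."""
    catalog = []
    for agent_name, spec in agents.items():
        catalog.append(CatalogEntry(agent_name, "agent", spec["description"], None))
        for tool_name, tool_desc in spec["tools"].items():
            catalog.append(
                CatalogEntry(f"{agent_name}/{tool_name}", "tool", tool_desc, agent_name)
            )
    return catalog


def resolve_agent(entry: CatalogEntry) -> str:
    """Follow the metadata link: a tool resolves to its parent, an agent to itself."""
    return entry.parent_agent if entry.kind == "tool" else entry.entry_id


# Made-up example data, loosely modeled on the 'code analysis' scenario.
agents = {
    "python_utility_agent": {
        "description": "General Python helpers.",
        "tools": {
            "syntax_checker": "Check Python source code for syntax errors.",
            "debugger": "Step through Python code and inspect variables.",
        },
    },
}
catalog = build_catalog(agents)
```

Because every entry resolves to exactly one agent, a hit on `syntax_checker` and a hit on the agent description both lead to the same bundle.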

Architecture

Conceptual diagram of Tool-to-Agent Retrieval. It shows how the User Query is matched against a Unified Index containing both Tools and Agents. Matches are then resolved via Metadata Links to identify the Top-K Agents.

Evaluation Highlights

+19.4% improvement in Recall@5 compared to MCPZero (state-of-the-art agent retriever) on LiveMCPBench.
+17.7% improvement in nDCG@5 compared to MCPZero on LiveMCPBench.
Consistently outperforms baselines across 8 different embedding models (e.g., +28% Recall@5 gain with Amazon Titan v2).

Breakthrough Assessment

7/10

Strong practical improvement for the specific problem of routing in large multi-agent ecosystems (MCP). While the architectural change is straightforward (unified index + aggregation), the consistent empirical gains across many models validate its effectiveness.

⚙️ Technical Details

Problem Definition

Setting: Retrieving the top-K relevant agents (MCP servers) for a given query from a catalog containing both agents and their constituent tools.

Inputs: Natural language query q (or decomposed sub-steps)

Outputs: Ranked list of K unique executable agents

Pipeline Flow

Indexing: Build unified catalog of Tools and Agents
Retrieval: Search top-N entities (tools or agents) using query
Aggregation: Map entities to parent Agents and select top-K unique
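
The three steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the catalog rows are invented, and the aggregation keeps agents in first-occurrence order, which may differ in detail from the paper's Algorithm 1.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Step 1 (Indexing): tools and agents share one index.
# Each row is (entity_id, parent_agent, description); agents are their own parent.
CATALOG = [
    ("code_agent", "code_agent", "Utilities for developers."),
    ("code_agent/syntax_checker", "code_agent", "Analyze Python code and report syntax errors."),
    ("weather_agent", "weather_agent", "Weather forecasts and climate data."),
    ("weather_agent/get_forecast", "weather_agent", "Return the forecast for a city."),
    ("files_agent", "files_agent", "Read and write files on disk."),
]
INDEX = [(eid, parent, embed(text)) for eid, parent, text in CATALOG]


def retrieve_agents(query: str, top_n: int = 4, top_k: int = 2) -> list[str]:
    q = embed(query)
    # Step 2 (Retrieval): rank all entities, tools and agents alike, and keep the top N.
    ranked = sorted(INDEX, key=lambda row: cosine(q, row[2]), reverse=True)[:top_n]
    # Step 3 (Aggregation): map each hit to its parent agent, keep the first K unique.
    agents = []
    for _eid, parent, _vec in ranked:
        if parent not in agents:
            agents.append(parent)
        if len(agents) == top_k:
            break
    return agents


print(retrieve_agents("analyze python code"))
```

In this toy catalog the agent description "Utilities for developers." shares no words with the query, so `code_agent` is ranked first only because its `syntax_checker` tool matches and is mapped back to its parent.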

System Modules

Unified Indexer

Creates a single vector index containing embeddings for both Tool descriptions and Agent descriptions

Model or implementation: Various Embedding Models (e.g., OpenAI text-embedding-3-small, Amazon Titan v2)

Hybrid Retriever

Retrieves the top-N most semantically relevant entities (tools or agents) for a query

Model or implementation: Dense Vector Search + BM25 (Lexical Search)
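
This summary does not state how the dense and BM25 rankings are combined. One common option, shown here purely as an assumption and not as the paper's method, is reciprocal rank fusion (RRF); the function name and the example rankings are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of entity ids into one list.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k = 60 is the constant commonly used with RRF.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, entity_id in enumerate(ranking, start=1):
            scores[entity_id] = scores.get(entity_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda eid: (-scores[eid], eid))


# Made-up rankings over a mixed tool/agent index.
dense_ranking = ["code_agent/syntax_checker", "code_agent", "files_agent"]
bm25_ranking = ["code_agent/syntax_checker", "weather_agent", "code_agent"]
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
```

RRF needs only ranks, not comparable scores, which is convenient when cosine similarities and BM25 scores live on different scales.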

Agent Aggregator

Resolves retrieved entities to their parent agents and deduplicates to form final ranked list

Model or implementation: Deterministic procedure (Algorithm 1 in the paper)

Novel Architectural Elements

Unified Vector Space for Heterogeneous Granularity: Embedding fine-grained tools and coarse-grained agents in the same index to allow entry at either level of abstraction
Retrieval-time Metadata Traversal: Resolving tool hits to agent bundles dynamically during ranking rather than committing to a fixed hierarchy beforehand

Modeling

Base Model: Evaluated across 8 embedding models: text-embedding-004 (Vertex), text-embedding-preview-0409 (Gemini), amazon.titan-embed-text-v1, amazon.titan-embed-text-v2:0, text-embedding-3-small/large (OpenAI), all-MiniLM-L6-v2, all-mpnet-base-v2

Compute: Not reported in the paper

Comparison to Prior Work

vs. ScaleMCP: ScaleMCP commits to an agent first, potentially missing relevant agents with poor descriptions; Tool-to-Agent retrieves at both levels simultaneously.
vs. MCPZero: MCPZero relies on high-level agent descriptions; Tool-to-Agent leverages granular tool semantics to boost agent recall.
vs. Tool-only: Tool-only ignores the execution context (auth, dependencies) of the parent agent; Tool-to-Agent maps back to the agent to ensure executability.

Limitations

Evaluation is limited to the LiveMCPBench dataset; performance on other tool-use benchmarks is not reported.
The method increases index size by storing an embedding for every tool, not just every agent (though the total number of tools is usually manageable).
Requires explicit metadata linking tools to agents; may not apply to unstructured tool collections without clear ownership.
No cost or latency analysis is provided for the larger retrieval search space (retrieving N entities, with N >> K, before aggregating to K agents).

Reproducibility

The paper uses the LiveMCPBench dataset and standard embedding models. Algorithm 1 is explicitly described. Code is not provided, but the method relies on standard vector database operations and metadata filtering.

📊 Experiments & Results

Evaluation Setup

Retrieving correct MCP servers for multi-step queries decomposed into sub-steps.

Benchmarks:

LiveMCPBench (Agent/Tool Retrieval for Real-world Questions)

Metrics:

Recall@5
nDCG@5
Mean Average Precision (mAP)
Statistical methodology: Standard deviation reported across embedding models to demonstrate stability.

Key Results

Performance comparison on LiveMCPBench using OpenAI text-embedding-3-small embeddings. Tool-to-Agent consistently outperforms baselines.
Benchmark	Metric	Baseline	This Paper	Δ
LiveMCPBench	Recall@5	0.77	0.91	+0.14
LiveMCPBench	nDCG@5	0.62	0.74	+0.12

Generalization across different embedding models. The method shows robust improvements regardless of the underlying embedding architecture.
Benchmark	Metric	Baseline	This Paper	Δ
LiveMCPBench	Recall@5	0.66	0.85	+0.19
LiveMCPBench	Recall@5	0.68	0.77	+0.09

Main Takeaways

Unified indexing significantly improves retrieval accuracy (Recall and nDCG) by bridging the semantic gap between specific tool functions and broad agent descriptions.
The approach is robust across embedding models, showing consistent gains with both state-of-the-art proprietary models (OpenAI, Vertex) and smaller open-source models (MiniLM).
Analysis of retrieved items shows a balanced mix of tool and agent matches (approx 34% tool-based, 39% agent-based), confirming that both levels of granularity are necessary for optimal routing.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with Vector Embeddings and Semantic Search
Concept of Multi-Agent Systems and Tool Use

Key Terms

MCP: Model Context Protocol—a standard that enables AI assistants to discover and connect to external data and tools via standardized servers

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents or tools

Recall@K: A metric measuring the proportion of relevant items found in the top K retrieved results

nDCG: Normalized Discounted Cumulative Gain—a ranking metric that credits algorithms for placing relevant items higher in the list

mAP: Mean Average Precision, the mean over queries of average precision, where average precision averages the precision measured at each rank that holds a relevant item
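
The three ranking metrics defined above can be computed with a few lines of code. The sketch assumes binary relevance and a ranked list without duplicates; the function names and the example ranking are illustrative and not taken from the paper's evaluation code.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2) for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Mean of precision@rank taken at each rank holding a relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Made-up example: two of three relevant items are retrieved, at ranks 1 and 3.
ranked = ["a", "x", "b", "y", "z"]
relevant = {"a", "b", "c"}
```

Here Recall@5 is 2/3, and average precision is (1/1 + 2/3) / 3, because the missing item "c" contributes zero. mAP is the mean of `average_precision` over all queries.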

BM25: Best Matching 25, a probabilistic ranking function for keyword matching that scores documents by term frequency, inverse document frequency, and document-length normalization
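
For readers who want to see the formula behind the definition, here is a compact sketch of Okapi BM25 scoring. It assumes a non-empty document list, a simple regex tokenizer, the smoothed IDF variant that stays positive, and the commonly used defaults k1 = 1.5 and b = 0.75; none of these choices are specified in the paper summary.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            # Smoothed IDF (always positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents are penalized via length_norm.
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Made-up tool descriptions.
docs = [
    "Check Python code for syntax errors.",
    "Return the weather forecast for a city.",
    "Read and write files on disk.",
]
scores = bm25_scores("python syntax", docs)
```

Only the first description contains the query terms, so it receives a positive score while the other two score zero.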

Dense Retrieval: Searching for documents using vector embeddings that capture semantic meaning, rather than just keyword matching

Step-wise Querying: Decomposing a complex user request into sequential sub-tasks and performing retrieval for each step independently

Context Dilution: The loss of specific details (like individual tool functions) when summarizing a large group of items into a single broad description