Efficient and Scalable Estimation of Tool Representations in Vector Space

📝 Paper Summary

Multi-call tool use with fixed plan, Retrieval

A two-stage tool retrieval system that replaces description-based embeddings with usage-based embeddings (Tool2Vec) and refines candidates using a multi-label classification model to improve LLM function calling.

Core Problem

LLMs cannot efficiently handle thousands of tools due to context window limits, and existing dense retrieval methods based on tool descriptions suffer from a semantic gap between user queries and technical descriptions.

Why it matters:

Passing thousands of function descriptions into an LLM's context window is often infeasible or prohibitively expensive
Existing retrieval methods rely on tool descriptions, which often fail to capture the nuances of how a user actually queries for a tool (the semantic gap)
Latency in real-time applications makes heavy reasoning-based retrieval (using an LLM to select tools) impractical

Concrete Example: A user asks 'What is Anna's email address?' A description-based retriever might fail to link this to a tool named 'find_email_address' because the description 'returns the email address for the given name' is semantically distant from the query. Tool2Vec bridges this gap by building the tool's representation from the embeddings of example queries such as this one.

Key Novelty

Tool2Vec + ToolRefiner (Two-Stage Usage-Driven Retrieval)

Replaces tool description embeddings with 'usage-driven' embeddings (Tool2Vec) derived from averaging embeddings of example user queries associated with each tool
Implements a retrieve-then-refine pipeline where a fast initial retriever prunes the search space, followed by a fine-tuned classifier (ToolRefiner) that considers tool-query interactions
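
The usage-driven idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `embed` is a hypothetical hashed bag-of-words stand-in for a real sentence encoder (the paper fine-tunes E5-base), `DIM` is arbitrary, and the example queries are invented. Only the averaging step and the similarity lookup reflect the description above.

```python
import math
import re
import zlib

DIM = 512  # illustrative embedding size, not taken from the paper


def embed(text):
    """Hypothetical stand-in encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z]+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def build_tool2vec(examples):
    """Represent each tool by the mean embedding of its example queries.

    Assumes every tool has at least one example query.
    """
    tool_vecs = {}
    for tool, queries in examples.items():
        embs = [embed(q) for q in queries]
        tool_vecs[tool] = [sum(col) / len(embs) for col in zip(*embs)]
    return tool_vecs


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def retrieve(query, tool_vecs, k=3):
    """Return the k tools whose vectors are most similar to the query."""
    q = embed(query)
    ranked = sorted(tool_vecs, key=lambda t: cosine(q, tool_vecs[t]), reverse=True)
    return ranked[:k]


# Invented example queries per tool (illustrative data only).
EXAMPLES = {
    "find_email_address": [
        "What is Anna's email address?",
        "Find the email for Bob",
    ],
    "create_calendar_event": [
        "Schedule a meeting with Sam tomorrow",
        "Add a lunch event to my calendar",
    ],
}
TOOL_VECS = build_tool2vec(EXAMPLES)
```

Calling `retrieve("What is Carol's email address?", TOOL_VECS, k=1)` ranks tools by how closely the new query resembles past queries for each tool. No tool description is consulted at any point.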

Architecture

The two-stage retrieval pipeline. Stage 1 (Fast Retriever) prunes tools using Tool2Vec or MLC. Stage 2 (ToolRefiner) refines the set using a fine-tuned encoder that takes the query and Tool2Vec embeddings as input.

Evaluation Highlights

+27.28-point Recall@3 improvement on ToolBench (I3 subset) compared to the standard description-based ToolBench retriever
+30.5-point Recall@3 improvement on the newly created ToolBank dataset (AWSBank subset) compared to description-based baselines
Achieves higher recall than the COLT retriever baseline on ToolBench I2 and I3 subsets when using ToolRefiner + MLC

Breakthrough Assessment

7/10

Significant performance gains over standard description-based retrieval by shifting to usage-based embeddings. Practical two-stage architecture. Limited by reliance on having query data for every tool.

⚙️ Technical Details

Problem Definition

Setting: Dense retrieval of relevant tools T from a large pool based on a user query q

Inputs: Natural language user query q

Outputs: A subset of relevant tools T_subset

Pipeline Flow

Input Query Processing
First Stage Retrieval (Tool2Vec or MLC) -> Pruned Candidates
Second Stage Refinement (ToolRefiner) -> Final Tool Set
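
The control flow of this pipeline can be sketched as follows. This is a simplified illustration and not the authors' implementation: `overlap_score` and `toy_refine` are hypothetical stand-ins for the first-stage retriever and ToolRefiner, and the `n_candidates` and `threshold` defaults are arbitrary. The one detail taken from the summary is that the refiner scores all candidates in a single call.

```python
from typing import Callable, List, Sequence


def two_stage_retrieve(
    query: str,
    tools: Sequence[str],
    first_stage_score: Callable[[str, str], float],
    refine: Callable[[str, Sequence[str]], Sequence[float]],
    n_candidates: int = 64,   # illustrative default, not from the paper
    threshold: float = 0.5,   # illustrative default, not from the paper
) -> List[str]:
    """Prune with a cheap scorer, then keep candidates the refiner accepts."""
    # Stage 1: rank every tool with the fast scorer and keep the top candidates.
    ranked = sorted(tools, key=lambda t: first_stage_score(query, t), reverse=True)
    candidates = ranked[:n_candidates]

    # Stage 2: a single call scores all candidates jointly.
    scores = refine(query, candidates)
    kept = [(t, s) for t, s in zip(candidates, scores) if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [t for t, _ in kept]


def overlap_score(query: str, tool: str) -> float:
    """Toy first stage: count tool-name parts that appear in the query."""
    words = set(query.lower().split())
    return float(sum(part in words for part in tool.split("_")))


def toy_refine(query: str, candidates: Sequence[str]) -> List[float]:
    """Toy refiner: fraction of each tool name's parts found in the query."""
    words = set(query.lower().split())
    scores = []
    for tool in candidates:
        parts = tool.split("_")
        scores.append(sum(p in words for p in parts) / len(parts))
    return scores


TOOLS = ["find_email_address", "send_email", "create_calendar_event", "get_weather"]
```

With `n_candidates=2` and `threshold=0.6`, the query "find the email address of anna" keeps `find_email_address` and `send_email` after the first stage, and only `find_email_address` survives refinement.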

System Modules

Tool2Vec Retriever (Retrieval)

Generate initial candidate tools using usage-based embeddings

Model or implementation: Fine-tuned E5-base

MLC Retriever (Retrieval)

Alternative first-stage retriever using multi-label classification

Model or implementation: DeBERTa-V3 base

ToolRefiner

Refine the candidate list by classifying relevance based on query-tool interaction

Model or implementation: DeBERTa-V3 xsmall

Novel Architectural Elements

Usage-driven embedding generation (Tool2Vec) where tool vectors are centroids of query clusters rather than description embeddings
ToolRefiner architecture that inputs pre-computed Tool2Vec embeddings alongside query text into a DeBERTa encoder for re-ranking

Modeling

Base Model: DeBERTa-V3 (base for MLC, xsmall for ToolRefiner)

Training Method: Supervised Fine-Tuning (Classification Loss)

Objective Functions:

Purpose: Optimize the classifier to distinguish relevant tools.

Formally: Binary Cross Entropy loss (implied by multi-label classification description).
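
For reference, the standard multi-label form of this loss is written out below. The summary only infers that this loss is used, so treat it as an assumption. The symbols are introduced here for illustration: $N$ is the number of tools scored, $y_i \in \{0, 1\}$ marks whether tool $i$ is relevant to the query, and $\hat{p}_i = \sigma(z_i)$ is the predicted relevance probability obtained from logit $z_i$.

```latex
\mathcal{L}_{\mathrm{BCE}}
  = -\frac{1}{N} \sum_{i=1}^{N}
    \Big[ y_i \log \hat{p}_i + (1 - y_i) \log\big(1 - \hat{p}_i\big) \Big],
\qquad
\hat{p}_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}
```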

Adaptation: Full fine-tuning

Trainable Parameters: Not reported in the paper

Training Data:

ToolBank: 8:2 training/validation split
ToolBench: Standard splits I1, I2, I3

Key Hyperparameters:

embedding_model: E5-base
MLC_model: DeBERTa-V3-base (86M params)
ToolRefiner_model: DeBERTa-V3-xsmall (22M params)

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolBench Retriever: Tool2Vec uses query-based embeddings instead of description-based embeddings to bridge the semantic gap
vs. ToolRerank: ToolRefiner processes all candidates in one forward pass (faster) and uses Tool2Vec embeddings, whereas ToolRerank processes tools individually using descriptions
vs. AnyTool: Tool2Vec/Refiner uses small specialized models (DeBERTa-xsmall) rather than large, high-latency LLMs like GPT-4
vs. EasyTool [not cited in paper]: EasyTool rewrites tool descriptions; Tool2Vec ignores descriptions entirely in favor of usage examples

Limitations

Relies on the availability of domain-specific tool retrieval data (queries associated with tools)
Performance on datasets with complex data types (e.g., PandasBank) is lower than on simpler domains
Tool2Vec embeddings require pre-computation based on existing query logs or synthetic data

Reproducibility

Code: https://github.com/SqueezeAILab/Tool2Vec

The ToolBank dataset is available on Hugging Face. COLT baseline results are taken from the original paper because its code was unavailable.

📊 Experiments & Results

Evaluation Setup

Tool retrieval given a natural language query

Benchmarks:

ToolBench: real-world API retrieval (subsets I1, I2, I3 with increasing complexity)
ToolBank [New]: domain-specific tool retrieval (NumPy, Pandas, AWS)

Metrics:

Recall@K (K=3, 5, 7)
nDCG@K
Statistical methodology: Not explicitly reported in the paper
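
Both metrics can be computed with a few lines of code. The sketch below assumes binary relevance (a tool is either in the ground-truth set or not) and the common log2 discount for nDCG. The paper's exact variant is not stated in this summary, so treat these definitions as the conventional ones and not as the authors' evaluation script.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant tools that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for tool in ranked[:k] if tool in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG@k with a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, tool in enumerate(ranked[:k])
        if tool in relevant
    )
    # Ideal ranking places all relevant tools first, up to k positions.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

For a ranking `["a", "b", "c", "d"]` with relevant set `{"a", "c", "e"}`, Recall@3 is 2/3 because two of the three relevant tools appear in the top three. nDCG@3 is lower than it would be for a ranking that placed both hits in the first two positions, since the hit at position three is discounted.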

Key Results

Comparison on the ToolBench dataset: ToolRefiner approaches outperform standard retrievers.
Benchmark	Metric	Baseline	This Paper	Δ
ToolBench I1	Recall@5	90.19	96.83	+6.64
ToolBench I3	Recall@3	54.07	81.35	+27.28
ToolBench I3	Recall@5	85.50	87.80	+2.30

Results on the new ToolBank dataset (NumpyBank, PandasBank, AWSBank).
Benchmark	Metric	Baseline	This Paper	Δ
NumpyBank	Recall@3	50.82	73.82	+23.00
AWSBank	Recall@3	41.92	72.42	+30.50

Experiment Figures

t-SNE visualization and cosine similarity analysis comparing Tool2Vec embeddings vs. Description embeddings.

Comparison of query naturalness between ToolBank (polished) and ToolBench.

Main Takeaways

Tool2Vec (usage-based embeddings) consistently outperforms description-based embeddings by bridging the semantic gap between query and tool representation.
The two-stage approach (ToolRefiner) provides additive gains over single-stage retrieval, especially when combined with MLC or Tool2Vec.
LLM-generated datasets (ToolBank) can be used to train small specialized models (DeBERTa-xsmall) that outperform larger generalist baselines.
MLC (Multi-Label Classification) alone is a surprisingly strong baseline, often outperforming dense retrieval methods on ToolBench.

📚 Prerequisite Knowledge

Prerequisites

Understanding of dense retrieval and vector embeddings
Familiarity with Function Calling/Tool Use in LLMs
Basic knowledge of multi-label classification

Key Terms

Tool2Vec: A method of representing tools by averaging the embeddings of user queries that successfully use those tools, rather than embedding the tool's text description

MLC: Multi-Label Classification, which frames tool retrieval as predicting a binary vector where 1 indicates a tool is relevant and 0 indicates it is not

Recall@K: A metric measuring the proportion of relevant items found in the top-K retrieved results

ToolRefiner: A second-stage model that takes the query and the embeddings of tools retrieved in the first stage to perform a more accurate binary classification of relevance

ToolBank: A new domain-specific tool retrieval dataset created by the authors using LLMs to generate natural queries and enforce tool co-occurrence

DeBERTa: Decoding-enhanced BERT with disentangled attention, a transformer model used here as the backbone for the classification and refinement tasks

dense retrieval: Finding relevant items by comparing vector representations (embeddings) of queries and items, typically using cosine similarity

nDCG: Normalized Discounted Cumulative Gain, a measure of ranking quality that considers the position of relevant items in the list
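
To make the MLC entry concrete, the sketch below shows the binary-vector encoding of a query's relevant tools and the reverse step of turning per-tool probabilities into a tool list. The function names, the tool vocabulary, and the 0.5 threshold are assumptions chosen for illustration and do not come from the paper.

```python
from typing import Iterable, List, Sequence


def to_multi_hot(relevant: Iterable[str], tool_vocab: Sequence[str]) -> List[int]:
    """Encode the relevant tools as a 0/1 vector aligned with tool_vocab."""
    relevant_set = set(relevant)
    return [1 if tool in relevant_set else 0 for tool in tool_vocab]


def decode_predictions(
    probs: Sequence[float],
    tool_vocab: Sequence[str],
    threshold: float = 0.5,  # illustrative cutoff, not from the paper
) -> List[str]:
    """Return the tools whose predicted probability meets the threshold."""
    return [tool for tool, p in zip(tool_vocab, probs) if p >= threshold]


# Hypothetical tool vocabulary; the index order defines the vector layout.
TOOL_VOCAB = ["find_email_address", "send_email", "create_calendar_event"]
```

For a query that needs `find_email_address` and `create_calendar_event`, the training target is `[1, 0, 1]`. At inference time a classifier emits one probability per position, and `decode_predictions` maps those back to tool names.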