Tools are under-documented: Simple Document Expansion Boosts Tool Retrieval

📝 Paper Summary

Tool retrieval, Tool profiling

The paper introduces a pipeline to enrich sparse tool documentation with structured fields (usage scenarios, limitations) and trains specialized dense retrievers and rerankers that significantly outperform baselines.

Core Problem

Tool retrieval fails because documentation is often incomplete, heterogeneous, and semantically misaligned with user queries, leading to a large semantic gap.

Why it matters:

Current benchmarks reveal 41.6% of tool documents lack clear functional statements or usage contexts, forcing LLMs to guess parameters.
Inconsistent phrasing (e.g., 7 ways to describe the same function) complicates retrieval for ambiguous user queries.
Prior work focuses on query expansion or architecture changes, overlooking the root cause: the flawed underlying documentation data.

Concrete Example: In the ToolRet dataset, the same function is described in seven distinct formulations across sources. Some datasets like 'mnms' lack even a basic description field, making it impossible for retrievers to match a user query like 'find a restaurant' to the correct API.

Key Novelty

Tool-DE (Tool-Document Expansion) Framework

Systematically enriches raw tool documentation using a low-cost LLM pipeline to generate structured fields: function description, 'when-to-use', limitations, and tags.
Creates specialized large-scale training corpora (50k for retrieval, 200k for reranking) based on these enriched documents.
Trains dedicated models (Tool-Embed and Tool-Rank) specifically optimized for the enriched document structure.
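To make the structured fields concrete, the following is a minimal Python sketch of what an enriched tool profile might look like. The class name `ToolProfile`, its field names, the `to_index_text` helper, and the example tool are all hypothetical illustrations; the summary does not give the paper's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class ToolProfile:
    """Hypothetical container for one expanded tool document."""

    name: str
    raw_doc: str                    # original, possibly sparse documentation
    function_description: str = ""  # what the tool does
    when_to_use: str = ""           # usage scenarios
    limitations: str = ""           # known constraints
    tags: list[str] = field(default_factory=list)

    def to_index_text(self) -> str:
        """Flatten the profile into one string a retriever could index."""
        parts = [
            f"name: {self.name}",
            # Fall back to the raw doc when no expanded description exists.
            f"description: {self.function_description or self.raw_doc}",
        ]
        if self.when_to_use:
            parts.append(f"when to use: {self.when_to_use}")
        if self.limitations:
            parts.append(f"limitations: {self.limitations}")
        if self.tags:
            parts.append("tags: " + ", ".join(self.tags))
        return "\n".join(parts)


profile = ToolProfile(
    name="restaurant_search",  # made-up example tool
    raw_doc="search",
    function_description="Finds restaurants near a given location.",
    when_to_use="The user wants places to eat, optionally filtered by cuisine.",
    limitations="Does not make reservations.",
    tags=["food", "local search"],
)
index_text = profile.to_index_text()
```

Flattening to a single labeled string is only one possible design; it keeps the extra fields visible to both sparse and dense retrievers without changing their input format.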

Architecture

The four-stage pipeline for constructing Tool-DE: Expansion, Judgement, Refinement, and Human Validation.

Evaluation Highlights

+10.23 NDCG@10 improvement by Tool-Embed-4B over the MTEB SoTA open-source model (Qwen3-Embedding-8B) on the Tool-DE benchmark.
Tool-Rank-4B achieves state-of-the-art performance with 56.44 NDCG@10, improving by +4.21 over the first-stage retriever.
Document expansion alone raises the zero-shot Recall@10 of the sparse retriever BM25s by +8.69.

Breakthrough Assessment

7/10

Strong practical contribution by addressing the data quality bottleneck in tool retrieval. The proposed pipeline and models show significant gains, though the core technique (LLM-based document expansion) is a known strategy applied to a new domain.

⚙️ Technical Details

Problem Definition

Setting: Retrieving the most relevant tools from a large repository given a user query, where tool documentation is sparse or incomplete.

Inputs: User query q and a repository of tool documents D

Outputs: Ranked list of tools relevant to q

Pipeline Flow

Data Pipeline: Raw Docs → LLM Expansion (Qwen3-32B) → Judgement (Llama-3.1-70B) → Refinement (GPT-4o) → Human Validation → Enriched Corpus
Inference Pipeline: Query → Tool-Embed (Dense Retrieval) → Top-K Candidates → Tool-Rank (Reranker) → Final List
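The inference flow above is a standard retrieve-then-rerank pattern. The sketch below shows its control flow only: `toy_embed` and `toy_rerank` are hypothetical stand-ins for Tool-Embed and Tool-Rank (a bag-of-words vector and a word-overlap count), and the three tools are invented for illustration.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query: str,
    docs: dict[str, str],
    embed: Callable[[str], Vector],
    rerank_score: Callable[[str, str], float],
    top_k: int = 10,
) -> list[str]:
    q_vec = embed(query)
    # Stage 1: dense retrieval, rank every tool by vector similarity.
    first_stage = sorted(
        docs, key=lambda tid: cosine(q_vec, embed(docs[tid])), reverse=True
    )
    candidates = first_stage[:top_k]
    # Stage 2: re-score only the top-k candidates with the (costlier) reranker.
    return sorted(
        candidates, key=lambda tid: rerank_score(query, docs[tid]), reverse=True
    )


# Toy stand-ins (assumptions, not the paper's models).
VOCAB = ["restaurant", "weather", "find", "forecast", "food"]


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def toy_rerank(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


docs = {
    "restaurant_search": "find a restaurant serving food nearby",
    "weather_api": "weather forecast for a city",
    "recipe_lookup": "find food recipes",
}
ranked = retrieve_then_rerank("find a restaurant", docs, toy_embed, toy_rerank, top_k=2)
```

In a real system the document embeddings would be computed once offline and stored in a vector index rather than recomputed per query as in this sketch.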

System Modules

Document Expander (Data Construction)

Generate structured fields (description, tags, when-to-use, limitations) from raw docs

Model or implementation: Qwen3-32B (Reasoning Mode)

Judgement & Refinement (Data Construction)

Verify generated fields for faithfulness and refine if necessary

Model or implementation: Llama-3.1-70B (Judge) and GPT-4o (Refiner)
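The expand, judge, refine sequence described by these two modules can be sketched as a small control-flow function. Everything here is an assumption made for illustration: the function name `build_enriched_doc`, the callable signatures, and the stub implementations, which only check for empty fields rather than performing the LLM-based faithfulness judgement the paper describes.

```python
from typing import Callable

Fields = dict[str, str]


def build_enriched_doc(
    raw_doc: str,
    expand: Callable[[str], Fields],
    judge: Callable[[str, Fields], bool],
    refine: Callable[[str, Fields], Fields],
) -> tuple[Fields, bool]:
    """Return (fields, was_refined) for one raw tool document."""
    fields = expand(raw_doc)          # stage 1: generate structured fields
    if judge(raw_doc, fields):        # stage 2: accept if the judge approves
        return fields, False
    return refine(raw_doc, fields), True  # stage 3: repair rejected outputs


# Stub stages standing in for the LLM calls (hypothetical, for illustration).
def stub_expand(raw_doc: str) -> Fields:
    text = raw_doc.strip()
    return {
        "description": text,
        "when_to_use": f"Use when you need to {text.lower()}" if text else "",
    }


def stub_judge(raw_doc: str, fields: Fields) -> bool:
    return all(fields.values())  # reject any empty field


def stub_refine(raw_doc: str, fields: Fields) -> Fields:
    return {key: value or "unspecified" for key, value in fields.items()}


ok_fields, ok_refined = build_enriched_doc(
    "Search restaurants", stub_expand, stub_judge, stub_refine
)
bad_fields, bad_refined = build_enriched_doc(
    "   ", stub_expand, stub_judge, stub_refine
)
```

Routing only judge-rejected documents to the refiner is what would keep the use of the stronger, more expensive model small.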

Tool-Embed

Retrieve candidate tools using dense vector similarity

Model or implementation: Qwen3-Embedding-4B (fine-tuned)

Tool-Rank

Re-order candidates based on relevance probability

Model or implementation: Qwen3-Reranker-4B (LoRA fine-tuned)

Modeling

Base Model: Qwen3-Embedding-4B (Retriever), Qwen3-Reranker-4B (Reranker)

Training Method: Contrastive Learning (Retriever) and Cross-Entropy Classification (Reranker)

Objective Functions:

Purpose (retriever): Maximize similarity between the query and its positive tool while minimizing similarity to negative tools.

Formally: InfoNCE loss

Purpose (reranker): Classify query-document pairs as relevant or irrelevant.

Formally: Cross-entropy loss on 'true'/'false' tokens
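As a rough illustration of the two objectives named above, the sketch below computes both losses for a single training example using only the standard library. The function names, the temperature value of 0.05, and the example scores are assumptions; the summary does not report the paper's exact formulation or temperature.

```python
import math


def info_nce(pos_score: float, neg_scores: list[float], temperature: float = 0.05) -> float:
    """InfoNCE for one query: -log softmax of the positive among all candidates.

    Scores are query-document similarities (e.g. cosine); temperature is assumed.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]


def true_false_cross_entropy(true_logit: float, false_logit: float, is_relevant: bool) -> float:
    """Cross-entropy over the two label tokens 'true' and 'false'.

    The logits are the model's scores for emitting each token after reading
    the (query, document) pair.
    """
    m = max(true_logit, false_logit)
    log_z = m + math.log(math.exp(true_logit - m) + math.exp(false_logit - m))
    target = true_logit if is_relevant else false_logit
    return log_z - target


# One positive and five negatives, mirroring negative_samples = 5.
good_retriever_loss = info_nce(0.9, [0.1] * 5)
bad_retriever_loss = info_nce(0.1, [0.9] * 5)
confident_correct = true_false_cross_entropy(4.0, -4.0, is_relevant=True)
confident_wrong = true_false_cross_entropy(4.0, -4.0, is_relevant=False)
```

At inference time, a reranker trained this way can rank candidates by the softmax probability assigned to the 'true' token.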

Adaptation: Full-parameter tuning (Retriever); LoRA (Reranker)

Trainable Parameters: Full params for Tool-Embed; LoRA rank=32 for Tool-Rank

Training Data:

50k instances for retriever training
200k instances for reranker training
Derived from 35 tool-use datasets in ToolRet

Key Hyperparameters:

lora_rank: 32
lora_alpha: 64
dropout: 0.1
epochs: 1
negative_samples: 5
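To give a feel for what lora_rank = 32 and lora_alpha = 64 imply, the sketch below applies the standard LoRA arithmetic: a rank-r adapter on a d_in × d_out weight adds r · (d_in + d_out) trainable parameters, and the low-rank update is scaled by alpha / r. The hidden size of 1024 is an illustrative assumption, not the actual dimension of Qwen3-Reranker-4B, and the helper names are hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha: float, rank: int) -> float:
    """Multiplier applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


RANK, ALPHA = 32, 64   # values reported in the hyperparameter list
HIDDEN = 1024          # illustrative square projection size (assumption)

added = lora_trainable_params(HIDDEN, HIDDEN, RANK)
full = HIDDEN * HIDDEN
fraction = added / full            # share of the frozen matrix's parameter count
scaling = lora_scaling(ALPHA, RANK)
```

For this illustrative matrix the adapter trains 65,536 parameters against 1,048,576 frozen ones (6.25%), and the update is scaled by 2.0.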

Compute: Two NVIDIA A100 GPUs (80GB each)

Comparison to Prior Work

vs. ToolRet Baselines: Tool-DE enriches the documents themselves with structured fields (limitations, usage) rather than just indexing raw text.
vs. Query Expansion: Tool-DE expands the *documents* offline, avoiding inference-time latency of query expansion.
vs. General Retrievers (GritLM, E5): Tool-Embed is fine-tuned specifically on the enriched tool profiles.

Limitations

Relies on LLMs (Qwen, Llama, GPT-4o) for expansion, which may still hallucinate despite verification steps.
Expansion cost scales linearly with the size of the tool repository.
Performance depends on the quality of the base LLM used for expansion.

Reproducibility

Code: https://github.com/EIT-NLP/Tool-DE

The training data (Tool-Embed-Train, Tool-Rank-Train) is also released. The base models (Qwen3) are open-weight. The refinement step uses closed-source GPT-4o, but only for ~1.5% of the data.

📊 Experiments & Results

Evaluation Setup

Retrieve relevant tools from a corpus of ~43k tools given user queries.

Benchmarks:

Tool-DE: tool retrieval over expanded docs [New]
ToolRet: tool retrieval over raw docs

Metrics:

NDCG@10
Recall@10
Completeness@10

Statistical methodology: Not explicitly reported in the paper
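The three metrics listed above can be sketched for a single query as follows. This assumes binary relevance labels, and it assumes Completeness@K is 1 only when every relevant tool appears in the top K, which is how the metric is commonly defined for tool retrieval; the summary itself does not spell out the definitions. The function names and the example ranking are hypothetical.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of relevant tools that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for tool in ranked[:k] if tool in relevant)
    return hits / len(relevant)


def completeness_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """1.0 only if every relevant tool is in the top k (assumed definition)."""
    return 1.0 if relevant and relevant <= set(ranked[:k]) else 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """NDCG with binary gains: each hit contributes 1 / log2(position + 1)."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, tool in enumerate(ranked[:k])
        if tool in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


ranked = ["a", "x", "b", "y"]   # system output, best first
relevant = {"a", "b", "c"}      # ground-truth tools for the query

recall = recall_at_k(ranked, relevant)
complete = completeness_at_k(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant)
```

These functions return values in [0, 1]; the figures reported in this summary (for example 56.44 NDCG@10) appear to be the same quantities averaged over queries and expressed on a 0 to 100 scale.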

Key Results

Main comparison on the Tool-DE benchmark: Tool-Embed-4B outperforms general-purpose embeddings and non-expanded baselines.

Benchmark	Metric	Baseline	This Paper	Δ
Tool-DE	NDCG@10	46.21	52.23	+6.02
Tool-DE	Recall@10	57.52	63.13	+5.61

Reranking: applying Tool-Rank on top of the retrieval results yields further improvements.

Benchmark	Metric	Baseline	This Paper	Δ
Tool-DE	NDCG@10	52.23	56.44	+4.21

Ablation: training on expanded documents versus original documents.

Benchmark	Metric	Baseline	This Paper	Δ
Tool-DE	NDCG@10	46.80	52.23	+5.43
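As a quick consistency check on the table, the snippet below recomputes each Δ from the baseline and reported scores and also derives the relative gain. The row labels are shorthand introduced here; the numbers are copied from the table.

```python
# (label, baseline, this paper, delta as reported in the table)
rows = [
    ("main NDCG@10", 46.21, 52.23, 6.02),
    ("main Recall@10", 57.52, 63.13, 5.61),
    ("rerank NDCG@10", 52.23, 56.44, 4.21),
    ("ablation NDCG@10", 46.80, 52.23, 5.43),
]

# Absolute gain in points, rounded to the table's two decimals.
deltas = [round(ours - base, 2) for _, base, ours, _ in rows]

# Relative gain in percent of the baseline score.
relative_gains = {
    label: round(100 * (ours - base) / base, 1) for label, base, ours, _ in rows
}

all_consistent = all(
    delta == reported for delta, (_, _, _, reported) in zip(deltas, rows)
)
```

All four reported deltas match the recomputed values, so the table is internally consistent.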

Experiment Figures

Comparison of reranker performance on ToolRet (original) vs. Tool-DE (expanded) benchmarks.

Main Takeaways

Document expansion universally improves retrieval performance for both sparse (BM25s) and dense models.
Training specifically on expanded data (Tool-Embed) yields larger gains than just using expanded data at inference time with generic models.
The 'when-to-use' and 'limitations' fields are particularly valuable for disambiguating tools with similar functions.
Reranking with Tool-Rank provides substantial gains over pure retrieval, effectively leveraging the richer context in expanded profiles.

📚 Prerequisite Knowledge

Prerequisites

Information Retrieval (dense retrieval vs. sparse retrieval)
Reranking methodologies
Tool use in LLMs (API calling)

Key Terms

NDCG@K: Normalized Discounted Cumulative Gain at K, a measure of ranking quality that considers the position of relevant items

Tool-DE: Tool-Document Expansion, the proposed benchmark and framework for enriching tool documentation

InfoNCE loss: A contrastive loss function used to train dense retrievers by maximizing similarity between positive pairs and minimizing it for negatives

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices

Hard negatives: Retrieved items that are irrelevant but semantically similar to the query, used to train models to distinguish subtle differences
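One common way to obtain hard negatives is to take the highest-ranked results of a first-stage retriever that are not labeled relevant. The sketch below illustrates that idea only; the summary does not describe the paper's actual mining procedure, and the function name, tool IDs, and ranking are hypothetical. The default of five negatives mirrors the negative_samples hyperparameter.

```python
def mine_hard_negatives(
    ranked_ids: list[str],
    positive_ids: set[str],
    num_negatives: int = 5,
) -> list[str]:
    """Return the top-ranked candidates that are not labeled relevant."""
    negatives = [tool_id for tool_id in ranked_ids if tool_id not in positive_ids]
    return negatives[:num_negatives]


# First-stage ranking for one query, best first (invented IDs).
ranked_ids = ["t3", "t1", "t7", "t2", "t9", "t4", "t8"]
positive_ids = {"t1"}

hard_negatives = mine_hard_negatives(ranked_ids, positive_ids)
training_instance = {
    "query": "find a restaurant",
    "positive": "t1",
    "negatives": hard_negatives,
}
```

Because unlabeled top-ranked tools can sometimes be genuinely relevant, pipelines of this kind often add a filtering step to avoid false negatives.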