Re-Invoke: Tool Invocation Rewriting for Zero-Shot Tool Retrieval

📝 Paper Summary

Modularized RAG pipeline, Agentic RAG pipeline

Re-Invoke improves zero-shot tool retrieval by using LLMs to enrich tool documents with synthetic queries and decompose complex user requests into specific intents before matching.

Core Problem

Accurately retrieving relevant tools from large, evolving toolsets is difficult because user queries are often ambiguous or verbose, and tool documentation is frequently vague or incomplete.

Why it matters:

LLMs have token limits that prevent including all available tools in the prompt context
Maintaining labeled datasets for constantly changing tool pools is impractical, making supervised training difficult
Ambiguous user contexts often lead standard retrievers to select irrelevant tools based on superficial similarities

Concrete Example: A user asks about 'improving French skills' while 'planning a trip to France'. A standard retriever selects a 'travel_assistant' tool because of the 'France' keyword and misses the actual intent, which calls for a language learning tool.

Key Novelty

Unsupervised Multi-View Tool Retrieval

Enhances tool documents offline by generating diverse synthetic user queries that the tool could answer, bridging the gap between technical descriptions and user language
Extracts clean, tool-specific intents from verbose user queries at inference time to remove irrelevant context
Ranks tools by aggregating similarity scores across multiple views (intents) rather than a single vector match

Architecture

Overview of the Re-Invoke pipeline showing the offline indexing phase and online retrieval phase.

Evaluation Highlights

Achieves 20% relative improvement in nDCG@5 for single-tool retrieval on ToolE datasets compared to state-of-the-art baselines
Achieves 39% relative improvement in nDCG@5 for multi-tool retrieval on ToolE datasets
Outperforms supervised retrievers (ToolLLM) in end-to-end agent pass rates on ToolBench without requiring any training data

Breakthrough Assessment

7/10

Strong practical contribution for zero-shot scenarios. It mitigates the 'vague documentation' problem without any training, although the core techniques (query expansion and query rewriting) are already well known in general IR.

⚙️ Technical Details

Problem Definition

Setting: Given a user query Q, retrieve the top-k most relevant tool documents from a large pool D, without using labeled query-tool pairs for training.

Inputs: User query Q, List of tool documents D

Outputs: List of top-k retrieved tool documents

Pipeline Flow

Offline: Query Generator (Enrich tool docs with synthetic queries)
Online: Intent Extractor (Parse user query into specific intents)
Online: Multi-view Similarity Ranking (Match intents to enriched docs)
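
The three stages above can be wired together roughly as follows. This is a minimal sketch, not the paper's implementation: `ToolDoc`, `index_offline`, `retrieve_online` and the three `stub_*` functions are hypothetical names, and the stubs stand in for the two LLM calls and the embedding-based ranker.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToolDoc:
    name: str
    description: str
    synthetic_queries: List[str] = field(default_factory=list)


def index_offline(tools: List[ToolDoc],
                  generate_queries: Callable[[ToolDoc], List[str]]) -> List[ToolDoc]:
    """Offline phase: attach synthetic queries to every tool document."""
    for tool in tools:
        tool.synthetic_queries = generate_queries(tool)
    return tools


def retrieve_online(user_query: str,
                    tools: List[ToolDoc],
                    extract_intents: Callable[[str], List[str]],
                    rank: Callable[[List[str], List[ToolDoc]], List[str]],
                    k: int = 5) -> List[str]:
    """Online phase: extract intents, rank enriched documents, keep the top k."""
    intents = extract_intents(user_query)
    return rank(intents, tools)[:k]


# Hypothetical stand-ins so the sketch runs without an LLM or embedding model.
def stub_generator(tool: ToolDoc) -> List[str]:
    return [f"how do I use {tool.name}"]


def stub_extractor(query: str) -> List[str]:
    # Naive split on "and"; a real extractor would call an LLM.
    return [part.strip() for part in query.split(" and ") if part.strip()]


def stub_ranker(intents: List[str], tools: List[ToolDoc]) -> List[str]:
    def overlap(tool: ToolDoc) -> int:
        tokens = " ".join([tool.description] + tool.synthetic_queries).lower().split()
        return sum(word in tokens for intent in intents for word in intent.lower().split())
    return [tool.name for tool in sorted(tools, key=overlap, reverse=True)]
```

Passing the generator, extractor and ranker as callables keeps the offline and online phases independent, so each stage can be swapped for a different model without touching the others.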

System Modules

Query Generator (Input Processing)

Enrich tool documentation by generating diverse potential user queries

Model or implementation: text-bison@001 (Google Vertex AI) or GPT-3.5/Mistral
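
A rough sketch of how the offline enrichment step might look. The prompt wording, the function names and the "Example query" view format are all assumptions, since this summary does not quote the paper's prompt; `llm` is any callable that maps a prompt string to a completion string.

```python
import re
from typing import Callable, List


def build_generation_prompt(tool_name: str, description: str, n_queries: int) -> str:
    # Hypothetical wording; not the prompt used in the paper.
    return (
        f"Tool name: {tool_name}\n"
        f"Tool description: {description}\n"
        f"Write {n_queries} different user questions this tool could answer, "
        "one per line."
    )


def parse_queries(llm_output: str) -> List[str]:
    """Split a completion into queries, dropping list markers and blank lines."""
    queries = []
    for line in llm_output.splitlines():
        cleaned = re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line).strip()
        if cleaned:
            queries.append(cleaned)
    return queries


def augmented_views(tool_name: str, description: str,
                    llm: Callable[[str], str], n_queries: int = 3) -> List[str]:
    """Return one document view per synthetic query (description + query)."""
    prompt = build_generation_prompt(tool_name, description, n_queries)
    return [f"{description}\nExample query: {q}" for q in parse_queries(llm(prompt))]
```

Because this runs offline, the cost of the LLM calls is paid once per tool rather than once per user request.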

Intent Extractor (Input Processing)

Extract core tool-related requests from verbose user input

Model or implementation: text-bison@001 (Google Vertex AI) or GPT-3.5/Mistral
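
The online extraction step could be sketched as below. The prompt text is hypothetical, and the fallback to the raw query when the model returns nothing is a design choice made for this sketch, not something stated in the summary.

```python
from typing import Callable, List


def build_intent_prompt(user_query: str) -> str:
    # Hypothetical wording; not the prompt used in the paper.
    return (
        "List each distinct request in the message below that a tool could "
        "fulfil. Drop background details. One request per line.\n\n"
        f"Message: {user_query}"
    )


def extract_intents(user_query: str, llm: Callable[[str], str]) -> List[str]:
    """Return the tool-related intents found in a verbose user query."""
    completion = llm(build_intent_prompt(user_query))
    intents = [line.strip(" -\t") for line in completion.splitlines()]
    intents = [intent for intent in intents if intent]
    # Assumed fallback: keep the raw query so retrieval still has an input.
    return intents or [user_query]
```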

Multi-view Similarity Scorer

Match extracted intents against enriched tool documents

Model or implementation: textembedding-gecko@003 (Google Vertex AI)

Novel Architectural Elements

Multi-view similarity ranking algorithm: Instead of a single query-doc match, it matches n extracted intents against m synthetic-query-augmented document views and aggregates the rankings.
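
A toy illustration of matching n intents against m document views. Two things here are assumptions: `embed` is a bag-of-words stand-in for a real embedding model, and the aggregation rule (each tool keeps its best cosine similarity over all intent/view pairs) is one plausible choice, since this summary does not spell out the exact rule the paper uses.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_tools(intents: List[str],
               tool_views: Dict[str, List[str]],
               k: int = 5) -> List[str]:
    """Rank tools by their best similarity over every (intent, view) pair."""
    intent_vecs = [embed(intent) for intent in intents]
    scores = {}
    for tool, views in tool_views.items():
        view_vecs = [embed(view) for view in views]
        scores[tool] = max(
            (cosine(i, v) for i in intent_vecs for v in view_vecs),
            default=0.0,  # assumed score for a tool with no views or no intents
        )
    return sorted(scores, key=scores.get, reverse=True)[:k]


views = {
    "travel_assistant": ["plan a trip to france", "book a hotel in paris"],
    "language_tutor": ["practice french vocabulary",
                       "improve my french speaking skills"],
}
raw_query = "improving french skills while planning a trip to france"
print(rank_tools([raw_query], views))                # verbose query as a single view
print(rank_tools(["improve french skills"], views))  # extracted intent
```

With this toy embedding, the verbose query ranks `travel_assistant` first while the extracted intent ranks `language_tutor` first, which mirrors the failure case described in the Concrete Example.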

Modeling

Base Model: Google Vertex AI text-bison@001 (also tested with GPT-3.5 and Mistral-7B)

Compute: Not reported in the paper (Inference-only approach)

Comparison to Prior Work

vs. ToolLLM: Completely unsupervised (no training data needed) vs. supervised training
vs. HyDE: Generates queries from docs (offline) rather than docs from queries (online), avoiding concept drift in hypothetical documents
vs. Standard Dense Retrieval: Extracts specific intents and uses synthetic query augmentation to bridge the semantic gap between user language and technical documentation

Limitations

Relies on the quality of the LLM for synthetic query generation; hallucinations could introduce noise
Simple zero-shot prompting used for diversity; more sophisticated generation methods could be explored
Intent extraction relies purely on LLM internal knowledge without feedback from tool execution results

📊 Experiments & Results

Evaluation Setup

Tool retrieval accuracy and downstream agent success rate

Benchmarks:

ToolBench (instruction following with 16k+ APIs)
ToolE (tool selection, single-tool and multi-tool)

Metrics:

nDCG@5
Pass rate (for end-to-end agent)
Statistical methodology: Not explicitly reported in the paper
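
For reference, nDCG@k with binary relevance can be computed as in the sketch below. It uses the standard log2 discount; whether the paper uses binary or graded relevance is not stated in this summary, so binary labels are an assumption.

```python
import math
from typing import List, Set


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain: gain at rank r is divided by log2(r + 1)."""
    return sum(gain / math.log2(pos + 2) for pos, gain in enumerate(gains))


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int = 5) -> float:
    """nDCG@k for a ranked list of tool names against a set of relevant tools."""
    gains = [1.0 if item in relevant else 0.0 for item in ranked[:k]]
    ideal_dcg = dcg([1.0] * min(len(relevant), k))
    return dcg(gains) / ideal_dcg if ideal_dcg else 0.0
```

A relevant tool at rank 1 scores 1.0; the same tool at rank 2 scores 1 / log2(3), about 0.63, which is how the metric penalizes correct tools that are ranked too low.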

Key Results

Benchmark	Metric	Baseline	This Paper	Δ (absolute)
ToolE (single-tool)	nDCG@5	0.6522	0.7821	+0.1299
ToolE (multi-tool)	nDCG@5	0.5296	0.7231	+0.1935
ToolBench (I1 category)	nDCG@5	0.5962	0.6110	+0.0148

End-to-end evaluation showing Re-Invoke helps agents complete tasks better than supervised retrievers.

Benchmark	Metric	Baseline	This Paper	Δ (absolute)
ToolBench (Average)	Pass Rate (%)	52.63	56.07	+3.44
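
The Δ column reports absolute differences, while the Evaluation Highlights quote relative improvements. The conversion is simple arithmetic; the snippet below applies it to the rows of the table (the variable and function names are illustrative only).

```python
rows = [
    ("ToolE (single-tool)", "nDCG@5", 0.6522, 0.7821),
    ("ToolE (multi-tool)", "nDCG@5", 0.5296, 0.7231),
    ("ToolBench (I1 category)", "nDCG@5", 0.5962, 0.6110),
    ("ToolBench (Average)", "Pass Rate", 52.63, 56.07),
]


def relative_gain(baseline: float, ours: float) -> float:
    """Relative improvement of `ours` over `baseline`, as a fraction."""
    return (ours - baseline) / baseline


for name, metric, baseline, ours in rows:
    print(f"{name} {metric}: {100 * relative_gain(baseline, ours):.1f}% relative")
```

For the two ToolE rows this gives roughly 19.9% and 36.5%. The multi-tool value is slightly below the 39% quoted in the highlights, which may be measured against a different baseline than the one tabulated here.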

Experiment Figures

A conceptual failure case of standard retrieval vs Re-Invoke

Main Takeaways

Significant improvements in multi-tool scenarios suggest the 'Intent Extractor' is highly effective at decomposing complex user requests
The 'Query Generator' component (augmenting docs with synthetic queries) provides robust gains across all datasets, confirming that bridging the semantic gap between docs and queries is crucial
Zero-shot, training-free approach outperforms supervised baselines, offering a scalable solution for rapidly evolving tool ecosystems
Consistent performance across different LLM backbones (text-bison@001, GPT-3.5, Mistral-7B) suggests the method is largely model-agnostic

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of dense vector retrieval (embeddings)
Familiarity with LLM prompting (Zero-shot, Few-shot)
Understanding of nDCG (Normalized Discounted Cumulative Gain) for ranking evaluation

Key Terms

nDCG: Normalized Discounted Cumulative Gain, a measure of ranking quality that rewards placing relevant items near the top of the list

Recall@k: The fraction of all relevant items that appear in the top-k retrieved results
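
As a small illustration of the definition, Recall@k can be computed as follows. Returning 0.0 when there are no relevant items is a convention chosen for this sketch.

```python
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Share of the relevant items that appear among the first k results."""
    if not relevant:
        return 0.0  # assumed convention for an empty relevance set
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)
```

Unlike nDCG, this ignores where in the top k a relevant tool appears; it only checks whether the tool was retrieved at all.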

BM25: Best Matching 25, a lexical ranking function used in information retrieval that scores documents against a search query using term frequency, inverse document frequency, and document length normalization

HyDE: Hypothetical Document Embeddings, a technique where an LLM generates a hypothetical document that answers the query, and the embedding of that document is then used for retrieval matching

Zero-shot: The ability of a model to perform a task (here, retrieval) without seeing any specific training examples for that task

Round-trip consistency: A quality check where a synthetic query generated from a document should successfully retrieve that same document
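
A minimal sketch of such a check, assuming a `retrieve` callable that returns ranked tool names for a query. The function name and the top-k criterion are illustrative; this summary does not say how, or whether, the check is applied in the paper's pipeline.

```python
from typing import Callable, List


def round_trip_filter(tool_name: str,
                      synthetic_queries: List[str],
                      retrieve: Callable[[str], List[str]],
                      k: int = 1) -> List[str]:
    """Keep only synthetic queries whose top-k results contain the source tool."""
    return [q for q in synthetic_queries if tool_name in retrieve(q)[:k]]
```

A larger k makes the filter more permissive: a query is kept as long as its source tool appears anywhere in the first k results.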

Pass rate: The percentage of instructions successfully completed by an agent within a limited budget