Data-Efficient Massive Tool Retrieval: A Reinforcement Learning Approach for Query-Tool Alignment with Language Models

📝 Paper Summary

Tool retrieval, Benchmark datasets, Query alignment

MTRB (Massive Tool Retrieval Benchmark) is a new benchmark for retrieving tools from massive repositories. It is accompanied by Query-Tool Alignment (QTA), a framework that uses reinforcement learning on limited data to align user queries with tool documentation.

Core Problem

Existing retrieval methods struggle with massive tool repositories due to context length limits and the semantic gap between user queries and technical tool documentation.

Why it matters:

Real-world applications involve thousands of tools whose documentation can exceed 100,000 characters, far more than fits in the context windows of many LLMs (e.g., Llama-2's 4096 tokens)
Standard fine-tuning methods like Sentence-BERT require large annotated datasets, which are scarce for new tool domains
Current benchmarks focus on tool usage (planning/calling) rather than the preliminary step of retrieving the correct tools from a large database

Concrete Example: A user query 'give me a movie cover from the Harry Potter collection' requires coordinating multiple tools like 'GET /search/collection', 'GET /collection/{id}', and 'GET /movie/{id}/images'. Standard retrievers fail to link the abstract request to these specific API endpoints without extensive training data.

Key Novelty

Query-Tool Alignment (QTA) with Direct Preference Optimization

Uses an LLM to rewrite user queries into forms that better match tool documentation, bridging the semantic gap
Trains the rewriter with Direct Preference Optimization (DPO) on preference pairs derived from retrieval ranking feedback, rather than requiring a separate reward model
Specifically designed for low-resource settings, showing effectiveness with as few as one annotated training sample

Architecture

The QTA framework pipeline, including query rewriting, retrieval ranking, and the DPO training process.

Evaluation Highlights

+93.28% relative improvement in Sufficiency@5 on the MTRB-RestBench subset compared to baseline methods (16.67 to 32.22)
+78.53% relative improvement in Sufficiency@5 on MTRB-RestBench using just a single annotated training sample (16.67 to 29.76)
Consistently outperforms state-of-the-art models in top-5 and top-10 retrieval tasks across the full MTRB benchmark

Breakthrough Assessment

7/10

The paper delivers significant improvements in low-resource settings and a much-needed benchmark for massive tool retrieval. The approach is data-efficient but relies on existing retrieval backends.

⚙️ Technical Details

Problem Definition

Setting: Retrieving a small subset of essential tools (Golden Tools GT) from a large tool database T containing M tools based on a user query q.

Inputs: User query q and a massive tool database T (tool names + descriptions)

Outputs: A ranked list of tools, where the top-k should contain the Golden Tools

Pipeline Flow

LLM Rewriter (rewrites user query q into q_re)
Retrieval Model (uses q_re to search tool database)
Ranking Function (evaluates retrieval quality to generate DPO signals)
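
The flow above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: `rewrite_query` and `retrieve` are hypothetical stand-ins (a fixed string expansion instead of an LLM, word overlap instead of a frozen dense retriever), and the tool descriptions are invented for the example.

```python
from typing import Dict, List

def rewrite_query(query: str) -> str:
    """Hypothetical rewriter; in QTA this role is played by an LLM."""
    return query + " search collection movie images"

def retrieve(query: str, tools: Dict[str, str], k: int) -> List[str]:
    """Toy stand-in for a frozen retriever: rank tools by word overlap."""
    q_tokens = set(query.lower().split())

    def overlap(name: str) -> int:
        return len(q_tokens & set(tools[name].lower().split()))

    # sorted() is stable, so ties keep the database order.
    return sorted(tools, key=overlap, reverse=True)[:k]

# Invented descriptions for the endpoints named in the concrete example.
tools = {
    "GET /search/collection": "search for a collection by name",
    "GET /collection/{id}": "get collection details by id",
    "GET /movie/{id}/images": "get the images and posters of a movie",
    "GET /tv/popular": "list popular tv shows",
}
golden = {"GET /search/collection", "GET /collection/{id}", "GET /movie/{id}/images"}

q = "give me a movie cover from the Harry Potter collection"
q_re = rewrite_query(q)                    # stage 1: rewrite
top = retrieve(q_re, tools, k=3)           # stage 2: retrieve
hits = [t for t in top if t in golden]     # stage 3: input to the ranking score
print(top, len(hits))
```

In the actual framework, the stage 3 signal is used to compare several candidate rewrites of the same query rather than to score a single one.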

System Modules

LLM Rewriter

Rewrites the user query to align with tool documentation semantics

Model or implementation: Not explicitly specified (likely Llama-series based on context)

Retrieval Model

Retrieves tools based on the rewritten query

Model or implementation: Frozen retrieval model (e.g., Sentence-BERT, Contriever)

Ranking Function

Calculates scores for rewritten queries based on how well they retrieve ground truth tools

Model or implementation: Algorithmic (Modified DCG)

Novel Architectural Elements

Utilization of hidden ranking information from a frozen retrieval model to construct preference pairs (chosen/rejected) for DPO training of a query rewriter
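
One way to turn ranking feedback into chosen/rejected pairs is sketched below. The score is a plain DCG over golden-tool hits, which is an assumption made for illustration; the paper's modified DCG may differ. The names `dcg_score` and `build_preference_pair` are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

def dcg_score(ranked_tools: List[str], golden: Set[str]) -> float:
    """Plain DCG: a golden tool at 0-based rank i contributes 1 / log2(i + 2)."""
    return sum(
        1.0 / math.log2(i + 2)
        for i, tool in enumerate(ranked_tools)
        if tool in golden
    )

def build_preference_pair(
    rankings: Dict[str, List[str]], golden: Set[str]
) -> Tuple[str, str]:
    """Return (chosen, rejected): the rewrites with the highest and lowest score."""
    ordered = sorted(rankings, key=lambda r: dcg_score(rankings[r], golden))
    return ordered[-1], ordered[0]

# Each candidate rewrite maps to the tool ranking the frozen retriever returned.
golden = {"A", "B"}
rankings = {
    "rewrite_1": ["A", "B", "C"],  # both golden tools ranked first
    "rewrite_2": ["C", "A", "D"],  # one golden tool, at rank 2
    "rewrite_3": ["C", "D", "E"],  # no golden tool retrieved
}
chosen, rejected = build_preference_pair(rankings, golden)
print(chosen, rejected)
```

Because the retriever stays frozen, only the rewriter receives gradient updates; the retriever serves purely as a source of ranking feedback.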

Modeling

Base Model: Not explicitly named in the paper. The Llama-3 tokenizer is used for benchmark statistics, which suggests (but does not confirm) a Llama-family rewriter.

Training Method: Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Optimize the policy to prefer rewrites that result in better tool rankings.

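Formally: L_DPO(π_θ; π_ref) = -E[log σ(β * log(π_θ(q_w|q)/π_ref(q_w|q)) - β * log(π_θ(q_l|q)/π_ref(q_l|q)))], where q_w and q_l are the preferred and rejected rewrites of query q, π_ref is the frozen reference policy, σ is the sigmoid function, and β controls how strongly π_θ is kept close to π_ref.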
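
To make the objective concrete, the per-pair loss can be computed from four sequence log-probabilities. The sketch below uses the standard DPO form; the function name `dpo_loss` and the default `beta=0.1` are assumptions for illustration, not values reported in the paper.

```python
import math

def dpo_loss(
    logp_w: float,      # log π_θ(q_w | q), policy log-prob of the chosen rewrite
    logp_l: float,      # log π_θ(q_l | q), policy log-prob of the rejected rewrite
    ref_logp_w: float,  # log π_ref(q_w | q)
    ref_logp_l: float,  # log π_ref(q_l | q)
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Per-pair DPO loss: -log σ(β * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log σ(m) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# When the policy equals the reference, the margin is 0 and the loss is log 2.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Raising the chosen rewrite's log-probability lowers the loss.
print(dpo_loss(-4.0, -7.0, -5.0, -7.0))
```

The loss only depends on how the policy's log-ratios differ from the reference, which is why no separate reward model is needed.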

Training Data:

300 total samples across 3 subsets
Only 10 samples per subset (30 in total) used for training, leaving 270 for testing (low-resource setting)

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolBench: QTA aligns queries using LLM rewriting and DPO rather than fine-tuning the retriever itself
vs. Sentence-BERT: QTA works with far less data (as few as one annotated sample), whereas S-BERT-style fine-tuning needs large annotated datasets on the order of millions of pairs
vs. General Retrieval: Focuses on 'Sufficiency' (getting ALL necessary tools) rather than just Recall

Limitations

Evaluation covers only 270 test samples
Depends on the quality of the underlying frozen retrieval model
The random sampling of tool documents for the LLM context might miss relevant tools during the rewriting phase

Reproducibility

The paper describes the MTRB benchmark construction in detail (300 samples total, derived from RestBench, MetaTool, ToolBench). Code URL is not provided in the text. The specific LLM used for the QTA back-end is not explicitly named in the main text (Llama-3 tokenizer is mentioned for stats).

📊 Experiments & Results

Evaluation Setup

Retrieval of tools from a repository of 2,645 tools using low-resource training data.

Benchmarks:

MTRB-RestBench (Tool Retrieval) [New]
MTRB-ToolBench (Tool Retrieval) [New]
MTRB-MetaTool (Tool Retrieval) [New]

Metrics:

Sufficiency@5 (S@5)
Sufficiency@10 (S@10)
NDCG@5 (N@5)
NDCG@10 (N@10)
Recall@k (implied/discussed)
Statistical methodology: Not explicitly reported in the paper
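
The difference between Sufficiency@k and Recall@k is easiest to see on a small example. The sketch below follows the definition of Sufficiency@k given in this summary (1 only if all golden tools are in the top-k); the function names and the sample ranking are illustrative assumptions.

```python
from typing import List, Set

def sufficiency_at_k(ranked: List[str], golden: Set[str], k: int) -> int:
    """1 if every golden tool appears in the top-k results, else 0."""
    return int(golden.issubset(ranked[:k]))

def recall_at_k(ranked: List[str], golden: Set[str], k: int) -> float:
    """Fraction of golden tools that appear in the top-k results."""
    return len(golden.intersection(ranked[:k])) / len(golden)

golden = {"a", "b", "c"}
ranked = ["a", "x", "b", "y", "z", "c"]

# Two of three golden tools are in the top 5: recall is partial credit,
# sufficiency is all-or-nothing.
print(recall_at_k(ranked, golden, 5), sufficiency_at_k(ranked, golden, 5))
print(recall_at_k(ranked, golden, 6), sufficiency_at_k(ranked, golden, 6))
```

A task that needs all three tools cannot be completed with two of them, which is the motivation for reporting the stricter metric.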

Key Results

Benchmark	Metric	Baseline	This Paper	Δ (absolute)
MTRB-RestBench	Sufficiency@5	16.67	32.22	+15.55
MTRB-RestBench	Sufficiency@5 (single training sample)	16.67	29.76	+13.09
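
The table reports absolute differences in points, while the Evaluation Highlights quote percentage gains. A quick check, assuming those percentages are relative gains over the same 16.67 baseline, shows the two sets of numbers are consistent up to rounding. The helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

baseline = 16.67
gain_row_1 = relative_improvement(baseline, 32.22)  # close to the reported 93.28%
gain_row_2 = relative_improvement(baseline, 29.76)  # close to the reported 78.53%
print(f"{gain_row_1:.1f}% {gain_row_2:.1f}%")
```

The small gap on the second row is expected if the reported percentage was computed from unrounded scores.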

Main Takeaways

QTA significantly improves retrieval sufficiency (ensuring all necessary tools are found), which is critical for complex tool-use tasks
The method is highly data-efficient, showing strong performance with as few as one training sample via DPO
MTRB establishes a challenging benchmark where baselines perform poorly, highlighting the difficulty of massive tool retrieval

📚 Prerequisite Knowledge

Prerequisites

Information Retrieval (Recall, NDCG)
Large Language Models (In-context learning)
Reinforcement Learning (Direct Preference Optimization)

Key Terms

MTR: Massive Tool Retrieval—the task of finding relevant tools from a very large repository based on a user query

QTA: Query-Tool Alignment—the proposed framework to rewrite user queries to better match tool documents

DPO: Direct Preference Optimization—an algorithm for training language models to satisfy preferences without an explicit reward model

Sufficiency@k: A custom binary metric that is 1 if the top-k retrieved results contain ALL necessary tools for a task, and 0 otherwise

RestBench: A dataset of RESTful APIs used as a source for the benchmark

ToolBench: A large-scale instruction-tuning benchmark for tool use

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that prioritizes correct items appearing earlier in the list
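
As a minimal sketch of the NDCG definition above, the code below uses binary relevance (a tool is either a golden tool or not). That choice and the function names are assumptions for illustration; the paper's exact gain values are not given in this summary.

```python
import math
from typing import List, Set

def dcg_at_k(ranked: List[str], golden: Set[str], k: int) -> float:
    """DCG with binary relevance: a hit at 0-based rank i adds 1 / log2(i + 2)."""
    return sum(
        1.0 / math.log2(i + 2)
        for i, tool in enumerate(ranked[:k])
        if tool in golden
    )

def ndcg_at_k(ranked: List[str], golden: Set[str], k: int) -> float:
    """DCG divided by the DCG of an ideal ranking (all golden tools first)."""
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(golden))))
    return dcg_at_k(ranked, golden, k) / ideal if ideal > 0 else 0.0

golden = {"a", "b"}
perfect = ndcg_at_k(["a", "b", "x"], golden, 3)  # golden tools ranked first
delayed = ndcg_at_k(["x", "a", "b"], golden, 3)  # same tools, pushed down one rank
print(perfect, delayed)
```

Both rankings contain every golden tool in the top 3, so Sufficiency@3 would be 1 for each; NDCG additionally rewards placing them earlier.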