Zhejiang University,
Beijing University of Posts and Telecommunications,
University of Cambridge
arXiv, 10/2024
(2024)
Agent, Benchmark, RL
📝 Paper Summary
Tool retrieval, Benchmark datasets, Query alignment
MTRB is a new benchmark for retrieving tools from massive repositories. It is accompanied by QTA (Query-Tool Alignment), a framework that aligns user queries with tool documentation using reinforcement learning on limited data.
Core Problem
Existing retrieval methods struggle with massive tool repositories due to context length limits and the semantic gap between user queries and technical tool documentation.
Why it matters:
Real-world applications involve thousands of tools whose documentation can exceed 100,000 characters, far beyond the context windows of many LLMs (e.g., Llama-2's 4,096 tokens)
Standard fine-tuning methods like Sentence-BERT require large annotated datasets, which are scarce for new tool domains
Current benchmarks focus on tool usage (planning/calling) rather than the preliminary step of retrieving the correct tools from a large database
Concrete Example: The user query 'give me a movie cover from the Harry Potter collection' requires coordinating multiple tools such as 'GET /search/collection', 'GET /collection/{id}', and 'GET /movie/{id}/images'. Standard retrievers fail to link the abstract request to these specific API endpoints without extensive training data.
Key Novelty
Query-Tool Alignment (QTA) with Direct Preference Optimization
Uses an LLM to rewrite user queries into forms that better match tool documentation, bridging the semantic gap
Aligns these rewrites using Direct Preference Optimization (DPO) derived from retrieval ranking feedback, rather than requiring a separate reward model
Specifically designed for low-resource settings, showing effectiveness with as few as one annotated training sample
Architecture
The QTA framework pipeline, including query rewriting, retrieval ranking, and the DPO training process.
Evaluation Highlights
+93.28% relative improvement in Sufficiency@5 on the MTRB-RestBench subset over the baseline (from 16.67 to 32.22)
78.53% relative improvement in Sufficiency@5 on MTRB-RestBench using just a single annotated training sample (from 16.67 to 29.76)
Consistently outperforms state-of-the-art models in top-5 and top-10 retrieval tasks across the full MTRB benchmark
Breakthrough Assessment
7/10
Significant improvements in low-resource settings and a necessary new benchmark for massive tool retrieval. The approach is data-efficient but relies on existing retrieval backends.
⚙️ Technical Details
Problem Definition
Setting: Retrieving a small subset of essential tools (Golden Tools GT) from a large tool database T containing M tools based on a user query q.
Inputs: User query q and a massive tool database T (tool names + descriptions)
Outputs: A ranked list of tools, where the top-k should contain the Golden Tools
Pipeline Flow
LLM Rewriter (rewrites user query q into q_re)
Retrieval Model (uses q_re to search tool database)
Ranking Function (evaluates retrieval quality to generate DPO signals)
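The three stages above can be sketched end to end as follows. This is a minimal illustration of the data flow only, not the paper's implementation: `toy_rewriter` is a hypothetical stand-in for the LLM rewriter, `retrieve` replaces the frozen retrieval model with simple token overlap, and the tool descriptions in `TOOLS` are invented for the example.

```python
from typing import Callable, Dict, List, Optional


def tokenize(text: str) -> set:
    """Lowercase whitespace tokenizer (toy assumption, not a real encoder)."""
    return set(text.lower().split())


def retrieve(query: str, tools: Dict[str, str], k: int) -> List[str]:
    """Toy stand-in for a frozen retriever: rank tools by token overlap."""
    q = tokenize(query)
    ranked = sorted(tools, key=lambda name: (-len(q & tokenize(tools[name])), name))
    return ranked[:k]


def toy_rewriter(query: str) -> str:
    """Hypothetical rewriter: appends documentation-style terms to the query."""
    return query + " search collection images"


def run_pipeline(query: str, tools: Dict[str, str], golden: List[str],
                 rewriter: Callable[[str], str], k: int) -> dict:
    q_re = rewriter(query)                      # stage 1: rewrite q into q_re
    retrieved = retrieve(q_re, tools, k)        # stage 2: search with q_re
    # stage 3: ranking feedback, here the 1-based rank of each golden tool
    ranks: Dict[str, Optional[int]] = {
        g: (retrieved.index(g) + 1 if g in retrieved else None) for g in golden
    }
    return {"q_re": q_re, "retrieved": retrieved, "golden_ranks": ranks}


# Invented descriptions for illustration only.
TOOLS = {
    "GET /search/collection": "search for a collection by name",
    "GET /collection/{id}": "get collection details by id",
    "GET /movie/{id}/images": "get the images that belong to a movie",
    "GET /tv/popular": "list popular tv shows",
    "GET /person/{id}": "get person details by id",
}
GOLDEN = ["GET /search/collection", "GET /collection/{id}", "GET /movie/{id}/images"]

result = run_pipeline(
    "give me a movie cover from the Harry Potter collection",
    TOOLS, GOLDEN, toy_rewriter, k=3,
)
```

In a real setup the rewriter would be an LLM call and the retriever a dense encoder; only the feedback in `golden_ranks` would flow back into training.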
System Modules
LLM Rewriter
Rewrites the user query to align with tool documentation semantics
Model or implementation: Not explicitly specified (likely Llama-series based on context)
Retrieval Model
Retrieves tools based on the rewritten query
Model or implementation: Frozen retrieval model (e.g., Sentence-BERT, Contriever)
Ranking Function
Calculates scores for rewritten queries based on how well they retrieve ground truth tools
Model or implementation: Algorithmic (Modified DCG)
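The summary does not say how the paper modifies DCG, so the sketch below uses plain binary-relevance DCG as an assumption: a rewritten query scores higher when the golden tools it retrieves sit nearer the top of the list. The function name `dcg_score` is hypothetical.

```python
import math
from typing import List, Set


def dcg_score(ranked_tools: List[str], golden: Set[str], k: int) -> float:
    """Binary-relevance DCG@k: a golden tool at 1-based rank r adds 1 / log2(r + 1)."""
    return sum(
        1.0 / math.log2(rank + 1)
        for rank, tool in enumerate(ranked_tools[:k], start=1)
        if tool in golden
    )


golden = {"a", "c"}
good = dcg_score(["a", "b", "c"], golden, k=3)   # golden tools at ranks 1 and 3
worse = dcg_score(["b", "a", "c"], golden, k=3)  # golden tools at ranks 2 and 3
```

Scores of this kind give a scalar per rewrite, which is all that is needed to order candidate rewrites of the same query.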
Novel Architectural Elements
Utilization of hidden ranking information from a frozen retrieval model to construct preference pairs (chosen/rejected) for DPO training of a query rewriter
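One way to turn ranking scores into DPO training data is sketched below. It assumes several candidate rewrites of the same query have already been scored by the ranking function; the `prompt`/`chosen`/`rejected` record layout and the `margin` parameter are illustrative choices, not details taken from the paper.

```python
from itertools import combinations
from typing import Dict, List


def build_preference_pairs(query: str, scored_rewrites: Dict[str, float],
                           margin: float = 0.0) -> List[dict]:
    """Pair up rewrites of one query; the higher-scoring rewrite is 'chosen'."""
    pairs = []
    for (a, score_a), (b, score_b) in combinations(scored_rewrites.items(), 2):
        if abs(score_a - score_b) <= margin:
            continue  # skip ties: they carry no preference signal
        chosen, rejected = (a, b) if score_a > score_b else (b, a)
        pairs.append({"prompt": query, "chosen": chosen, "rejected": rejected})
    return pairs


pairs = build_preference_pairs(
    "give me a movie cover from the Harry Potter collection",
    {"rewrite_1": 1.5, "rewrite_2": 1.13, "rewrite_3": 1.5},
)
```

Because the preferences come directly from the retriever's rankings, no separate reward model has to be trained.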
Modeling
Base Model: Not explicitly named for QTA initialization. Llama-3-8B-Instruct is implied by the tokenizer used for dataset statistics, so the rewriter is likely a Llama-2 or Llama-3 model.
Training Method: Direct Preference Optimization (DPO)
Objective Functions:
Purpose: Optimize the policy to prefer rewrites that result in better tool rankings.
Training data: only 10 samples per subset (low-resource setting)
Compute: Not reported in the paper
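For reference, the per-pair loss in the standard DPO formulation can be written as a small function. This is a sketch of the generic objective, assuming sequence-level log-probabilities are already available from the policy and a frozen reference model; the `beta` value is an arbitrary default, and the summary does not state which hyperparameters or variant the paper uses.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair: -log(sigmoid(margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: no preference learned yet.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy raises the chosen rewrite and lowers the rejected one.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls as the policy assigns relatively more probability to the chosen rewrite than the reference model does, which is the behavior the Purpose line above describes.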
Comparison to Prior Work
vs. ToolBench: QTA aligns queries using LLM rewriting and DPO rather than fine-tuning the retriever itself
vs. Sentence-BERT: QTA needs far less training data (low-resource) than the millions of pairs needed for S-BERT
vs. General Retrieval: Focuses on 'Sufficiency' (getting ALL necessary tools) rather than just Recall
Limitations
Evaluation covers only 270 test samples
Depends on the quality of the underlying frozen retrieval model
The random sampling of tool documents for the LLM context might miss relevant tools during the rewriting phase
Reproducibility
The paper describes the MTRB benchmark construction in detail (300 samples total, derived from RestBench, MetaTool, ToolBench). Code URL is not provided in the text. The specific LLM used for the QTA back-end is not explicitly named in the main text (Llama-3 tokenizer is mentioned for stats).
📊 Experiments & Results
Evaluation Setup
Retrieval of tools from a repository of 2,645 tools using low-resource training data.
Benchmarks:
MTRB-RestBench (Tool Retrieval) [New]
MTRB-ToolBench (Tool Retrieval) [New]
MTRB-MetaTool (Tool Retrieval) [New]
Metrics:
Sufficiency@5 (S@5)
Sufficiency@10 (S@10)
NDCG@5 (N@5)
NDCG@10 (N@10)
Recall@k (implied/discussed)
Statistical methodology: Not explicitly reported in the paper
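The difference between Sufficiency@k and Recall@k is easiest to see in code. The definitions below are a sketch based on how this summary describes the metrics (Sufficiency requires all golden tools in the top k); the paper's exact formulas may differ, and NDCG here uses binary relevance as an assumption. All three functions assume a non-empty golden set.

```python
import math
from typing import List, Set


def sufficiency_at_k(ranked: List[str], golden: Set[str], k: int) -> float:
    """1.0 only if every golden tool appears in the top k, else 0.0."""
    return 1.0 if golden <= set(ranked[:k]) else 0.0


def recall_at_k(ranked: List[str], golden: Set[str], k: int) -> float:
    """Fraction of golden tools found in the top k."""
    return len(golden & set(ranked[:k])) / len(golden)


def ndcg_at_k(ranked: List[str], golden: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG divided by the best achievable DCG."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, tool in enumerate(ranked[:k], start=1) if tool in golden)
    ideal = sum(1.0 / math.log2(r + 1)
                for r in range(1, min(len(golden), k) + 1))
    return dcg / ideal


ranked = ["t1", "x", "t2", "y", "z"]
partial = {"t1", "t2", "t3"}   # t3 is never retrieved
complete = {"t1", "t2"}
```

With `partial`, recall is two thirds but sufficiency is zero, since a plan that needs `t3` cannot be executed with the retrieved set. Per-query values would be averaged over the test set to produce the reported percentages.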
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| MTRB-RestBench | Sufficiency@5 | 16.67 | 32.22 | +15.55 |
| MTRB-RestBench | Sufficiency@5 | 16.67 | 29.76 | +13.09 |
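The Δ values above are absolute point differences, while the percentages quoted under Evaluation Highlights are relative to the baseline. The short check below recomputes both from the table's numbers; the helper names are illustrative.

```python
def absolute_gain(baseline: float, score: float) -> float:
    """Difference in metric points."""
    return score - baseline


def relative_gain_pct(baseline: float, score: float) -> float:
    """Gain expressed as a percentage of the baseline."""
    return 100.0 * (score - baseline) / baseline


BASELINE = 16.67
rows = {"first row": 32.22, "second row": 29.76}

for label, score in rows.items():
    print(label,
          round(absolute_gain(BASELINE, score), 2),
          round(relative_gain_pct(BASELINE, score), 2))
```

The first row reproduces the stated +93.28%. The second comes out within rounding of the stated 78.53%, which suggests it is the single-training-sample result.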
Main Takeaways
QTA significantly improves retrieval sufficiency (ensuring all necessary tools are found), which is critical for complex tool-use tasks
The method is highly data-efficient, showing strong performance with as few as one training sample via DPO
MTRB establishes a challenging benchmark where baselines perform poorly, highlighting the difficulty of massive tool retrieval