Department of Computer Science and Technology, Tsinghua University
arXiv
(2026)
Agent, RAG, Benchmark
📝 Paper Summary
Tool Retrieval, Agentic AI
MFTR improves tool retrieval by standardizing heterogeneous documentation into four functional fields and rewriting user queries to strictly align with this schema, enabling fine-grained multi-aspect relevance scoring.
Core Problem
Traditional retrieval treats tools as flat text, failing to address the structural inconsistency of documentation, the semantic mismatch between high-level user queries and atomic tools, and the strict constraints of parameter validity.
Why it matters:
LLM context windows cannot fit all available tools, making retrieval the bottleneck for general-purpose agents
Existing ad-hoc retrieval methods rely on semantic similarity, which ignores whether a tool is actually executable (e.g., missing required parameters)
Documentation from different sources (e.g., Gorilla vs. MetaTool) is highly heterogeneous, confusing standard retrievers
Concrete Example: A user asks to 'analyze sales trends', a request that implies several tools (data retrieval, analysis, plotting). A standard retriever might miss the specific 'plot_chart' tool because its documentation is purely technical (parameter names only, with no high-level description), or it might retrieve a tool whose required input ID the user cannot supply.
Key Novelty
Multi-Field Tool Retrieval (MFTR) Framework
Standardize all tool docs into a 4-field schema (Description, Parameters, Response, Examples) using an LLM to normalize heterogeneous sources
Rewrite user queries into 'Tool Needs' that map directly to these 4 fields, using Pseudo-Relevance Feedback to inject repository-specific terminology
Calculate relevance independently for each field (including a specific penalty mechanism for missing required parameters) and aggregate them with learnable weights
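The three steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `ToolDoc` class, the `jaccard` token-overlap scorer, the fixed equal weights, and the example tools are all assumptions made here for demonstration. The summary does not say which similarity function MFTR uses, and in the paper the field weights are learned rather than fixed.

```python
from dataclasses import dataclass

# The four standardized fields named in the summary.
FIELDS = ("description", "parameters", "response", "examples")


@dataclass
class ToolDoc:
    """A tool whose documentation has already been standardized (hypothetical type)."""
    name: str
    description: str
    parameters: str
    response: str
    examples: str


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a stand-in for whatever field scorer is actually used."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def multi_field_score(need: dict, tool: ToolDoc, weights: dict) -> float:
    """Score each field independently, then aggregate with per-field weights."""
    return sum(
        weights[f] * jaccard(need.get(f, ""), getattr(tool, f)) for f in FIELDS
    )


# A rewritten query ("Tool Need") expressed in the same four fields.
need = {
    "description": "draw a chart of sales trends",
    "parameters": "table_id",
    "response": "image",
    "examples": "plot sales trends",
}
tools = [
    ToolDoc("send_email", "send an email message", "recipient subject body",
            "delivery status", "email the report to bob"),
    ToolDoc("plot_chart", "draw a line chart from a table",
            "table_id x_column y_column", "image url",
            "plot monthly sales as a line chart"),
]
weights = {f: 0.25 for f in FIELDS}  # fixed here; learnable in the paper

ranked = sorted(tools, key=lambda t: multi_field_score(need, t, weights), reverse=True)
print([t.name for t in ranked])
```

Because each field is scored separately, a tool with a sparse description can still rank well through its parameters or examples, which is the behavior the flat-text baseline lacks.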
Architecture
The MFTR framework pipeline showing the two parallel paths: tool documentation standardization and query rewriting, meeting at the multi-field relevance computation.
Evaluation Highlights
Reports state-of-the-art (SOTA) performance on five datasets and a mixed benchmark (specific numbers not reported in the provided text)
Demonstrates that masking different documentation fields has varying impacts on retrieval, validating the need for multi-field modeling (Figure 1)
Successfully generalizes across different retriever backbones by decoupling structural alignment from the underlying scoring model
Breakthrough Assessment
7/10
Strong methodological contribution in recognizing that tools are not ordinary documents. The standardization and field-specific scoring address real pain points. The score is limited only by the lack of numeric results in the provided text, which makes the magnitude of improvement impossible to verify.
⚙️ Technical Details
Problem Definition
Setting: Retrieving a subset of relevant tools T_q from a repository T given a user query q
Inputs: User query q, Heterogeneous tool repository T = {(t_1, d_1), ...}
Outputs: Ranked list of tools relevant to q
Pipeline Flow
Documentation Standardization: Raw Docs → Standardized Schema (LLM)
Convert heterogeneous raw tool docs into unified 4-field schema
Model or implementation: LLM (Specific model not reported in text)
Query Rewriter (Inference)
Decompose query into 'Tool Needs' and align with tool schema
Model or implementation: LLM with Pseudo-Relevance Feedback
Multi-Field Scorer (Inference)
Compute relevance scores for each field and apply penalties
Model or implementation: Mathematical scoring function
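The Pseudo-Relevance Feedback step used by the query rewriter can be illustrated with classic term expansion. This is a simplified sketch under stated assumptions: MFTR performs the rewriting with an LLM, whereas the version below only appends frequent terms from the top-ranked tool docs. The function names, the stopword list, and the toy repository are hypothetical.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "an", "the", "of", "to", "for", "from", "and", "in", "on", "by"}


def overlap(query: str, doc: str) -> int:
    """First-pass relevance: number of shared tokens."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def prf_expand(query: str, docs: list, top_k: int = 2, n_terms: int = 3) -> str:
    """Assume the top_k first-pass results are relevant and borrow their vocabulary."""
    top_docs = sorted(docs, key=lambda d: overlap(query, d), reverse=True)[:top_k]
    seen = set(query.lower().split())
    counts = Counter(
        tok
        for doc in top_docs
        for tok in doc.lower().split()
        if tok not in seen and tok not in STOPWORDS
    )
    extra = [term for term, _ in counts.most_common(n_terms)]
    return query + " " + " ".join(extra) if extra else query


repository = [
    "plot_chart draw chart from sales table",
    "fetch_sales retrieve sales records by region",
    "send_email send email message",
]
expanded = prf_expand("analyze sales trends", repository)
print(expanded)
```

The expanded query now carries repository-specific terms taken from the sales-related tools, which is the effect the rewriting stage aims for before field-level matching.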
Novel Architectural Elements
Four-field standardized schema (Description, Parameters, Response, Examples) designed specifically for tool utility
Adaptive parameter penalty mechanism that uses a learnable threshold to penalize tools missing 'required' arguments
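One plausible shape for the parameter penalty is a sigmoid gate over the fraction of required arguments the query can supply. The exact formula is not given in the provided text, so everything below is an assumption: the coverage measure, the multiplicative use of the gate, and the default values of `tau` and `alpha` (which the summary describes only as a learnable threshold and a sigmoid shape control).

```python
import math


def required_coverage(required: list, available: list) -> float:
    """Fraction of a tool's required parameters that the query can supply."""
    needed = set(required)
    if not needed:
        return 1.0  # nothing is required, so nothing can be missing
    return len(needed & set(available)) / len(needed)


def parameter_gate(coverage: float, tau: float = 0.5, alpha: float = 10.0) -> float:
    """Sigmoid gate: near 1 when coverage exceeds tau, near 0 when it falls short.

    tau plays the role of the learnable threshold and alpha controls how sharp
    the transition is; both defaults are illustrative only.
    """
    return 1.0 / (1.0 + math.exp(-alpha * (coverage - tau)))


# A tool that needs two arguments, of which the query provides only one.
cov = required_coverage(["table_id", "x_column"], ["table_id"])
semantic_score = 0.8  # hypothetical aggregated score before the penalty
penalized = semantic_score * parameter_gate(cov)
print(cov, penalized)
```

A smooth gate keeps the penalty differentiable, so a threshold such as `tau` could in principle be trained together with the field weights.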
Modeling
Base Model: Not reported in the provided text (likely uses off-the-shelf LLMs like GPT or Llama for rewriting)
Training Method: Pairwise Ranking Optimization
Objective Functions:
Purpose: Maximize margin between positive and negative tools.
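Formally: L = max(0, 1 - (S(q, t^+) - S(q, t^-))), where S(q, t) is the relevance score of tool t for query q, t^+ is a relevant (positive) tool, and t^- is an irrelevant (negative) tool.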
Key Hyperparameters:
tau: Learnable threshold for parameter matching normalization
alpha: Control parameter for sigmoid shape in penalty function
w_f: Learnable weights for field aggregation
Compute: Not reported in the provided text
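The pairwise objective listed under Objective Functions is a standard hinge loss with margin 1. The sketch below evaluates it in plain Python for a small batch; the function names and the example scores are made up for illustration, and a real training loop would compute this in an autodiff framework.

```python
def pairwise_hinge_loss(s_pos: float, s_neg: float, margin: float = 1.0) -> float:
    """L = max(0, margin - (S(q, t+) - S(q, t-)))."""
    return max(0.0, margin - (s_pos - s_neg))


def batch_loss(pairs: list) -> float:
    """Mean hinge loss over (positive score, negative score) pairs."""
    return sum(pairwise_hinge_loss(p, n) for p, n in pairs) / len(pairs)


# Hypothetical scores: the first pair is already separated by more than the margin.
pairs = [(2.0, 0.5), (0.6, 0.4)]
print(batch_loss(pairs))
```

Pairs whose positive tool already outscores the negative by at least the margin contribute zero loss, so training effort concentrates on the pairs that are still confusable.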
Comparison to Prior Work
vs. ToolBench: MFTR models multiple fields explicitly (params vs intent) rather than treating the doc as a single semantic blob
vs. EasyTool: MFTR uses a specific 4-field schema and aligns the query to this schema, rather than just refining the doc text
vs. Standard RAG (BM25/Dense): MFTR introduces a parameter-validity penalty, ensuring retrieved tools are not just semantically relevant but executable
Limitations
Relies on an LLM to rewrite queries and standardize docs; hallucinations in rewriting could mislead retrieval
Parameter matching relies on semantic similarity of argument names/types, which may still fail for complex schemas
Computational cost of LLM-based query rewriting is higher than simple embedding-based retrieval
Code and prompts are publicly available at https://github.com/LittleDinoC/MFTR. The paper text provided does not specify exact training hyperparameters or the specific LLM used for standardization.
📊 Experiments & Results
Evaluation Setup
Tool retrieval from large-scale repositories given user queries
Statistical methodology: Not explicitly reported in the paper
Experiment Figures
Figure 1: Percentage change in NDCG when tool documentation is treated as a single document, with and without masking specific fields, across different datasets.
Main Takeaways
Numeric results were not provided in the input text, but the authors claim SOTA performance across five datasets and a mixed benchmark.
Masking different fields (e.g., description vs. parameters) in documentation has inconsistent effects on baseline performance (Figure 1), suggesting that 'flat text' retrieval is ill-suited for tools.
The adaptive weighting mechanism allows the model to dynamically balance semantic intent (Description/Examples) with execution constraints (Parameters).
📚 Prerequisite Knowledge
Prerequisites
Information Retrieval basics (BM25, Dense Retrieval)
Large Language Models (for rewriting/standardization)
Tool use in AI Agents
Key Terms
MFTR: Multi-Field Tool Retrieval—the proposed framework that aligns query and tool representations via structured fields
Ad-hoc retrieval: The standard information retrieval task in which a system finds documents relevant to a user query in a static collection
Pseudo-Relevance Feedback: A technique where top-ranked documents from an initial search are assumed to be relevant and used to refine the query
NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items
MCP: Model Context Protocol—a standard for connecting AI assistants to systems and data
RAG: Retrieval-Augmented Generation—providing LLMs with external data retrieved at runtime
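For readers unfamiliar with NDCG, the metric referenced in the evaluation, a minimal computation is sketched below. It assumes linear gain and computes the ideal ranking from the relevance labels of the retrieved list itself; benchmark implementations may instead use exponential gain or the full set of relevant tools for the ideal ranking.

```python
import math


def dcg(relevances: list) -> float:
    """Discounted cumulative gain: relevance discounted by log2 of (rank + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg_at_k(relevances: list, k: int) -> float:
    """DCG of the top-k results divided by the DCG of the best possible ordering."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0


# Relevance labels in ranked order: 1 = relevant tool, 0 = irrelevant tool.
print(ndcg_at_k([1, 0, 0], k=3))  # relevant tool ranked first
print(ndcg_at_k([0, 0, 1], k=3))  # same tool ranked third
```

Moving the single relevant tool from rank 1 to rank 3 lowers the score, which is why NDCG rewards retrievers that place the right tool near the top.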