Multi-Field Tool Retrieval

📝 Paper Summary

Tool Retrieval, Agentic AI

MFTR improves tool retrieval by standardizing heterogeneous documentation into four functional fields and rewriting user queries to strictly align with this schema, enabling fine-grained multi-aspect relevance scoring.

Core Problem

Traditional retrieval treats tools as flat text, failing to address the structural inconsistency of documentation, the semantic mismatch between high-level user queries and atomic tools, and the strict constraints of parameter validity.

Why it matters:

LLM context windows cannot fit all available tools, making retrieval the bottleneck for general-purpose agents
Existing ad-hoc retrieval methods rely on semantic similarity, which ignores whether a tool is actually executable (e.g., whether the user can supply its required parameters)
Documentation from different sources (e.g., Gorilla vs. MetaTool) is highly heterogeneous, confusing standard retrievers

Concrete Example: A user asks to 'analyze sales trends', which implies multiple tools (retrieval, analysis, plotting). A standard retriever might miss the specific 'plot_chart' tool because its documentation is purely technical (parameter names) without high-level descriptions, or retrieve a tool for which the user lacks the required input ID.

Key Novelty

Multi-Field Tool Retrieval (MFTR) Framework

Standardize all tool docs into a 4-field schema (Description, Parameters, Response, Examples) using an LLM to normalize heterogeneous sources
Rewrite user queries into 'Tool Needs' that map directly to these 4 fields, using Pseudo-Relevance Feedback to inject repository-specific terminology
Calculate relevance independently for each field (including a specific penalty mechanism for missing required parameters) and aggregate them with learnable weights

Architecture

The MFTR framework pipeline showing the two parallel paths: tool documentation standardization and query rewriting, meeting at the multi-field relevance computation.

Evaluation Highlights

Reports state-of-the-art (SOTA) performance on five datasets and a mixed benchmark (specific numbers not reported in the provided text)
Demonstrates that masking different documentation fields has varying impacts on retrieval, validating the need for multi-field modeling (Figure 1)
Successfully generalizes across different retriever backbones by decoupling structural alignment from the underlying scoring model

Breakthrough Assessment

7/10

Strong methodological contribution in recognizing that tools are not ordinary documents. The standardization and field-specific scoring address real pain points. The score is limited only by the lack of numeric results in the provided text, which makes the magnitude of improvement impossible to verify.

⚙️ Technical Details

Problem Definition

Setting: Retrieving a subset of relevant tools T_q from a repository T given a user query q

Inputs: User query q, Heterogeneous tool repository T = {(t_1, d_1), ...}

Outputs: Ranked list of tools relevant to q

Pipeline Flow

Documentation Standardization: Raw Docs → Standardized Schema (LLM)
Query Processing: User Query → Rewritten Structured Query (LLM + PRF)
Multi-Field Matching: Structured Query + Standardized Doc → Field Relevance Scores
Aggregation: Field Scores → Final Ranking Score
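
The last two pipeline steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: token-level Jaccard overlap stands in for whatever retriever backbone scores each field, the weights are set by hand rather than learned, and the names `field_scores` and `aggregate` are hypothetical.

```python
FIELDS = ("description", "parameters", "response", "examples")


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a stand-in for a real field-level relevance model."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def field_scores(query_fields: dict, doc_fields: dict) -> dict:
    """Score each of the four fields independently; missing fields score 0."""
    return {
        f: jaccard(query_fields.get(f, ""), doc_fields.get(f, ""))
        for f in FIELDS
    }


def aggregate(scores: dict, weights: dict) -> float:
    """Weighted average of field scores (weights are assumed, not learned)."""
    total = sum(weights[f] for f in FIELDS)
    return sum(weights[f] * scores[f] for f in FIELDS) / total
```

Keeping the per-field scores separate until the final step is what lets the weights shift emphasis between intent-bearing fields and constraint-bearing fields.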

System Modules

Standardization Module

Convert heterogeneous raw tool docs into unified 4-field schema

Model or implementation: LLM (Specific model not reported in text)
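
To make the target of standardization concrete, the four-field schema can be written as a small data structure. The sketch below is an assumption-laden stand-in: the paper uses an LLM to normalize documentation, whereas this example only remaps a few hypothetical key aliases (`FIELD_ALIASES`) to show what the output shape might look like.

```python
from dataclasses import dataclass, field


@dataclass
class StandardizedToolDoc:
    """The four fields named in the summary, plus the tool name."""
    name: str
    description: str = ""
    parameters: list = field(default_factory=list)
    response: str = ""
    examples: list = field(default_factory=list)


# Hypothetical aliases; real sources differ and MFTR relies on an LLM instead.
FIELD_ALIASES = {
    "description": ("description", "summary", "desc"),
    "parameters": ("parameters", "args", "arguments"),
    "response": ("response", "returns", "output"),
    "examples": ("examples", "example_usage"),
}


def standardize(name: str, raw: dict) -> StandardizedToolDoc:
    """Copy the first non-empty alias found for each field; leave others empty."""
    doc = StandardizedToolDoc(name=name)
    for target, aliases in FIELD_ALIASES.items():
        for key in aliases:
            if raw.get(key):
                setattr(doc, target, raw[key])
                break
    return doc
```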

Query Rewriter (Inference)

Decompose query into 'Tool Needs' and align with tool schema

Model or implementation: LLM with Pseudo-Relevance Feedback
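
Pseudo-Relevance Feedback can be illustrated with a classical term-expansion variant. Note the swap: MFTR feeds the feedback into an LLM rewriter, while this sketch simply appends frequent terms from the top-ranked docs to the query. The function names, the overlap-based first-pass ranking, and the defaults for `top_k` and `n_terms` are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase, split on whitespace/underscores, keep alphanumeric tokens."""
    return [t for t in text.lower().replace("_", " ").split() if t.isalnum()]


def overlap_score(query_tokens: list, doc_tokens: list) -> int:
    """First-pass relevance: number of distinct shared tokens."""
    return len(set(query_tokens) & set(doc_tokens))


def prf_expand(query: str, docs: list, top_k: int = 2, n_terms: int = 3) -> list:
    """Expand the query with frequent terms from the top_k first-pass results."""
    q_tokens = tokenize(query)
    ranked = sorted(
        docs,
        key=lambda d: overlap_score(q_tokens, tokenize(d)),
        reverse=True,
    )
    feedback = Counter()
    for d in ranked[:top_k]:
        # Only count terms the query does not already contain.
        feedback.update(t for t in tokenize(d) if t not in q_tokens)
    expansion = [t for t, _ in feedback.most_common(n_terms)]
    return q_tokens + expansion
```

The expansion terms come from the repository itself, which is how repository-specific vocabulary reaches the query.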

Multi-Field Scorer (Inference)

Compute relevance scores for each field and apply penalties

Model or implementation: Mathematical scoring function

Novel Architectural Elements

Four-field standardized schema (Description, Parameters, Response, Examples) designed specifically for tool utility
Adaptive parameter penalty mechanism that uses a learnable threshold to penalize tools missing 'required' arguments
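
The provided text does not give the exact form of the penalty, so the following is one plausible shape rather than the paper's formula: a sigmoid gate over the fraction of required parameters the query can supply, with `tau` as the threshold and `alpha` controlling steepness. The function names and default values are assumptions.

```python
import math


def required_coverage(required: list, provided: set) -> float:
    """Fraction of required parameters available; 1.0 if none are required."""
    if not required:
        return 1.0
    return sum(1 for p in required if p in provided) / len(required)


def parameter_gate(coverage: float, tau: float = 0.5, alpha: float = 10.0) -> float:
    """Assumed sigmoid gate: near 0 below tau, near 1 above it."""
    return 1.0 / (1.0 + math.exp(-alpha * (coverage - tau)))
```

Multiplying a tool's parameter-field score by this gate would suppress tools whose required inputs the user cannot provide, while leaving fully covered tools almost untouched.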

Modeling

Base Model: Not reported in the provided text (rewriting presumably uses an off-the-shelf LLM such as GPT or Llama)

Training Method: Pairwise Ranking Optimization

Objective Functions:

Purpose: Maximize margin between positive and negative tools.

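Formally: L = max(0, 1 - (S(q, t^+) - S(q, t^-))), where S(q, t) is the final relevance score of tool t for query q, t^+ is a relevant tool, and t^- is an irrelevant one.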
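
The hinge objective above translates directly into code. This is a plain-Python sketch of the formula with a margin of 1; the batching helper `mean_pairwise_loss` is a hypothetical convenience, not something the text describes.

```python
def pairwise_hinge_loss(score_pos: float, score_neg: float, margin: float = 1.0) -> float:
    """L = max(0, margin - (S(q, t+) - S(q, t-)))."""
    return max(0.0, margin - (score_pos - score_neg))


def mean_pairwise_loss(pairs: list, margin: float = 1.0) -> float:
    """Average the hinge loss over (positive score, negative score) pairs."""
    if not pairs:
        return 0.0
    return sum(pairwise_hinge_loss(p, n, margin) for p, n in pairs) / len(pairs)
```

The loss is zero once the positive tool outscores the negative one by at least the margin, so training effort concentrates on pairs that are still ranked too closely or in the wrong order.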

Key Parameters:

tau: Learnable threshold for parameter matching normalization
alpha: Control parameter for sigmoid shape in penalty function
w_f: Learnable weights for field aggregation

Compute: Not reported in the provided text

Comparison to Prior Work

vs. ToolBench: MFTR models multiple fields explicitly (parameters vs. intent) rather than treating the doc as a single semantic blob
vs. EasyTool: MFTR uses a specific 4-field schema and aligns the query to this schema, rather than just refining the doc text
vs. Standard RAG (BM25/Dense): MFTR introduces a parameter-validity penalty, ensuring retrieved tools are not just semantically relevant but executable

Limitations

Relies on LLM to rewrite queries and standardize docs; hallucinations in rewriting could mislead retrieval
Parameter matching relies on semantic similarity of argument names/types, which may still fail for complex schemas
Computational cost of LLM-based query rewriting is higher than simple embedding-based retrieval

Reproducibility

Code: https://github.com/LittleDinoC/MFTR

Code and prompts are publicly available in the repository linked above. The provided paper text does not specify exact training hyperparameters or the specific LLM used for standardization.

📊 Experiments & Results

Evaluation Setup

Tool retrieval from large-scale repositories given user queries

Benchmarks:

Gorilla: API retrieval (Python)
ToolBench: tool retrieval (various)
Mixed Benchmark: heterogeneous tool retrieval [New]

Metrics:

NDCG
Statistical methodology: Not explicitly reported in the paper
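
For reference, NDCG can be computed as below. This sketch uses the common log2 discount and, as a simplifying assumption, derives the ideal ranking from the relevance labels of the retrieved list only; an evaluation harness would normally use all known relevant tools for the query.

```python
import math


def dcg(relevances: list, k: int = None) -> float:
    """Discounted cumulative gain: sum of rel_i / log2(i + 1), ranks from 1."""
    rels = relevances[:k] if k is not None else relevances
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))


def ndcg(relevances: list, k: int = None) -> float:
    """DCG normalized by the DCG of the same labels sorted best-first."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` holds the graded relevance of each retrieved tool in ranked order, so a list already sorted best-first scores 1.0.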

Experiment Figures

Percentage change in NDCG when tool docs are treated as single documents, with and without specific fields masked, across different datasets.

Main Takeaways

Numeric results were not provided in the input text, but the authors claim SOTA performance across 5 datasets.
Masking different fields (e.g., description vs. parameters) in documentation has inconsistent effects on baseline performance (Figure 1), suggesting that 'flat text' retrieval is ill-suited for tools.
The adaptive weighting mechanism allows the model to dynamically balance semantic intent (Description/Examples) with execution constraints (Parameters).

📚 Prerequisite Knowledge

Prerequisites

Information Retrieval basics (BM25, Dense Retrieval)
Large Language Models (for rewriting/standardization)
Tool use in AI Agents

Key Terms

MFTR: Multi-Field Tool Retrieval—the proposed framework that aligns query and tool representations via structured fields

Ad-hoc retrieval: The standard information retrieval task in which the system finds documents relevant to a user query from a static collection

Pseudo-Relevance Feedback: A technique where top-ranked documents from an initial search are assumed to be relevant and used to refine the query

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items

MCP: Model Context Protocol—a standard for connecting AI assistants to systems and data

RAG: Retrieval-Augmented Generation—providing LLMs with external data retrieved at runtime