Enhancing Tool Retrieval with Iterative Feedback from Large Language Models

📝 Paper Summary

Tool learning, Tool retrieval

This paper enhances tool retrieval by using an LLM to iteratively critique retrieved tools and refine the user query, bridging the gap between retriever selection and actual tool utility.

Core Problem

Standard dense retrievers struggle with the complexity of tool descriptions and are misaligned with downstream tool-usage models, often retrieving tools that look semantically similar but are functionally irrelevant.

Why it matters:

Real-world tool libraries are vast and constantly updated, making fine-tuning or full-context learning impractical
Existing retrievers treat tools like documents, failing to grasp the specific functional nuances required for execution
Misalignment hinders the LLM from accessing truly useful tools, degrading overall system performance

Concrete Example: A user asks for 'stock prices'. A standard retriever might fetch a generic 'calculator' or 'currency converter' because of keyword overlap, while the LLM actually needs a specific 'stock_market_api'. The proposed method allows the LLM to critique the 'calculator' as irrelevant and refine the query to 'fetch real-time stock market data'.

Key Novelty

Iterative Feedback-based Tool Retrieval

Leverages the tool-using LLM as a 'critic' to evaluate retrieved tools before execution
The LLM provides structured feedback (Comprehension, Assessment, Refinement) to rewrite the search query
Uses iteration-aware training to teach the dense retriever to handle progressively refined queries and hard negatives

Architecture

The iterative feedback framework. It shows the cycle of Retrieval -> LLM Feedback (Comprehension, Assessment, Refinement) -> Refined Instruction -> Re-Retrieval.

Evaluation Highlights

Achieves the best performance on TR-Bench (In-Domain), surpassing dense retrieval baselines by clear margins (e.g., +0.063 absolute Recall@5 vs ToolBench, from 0.8650 to 0.9280)
Outperforms baselines on Out-of-Domain evaluation, demonstrating robustness to unseen tools
Iterative refinement consistently improves retrieval quality over multiple rounds (e.g., Recall@5 improves from 0.73 to 0.81 over 3 iterations)

Breakthrough Assessment

7/10

Offers a logical and effective solution to the retriever-LLM misalignment problem in tool use. While the core concept of 'LLM feedback' is known, applying it iteratively to the *pre-execution* retrieval phase is a valuable specific contribution.

⚙️ Technical Details

Problem Definition

Setting: Tool retrieval. Given a user instruction q and a tool set D, select a subset of K tools that help the LLM answer q.

Inputs: User instruction q, Tool set D = {d_1, ..., d_N}

Outputs: Top-K retrieved tools D_retrieved

Pipeline Flow

Initial Retrieval: User Query → Retriever → Initial Tool List
Feedback Loop (Repeat T times): Tool List + Query → LLM Critic → Refined Query → Retriever → Updated Tool List
Final Output: Top-K Tools
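
The pipeline above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the toy tool library, the keyword-overlap `keyword_retriever`, and the rule-based `stub_critic` are hypothetical stand-ins for the dense retriever and the LLM critic, and the early exit when the critic is satisfied is an assumption rather than something the summary states.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy tool library (names and descriptions are illustrative only).
TOOLS: Dict[str, str] = {
    "calculator": "compute arithmetic on prices and numbers",
    "currency_converter": "convert prices between currencies",
    "stock_market_api": "fetch real-time stock market data for a ticker",
}

Retriever = Callable[[str, int], List[str]]
Critic = Callable[[str, List[str]], Tuple[bool, str]]


def keyword_retriever(query: str, k: int) -> List[str]:
    """Stand-in for the dense retriever: rank tools by shared words."""
    q_words = set(query.lower().split())

    def overlap(name: str) -> int:
        return len(q_words & set(TOOLS[name].lower().split()))

    # Sort by descending overlap, breaking ties alphabetically for determinism.
    return sorted(TOOLS, key=lambda name: (-overlap(name), name))[:k]


def stub_critic(query: str, tools: List[str]) -> Tuple[bool, str]:
    """Stand-in for the LLM critic: returns (satisfied, refined_query)."""
    if "stock_market_api" in tools:
        return True, query
    return False, "fetch real-time stock market data"


def retrieve_with_feedback(query: str, retriever: Retriever, critic: Critic,
                           k: int = 2, max_iters: int = 3) -> Tuple[List[str], str]:
    """Initial retrieval followed by up to max_iters critique-and-refine rounds."""
    tools = retriever(query, k)
    for _ in range(max_iters):
        satisfied, refined = critic(query, tools)
        if satisfied:  # early exit is an assumption, not taken from the paper
            break
        query = refined
        tools = retriever(query, k)
    return tools, query


final_tools, final_query = retrieve_with_feedback(
    "current prices of stocks", keyword_retriever, stub_critic
)
```

In this toy run the first pass returns only the calculator and the currency converter; after one refinement the stock API ranks first. A real system would replace both stand-ins with a trained encoder and an LLM call.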

System Modules

Dense Retriever

Encodes query and tools to retrieve candidates based on vector similarity

Model or implementation: BERT-base architecture (Dual Encoder)
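
As a rough illustration of the retrieval step, the sketch below ranks tools by similarity between a query vector and tool vectors. The three-dimensional vectors are made-up placeholders for encoder outputs, and cosine similarity is an assumption; the summary only says "vector similarity".

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec: Sequence[float],
          tool_vecs: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the k tool names whose vectors are most similar to the query."""
    ranked = sorted(tool_vecs,
                    key=lambda name: cosine(query_vec, tool_vecs[name]),
                    reverse=True)
    return ranked[:k]


# Hypothetical embeddings; a real system would obtain these from the encoders.
tool_vecs = {
    "stock_market_api": [0.9, 0.1, 0.0],
    "calculator": [0.1, 0.9, 0.0],
    "currency_converter": [0.5, 0.5, 0.1],
}
query_vec = [1.0, 0.2, 0.0]
retrieved = top_k(query_vec, tool_vecs, k=2)
```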

LLM Critic

Analyzes retrieved tools and generates a refined query to guide the retriever

Model or implementation: Evaluated with various LLMs (e.g., ChatGPT)

Novel Architectural Elements

Iteration-aware retrieval mechanism: The retriever inputs include a special token (e.g., 'Iteration t') to distinguish between initial and refined queries during the feedback loop
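
One simple way to make the retriever input iteration-aware is to prepend the marker to the query text before encoding. The helper below is a hypothetical sketch: the function name and the exact "Iteration t: " format are assumptions, since the summary only gives 'Iteration t' as an example marker.

```python
def tag_query(query: str, iteration: int) -> str:
    """Prefix the query with an iteration marker (0 = initial query)."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    return f"Iteration {iteration}: {query}"


initial_input = tag_query("current prices of stocks", 0)
refined_input = tag_query("fetch real-time stock market data", 1)
```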

Modeling

Base Model: BERT-base (uncased) for the Dense Retriever

Training Method: Contrastive Learning with Hard Negatives

Objective Functions:

Purpose: Minimize distance between instruction and positive tools while maximizing distance to negatives.

Formally: $L(q_t) = -\log \frac{\exp(\mathrm{sim}(q_t, d^+))}{\exp(\mathrm{sim}(q_t, d^+)) + \sum_{d^-} \exp(\mathrm{sim}(q_t, d^-))}$

Purpose: Sum losses across all iterations.

Formally: $L = \sum_{t} \alpha_t \, L(q_t)$
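
The two objectives can be checked numerically with a small pure-Python sketch. It takes precomputed similarity scores as input and omits any softmax temperature, since none is reported; the function names and the example weights are illustrative assumptions.

```python
import math
from typing import Sequence


def contrastive_loss(pos_sim: float, neg_sims: Sequence[float]) -> float:
    """-log( exp(pos) / (exp(pos) + sum(exp(neg))) ), via a stable log-sum-exp."""
    sims = [pos_sim, *neg_sims]
    m = max(sims)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in sims))
    return log_denominator - pos_sim


def total_loss(per_iteration_losses: Sequence[float],
               alphas: Sequence[float]) -> float:
    """Weighted sum of the per-iteration losses: sum_t alpha_t * L(q_t)."""
    if len(per_iteration_losses) != len(alphas):
        raise ValueError("need exactly one weight per iteration")
    return sum(a * l for a, l in zip(alphas, per_iteration_losses))


# Hypothetical similarity scores for three iterations of one training example.
losses = [
    contrastive_loss(0.2, [0.1, 0.3]),
    contrastive_loss(0.6, [0.1, 0.3]),
    contrastive_loss(0.9, [0.1, 0.3]),
]
loss = total_loss(losses, alphas=[1.0, 1.0, 1.0])  # equal weights are an assumption
```

With the negatives held fixed, the per-iteration loss shrinks as the positive similarity grows, which is the behavior the objective is meant to reward.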

Adaptation: Full fine-tuning of the Retriever

Training Data:

TR-Bench: constructed from ToolBench data
Includes In-Domain (seen categories) and Out-of-Domain (unseen categories) splits

Key Hyperparameters:

batch_size: 128
learning_rate: 2e-5
epoch: 50 (early stopping patience 5)
max_length_query: 128
max_length_document: 128
temperature_softmax: Not reported in the paper
iterations_T: 3 (inference setting)

Compute: Experiments run on a single NVIDIA A800 GPU

Comparison to Prior Work

vs. ToolBench-Retriever: This method adds an iterative feedback loop where the LLM critiques results to refine the query, rather than a single-pass retrieval
vs. DPR: Explicitly trains on 'hard negatives' identified during the feedback process (tools that look relevant but aren't)
vs. CRITIC [not cited in paper]: CRITIC focuses on verifying tool *outputs* (execution), while this method focuses on verifying tool *selection* (retrieval) before execution

Limitations

Inference latency increases linearly with the number of feedback iterations
Relies on the capability of the LLM to provide accurate feedback; a weak LLM might mislead the retriever
Requires training a specific retriever model (not plug-and-play with black-box retrievers like OpenAI embeddings)

Reproducibility

Code: https://github.com/travis-xu/TR-Feedback

The paper details the prompt templates for feedback generation and the loss functions. Hyperparameters for retriever training (learning rate, batch size) are provided.

📊 Experiments & Results

Evaluation Setup

Tool Retrieval on the TR-Bench benchmark constructed by the authors

Benchmarks:

TR-Bench (In-Domain) (Retrieving tools from categories seen during training) [New]
TR-Bench (Out-of-Domain) (Retrieving tools from categories NOT seen during training) [New]

Metrics:

NDCG@1
NDCG@3
NDCG@5
Recall@1
Recall@3
Recall@5
Statistical methodology: Not explicitly reported in the paper
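
For reference, the ranking metrics listed above can be computed as follows. This sketch assumes binary relevance labels and a ranked list without duplicates; the paper's exact evaluation script may differ.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant tools that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for tool in ranked[:k] if tool in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, tool in enumerate(ranked[:k]) if tool in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking with two relevant tools, one of them ranked third.
ranked = ["stock_market_api", "calculator", "ticker_lookup"]
relevant = {"stock_market_api", "ticker_lookup"}
r1 = recall_at_k(ranked, relevant, 1)
r3 = recall_at_k(ranked, relevant, 3)
n3 = ndcg_at_k(ranked, relevant, 3)
```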

Key Results

Performance on TR-Bench In-Domain and Out-of-Domain settings compared to strong baselines.

Benchmark	Metric	Baseline	This Paper	Δ
TR-Bench (In-Domain)	NDCG@5	0.7578	0.8037	+0.0459
TR-Bench (In-Domain)	Recall@5	0.8650	0.9280	+0.0630
TR-Bench (Out-of-Domain)	NDCG@5	0.5510	0.6272	+0.0762
TR-Bench (Out-of-Domain)	Recall@5	0.6860	0.8250	+0.1390

Ablation study showing the impact of the iterative feedback mechanism.

Benchmark	Metric	Baseline	This Paper	Δ
TR-Bench (In-Domain)	Recall@5	0.73	0.81	+0.08
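
The Δ column reports absolute differences, which are not the same as relative (percentage) improvements. The short sketch below recomputes both from the main-results values in the table; the dictionary layout and function name are illustrative only.

```python
from typing import Dict, Tuple

# (benchmark, metric) -> (baseline, this paper), copied from the table above.
RESULTS: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("In-Domain", "NDCG@5"): (0.7578, 0.8037),
    ("In-Domain", "Recall@5"): (0.8650, 0.9280),
    ("Out-of-Domain", "NDCG@5"): (0.5510, 0.6272),
    ("Out-of-Domain", "Recall@5"): (0.6860, 0.8250),
}


def gains(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain as a fraction of the baseline)."""
    absolute = ours - baseline
    return absolute, absolute / baseline


for (split, metric), (baseline, ours) in RESULTS.items():
    absolute, relative = gains(baseline, ours)
    print(f"{split:14s} {metric:9s} abs={absolute:+.4f} rel={relative:+.1%}")
```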

Experiment Figures

Performance curves (Recall@5 and NDCG@5) across iteration steps (0 to 3).

Main Takeaways

Iterative feedback significantly boosts retrieval performance, especially in Out-of-Domain settings where the retriever hasn't seen the tools before.
The method is robust to different LLM backbones (e.g., Llama vs ChatGPT) used for generating feedback.
Hard negative sampling based on the retriever's own confusion (high-scoring incorrect tools) is crucial for training the retriever to be discerning.
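
The hard-negative idea in the last takeaway can be sketched as selecting the highest-scoring tools that are not labeled relevant. The function name, scores, and tool names below are hypothetical; the paper's exact sampling procedure is not described in this summary.

```python
from typing import Dict, List, Set


def mine_hard_negatives(scores: Dict[str, float],
                        positives: Set[str], n: int) -> List[str]:
    """Return the n highest-scoring tools that are not labeled positive."""
    negatives = [tool for tool in scores if tool not in positives]
    negatives.sort(key=lambda tool: scores[tool], reverse=True)
    return negatives[:n]


# Hypothetical retriever scores for one query.
scores = {
    "stock_market_api": 0.92,
    "currency_converter": 0.81,
    "calculator": 0.40,
    "weather_api": 0.05,
}
hard_negatives = mine_hard_negatives(scores, positives={"stock_market_api"}, n=2)
```

The selected tools are the ones the retriever currently confuses with the correct answer, so training against them targets exactly the distinctions it gets wrong.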

📚 Prerequisite Knowledge

Prerequisites

Dense Retrieval (Dual Encoders)
Contrastive Learning
Tool Learning / Tool Use in LLMs

Key Terms

Dense Retriever: A retrieval system that encodes queries and documents into vectors and finds matches via similarity search

Hard Negative Sampling: Training technique where incorrect items that are very similar to the correct item are used as negative examples to force the model to learn subtle distinctions

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that takes into account the position of relevant items

Recall@K: The proportion of relevant items found in the top-K retrieved results

Contrastive Learning: A learning paradigm where the model learns to pull positive pairs close together and push negative pairs apart in vector space

BM25: A probabilistic retrieval function based on term frequency and inverse document frequency (sparse retrieval)