ToolScope: Enhancing LLM Agent Tool Use through Tool Merging and Context-Aware Filtering

📝 Paper Summary

Multi-call tool use with fixed plan | Tool profiling

ToolScope improves LLM agent tool selection by automatically merging redundant tools into a unified graph and using a hybrid retrieval-reranking pipeline to filter irrelevant options.

Core Problem

Large toolsets often contain tools with overlapping names and descriptions (redundancy) and exceed LLM context limits, causing agents to select incorrect tools or fail to process the input.

Why it matters:

Ambiguous or duplicate tool definitions confuse retrievers and LLMs, leading to lower selection accuracy
Strict input context limits prevent agents from considering large numbers of tools, forcing aggressive filtering that may discard the correct tool
Existing methods improve individual tool documentation but fail to address cross-tool semantic overlap or simultaneously solve context limitations

Concrete Example: In a dataset like Seal-Tools, multiple tools might perform 'weather checking' with slightly different names. An agent might hallucinate or pick the wrong variant due to ambiguity. ToolScope merges these into one canonical tool entry.

Key Novelty

Automated Tool Graph Merging & Hybrid Context-Aware Retrieval

Constructs a tool graph where nodes are tools and edges represent semantic equivalence, then collapses connected components into single canonical tools to remove redundancy
Uses an LLM-based 'Auto-Correction' step to audit merge decisions, splitting clusters if they contain non-equivalent tools
Employs a hybrid retrieval strategy (sparse + dense) followed by a cross-encoder reranker to select only the top-k relevant tools, sharply reducing the number of tool-description tokens placed in the LLM's context
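The graph-merging idea above can be illustrated with a minimal sketch. It assumes the equivalence edges have already been decided (in the paper, by an LLM classifier) and picks the shortest name as the representative, as the Limitations section describes. The function name, the union-find implementation, and the alphabetical tie-break are illustrative assumptions, not the authors' code.

```python
from collections import defaultdict

def merge_equivalent_tools(tool_names, equivalent_pairs):
    """Collapse connected components of an equivalence graph into one tool each.

    Hypothetical helper: nodes are tool names, edges are pairs judged equivalent.
    """
    parent = {name: name for name in tool_names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in equivalent_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for name in tool_names:
        clusters[find(name)].append(name)

    merged = {}
    for members in clusters.values():
        # Representative = shortest name; alphabetical tie-break is an assumption.
        representative = min(members, key=lambda n: (len(n), n))
        merged[representative] = sorted(members)
    return merged

tools = ["get_weather", "weather_lookup", "check_weather_now", "send_email"]
pairs = [("get_weather", "weather_lookup"), ("weather_lookup", "check_weather_now")]
canonical = merge_equivalent_tools(tools, pairs)
```

Note that equivalence is applied transitively here: "get_weather" and "check_weather_now" end up in the same cluster even though no edge links them directly, which is exactly the situation an audit step is meant to catch when the chain is wrong.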

Architecture

The complete ToolScope pipeline including the offline Merger process and the online Retriever process.

Evaluation Highlights

+34.6 percentage-point improvement in CSR@10 on Seal-Tools, a challenging multi-tool benchmark (58.42 → 93.00), using GPT-4o compared to the baseline
+38.6 percentage-point improvement in CSR@10 on UltraTool (26.36 → 65.00) using GPT-4o, demonstrating effectiveness in real-world scenarios
Reduces context length by 99.9% on Seal-Tools (from 292,107 tokens to 317 tokens) while maintaining high retrieval recall

Breakthrough Assessment

7/10

Strong empirical gains on difficult benchmarks and a practical solution to the 'too many tools' problem. The automated merging with LLM correction is a clever addition to standard retrieval pipelines.

⚙️ Technical Details

Problem Definition

Setting: Tool selection for LLM agents given a user query and a large library of potential tools

Inputs: User query q and a raw toolset T containing n tools

Outputs: A subset of tools T' relevant to q

Pipeline Flow

ToolScopeMerger: Candidate Generation → Relationship Classification → Graph Construction → Pruning → Auto-Correction
ToolScopeRetriever: Query Decomposition (if multi-tool) → Hybrid Retrieval → Reranking → Selection

System Modules

Candidate Generator (ToolScopeMerger)

Identify potential merge candidates using cosine similarity of embeddings

Model or implementation: gte-large (embedding model)
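As a rough illustration of candidate generation, the sketch below compares every pair of tool embeddings and keeps those at or above a cosine-similarity threshold. The toy three-dimensional vectors stand in for real gte-large embeddings, and using 0.82 as the cutoff for this specific step is an assumption based on the similarity threshold mentioned under Reproducibility.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def candidate_pairs(embeddings, threshold=0.82):
    """Return (tool_a, tool_b, similarity) for every pair at or above the threshold."""
    pairs = []
    for a, b in combinations(sorted(embeddings), 2):
        sim = cosine(embeddings[a], embeddings[b])
        if sim >= threshold:
            pairs.append((a, b, round(sim, 3)))
    return pairs

# Toy vectors, not real model output.
toy_embeddings = {
    "get_weather": [0.9, 0.1, 0.0],
    "weather_lookup": [0.85, 0.2, 0.05],
    "send_email": [0.0, 0.1, 0.95],
}
merge_candidates = candidate_pairs(toy_embeddings)
```

The all-pairs loop is quadratic in the number of tools; for a large library an approximate nearest-neighbour index would be the usual substitute, though the summary does not say which approach the paper takes.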

Relationship Classifier (ToolScopeMerger)

Determine if a pair of tools is semantically equivalent

Model or implementation: GPT-4o

Auto-Correction Validator (ToolScopeMerger)

Audit proposed clusters to fix incorrect merges

Model or implementation: GPT-4o
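One way to picture the audit step is as a function that takes proposed clusters and a pairwise judge, and splits any cluster whose members the judge rejects. In the paper the judge is GPT-4o; here it is a pluggable callable, and both the greedy splitting rule and the name-prefix stand-in judge are assumptions made purely for illustration.

```python
def auto_correct(clusters, is_equivalent):
    """Split each proposed cluster into groups the judge accepts as equivalent.

    `is_equivalent(a, b)` is a hypothetical stand-in for an LLM verdict.
    """
    corrected = []
    for members in clusters:
        groups = []
        for tool in members:
            for group in groups:
                # Compare against the first member of each existing group.
                if is_equivalent(group[0], tool):
                    group.append(tool)
                    break
            else:
                # No existing group accepted this tool, so it starts its own.
                groups.append([tool])
        corrected.extend(groups)
    return corrected

def toy_judge(a, b):
    # Stand-in rule: tools are "equivalent" if their names share a first token.
    return a.split("_")[0] == b.split("_")[0]

proposed = [["weather_get", "weather_lookup", "email_send"]]
fixed = auto_correct(proposed, toy_judge)
```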

Hybrid Retriever (ToolScopeRetriever)

Retrieve initial candidate tools for a query

Model or implementation: BM25 (sparse) + gte-large (dense)
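A minimal sketch of sparse plus dense retrieval follows. The summary does not state how the two signals are combined, so reciprocal rank fusion is used here as an assumed, commonly used choice. The BM25 scorer is a compact pure-Python version with whitespace tokenization, and the dense scores are hard-coded placeholders for what an embedding model such as gte-large would produce.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (whitespace tokens)."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    doc_freq = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for name, toks in tokenized.items():
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            denom = term_freq[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / denom
        scores[name] = score
    return scores

def rrf_fuse(score_maps, k=60, top_n=10):
    """Reciprocal rank fusion over several {tool: score} maps (assumed fusion rule)."""
    fused = Counter()
    for scores in score_maps:
        ranked = sorted(scores, key=lambda name: (-scores[name], name))
        for rank, name in enumerate(ranked, start=1):
            fused[name] += 1.0 / (k + rank)
    ordered = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ordered][:top_n]

docs = {
    "get_weather": "return the current weather for a city",
    "send_email": "send an email message to a recipient",
    "get_forecast": "return a multi day weather forecast for a city",
}
# Placeholder dense similarities, not real embedding output.
dense = {"get_weather": 0.91, "get_forecast": 0.84, "send_email": 0.12}
sparse = bm25_scores("current weather in a city", docs)
top_tools = rrf_fuse([sparse, dense], top_n=2)
```

Rank-based fusion sidesteps the fact that BM25 scores and cosine similarities live on different scales; a weighted sum of normalized scores would be an equally plausible design.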

Reranker (ToolScopeRetriever)

Re-rank top candidates for final selection

Model or implementation: Cross-encoder
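The reranking stage can be sketched as re-scoring each (query, tool description) pair and keeping the top-k. The summary does not name the cross-encoder, so the scorer below is a deliberately crude token-overlap stand-in; in practice `score_pair` would wrap a real cross-encoder model. All names here are hypothetical.

```python
def rerank(query, candidates, score_pair, top_k=5):
    """Re-score candidate tools jointly with the query and keep the best top_k.

    `candidates` maps tool name -> description; `score_pair` is a pluggable scorer.
    """
    scored = [(score_pair(query, text), name) for name, text in candidates.items()]
    # Highest score first; ties broken by name for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_k]]

def toy_overlap_scorer(query, text):
    # Jaccard overlap of tokens: a placeholder, NOT a cross-encoder.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t)

retrieved = {
    "send_email": "send an email message to a recipient",
    "get_weather": "return the current weather for a city",
    "get_contacts": "look up an email address",
}
final_tools = rerank("send an email to bob", retrieved, toy_overlap_scorer, top_k=2)
```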

Novel Architectural Elements

Graph-based tool pruning pipeline with an explicit LLM-in-the-loop Auto-Correction module to fix merge errors before runtime
Two-stage retrieval pipeline specifically adapted for tool selection (sparse/dense hybrid + cross-encoder reranking)

Modeling

Base Model: GPT-4o, LLaMA-3.3-70B, Cohere-Command-R-08-2024 (evaluated as agents)

Training Method: Tuning-free approach (In-context learning + RAG)

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolShed: ToolScope provides an automated graph-based merging solution rather than just identifying the overlap problem
vs. BM25/Dense: ToolScope modifies the underlying corpus (merging) AND uses a reranker, rather than just retrieving from the raw set
vs. EasyTool: ToolScope focuses on cross-tool redundancy rather than just individual tool documentation enhancement [not cited in paper]

Limitations

Relies on LLMs (GPT-4o) for the merging and auto-correction steps, which can be costly
Merging strategy currently selects a representative tool based on shortest name length, which might not always be the most descriptive choice
Performance gains from the Reranker component are negligible on simpler datasets like BFCL
Requires an initial 'offline' processing step to index and merge the toolset before it can be used for inference

Reproducibility

No public code URL provided in the paper. The paper lists prompts in Appendix H and hyperparameters (e.g., similarity threshold 0.82) in Section 4. Uses closed-source models (GPT-4o, Cohere) for evaluation.

📊 Experiments & Results

Evaluation Setup

Tool selection using LLMs as agents across varying domains

Benchmarks:

Seal-Tools (Multi-tool calling with large toolset)
UltraTool (Real-world planning and tool use)
BFCL (Berkeley Function Calling Leaderboard) (Single-turn tool calling)

Metrics:

CSR@k (Correct Selection Rate at top-k)
Recall@k
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison results demonstrating ToolScope's improvement over baselines across different LLMs and datasets.
Seal-Tools	CSR@10	58.42	93.00	+34.58
UltraTool	CSR@10	26.36	65.00	+38.64
BFCL	CSR@10	89.00	97.80	+8.80
Impact of Auto-Correction module on performance.
UltraTool	CSR	57.1	65.0	+7.9
Context length reduction analysis.
Seal-Tools	Context Tokens	292107	317	-291790
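The deltas and the 99.9% context reduction quoted in this summary can be re-derived directly from the table values. The short check below uses only the numbers listed above; note that the CSR deltas are differences in percentage points, not relative improvements.

```python
# (baseline, ToolScope) CSR@10 values copied from the table above.
csr_rows = {
    "Seal-Tools": (58.42, 93.00),
    "UltraTool": (26.36, 65.00),
    "BFCL": (89.00, 97.80),
}
deltas = {name: round(ours - base, 2) for name, (base, ours) in csr_rows.items()}

# Context length on Seal-Tools, in tokens, before and after filtering.
tokens_before, tokens_after = 292107, 317
tokens_saved = tokens_before - tokens_after
reduction_pct = round(100 * tokens_saved / tokens_before, 1)
```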

Experiment Figures

t-SNE visualization of tool embeddings before and after merging.

Robustness analysis of CSR across different levels of documentation quality (Low, Medium, High).

Main Takeaways

ToolScopeMerger is the primary driver of performance gains, drastically reducing semantic overlap which confuses retrievers.
Auto-Correction is essential for high-noise datasets (UltraTool, Seal-Tools), preventing over-merging of distinct tools.
The approach is robust to low-quality tool documentation, maintaining high accuracy even when descriptions are sparse.
Reranking provides significant boosts at lower k values (k=5, 10) but has diminishing returns as k increases.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG) pipelines
Familiarity with vector embeddings and similarity search
Basic graph theory (connected components)

Key Terms

CSR: Correct Selection Rate, the percentage of queries where the predicted toolset exactly matches the ground-truth set

Recall@k: The proportion of ground-truth tools correctly retrieved among the top-k predictions

Hybrid Retrieval: Combining keyword-based search (like BM25) with semantic vector search (dense embeddings) to improve retrieval accuracy

Reranker: A model (often a cross-encoder) that re-scores a small set of retrieved candidates to improve the final ranking

Tool Pruning: The process of selecting a single representative tool from a cluster of semantically equivalent tools and removing the rest

Auto-Correction: An LLM-based audit step that validates whether a proposed cluster of merged tools is truly semantically equivalent
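The CSR and Recall@k definitions above can be made concrete with a small sketch. It follows the wording of this summary: CSR counts exact set matches, and Recall@k measures the share of ground-truth tools found in the top-k. Averaging recall per query is an assumption; the paper may aggregate differently.

```python
def csr(predicted, gold):
    """Percentage of queries whose predicted tool set equals the gold set exactly."""
    hits = sum(set(p) == set(g) for p, g in zip(predicted, gold))
    return 100.0 * hits / len(gold)

def recall_at_k(ranked, gold, k):
    """Mean over queries of the fraction of gold tools present in the top-k ranking."""
    per_query = []
    for ranking, gold_tools in zip(ranked, gold):
        gold_set = set(gold_tools)
        per_query.append(len(gold_set & set(ranking[:k])) / len(gold_set))
    return sum(per_query) / len(per_query)

gold = [["get_weather"], ["send_email", "get_contacts"]]
predicted = [["get_weather"], ["send_email"]]
ranked = [
    ["get_weather", "get_forecast"],
    ["send_email", "get_weather", "get_contacts"],
]
csr_value = csr(predicted, gold)
recall_2 = recall_at_k(ranked, gold, k=2)
recall_3 = recall_at_k(ranked, gold, k=3)
```

In this toy case the second query misses one of its two gold tools, so CSR is 50% while Recall@2 is 0.75, which shows why exact-match CSR is the stricter of the two metrics.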