EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction

📝 Paper Summary

Tool profiling, Tool-use post-training

EASYTOOL transforms lengthy, inconsistent tool documentation into concise instructions with synthesized usage scenarios, improving LLM agent performance while reducing token costs.

Core Problem

Existing tool documentation is diverse, redundant, and often incomplete (for example, missing usage scenarios), so LLMs run into context limits and predict parameters incorrectly.

Why it matters:

Massive redundant information in API docs (e.g., URLs, pricing) consumes valuable context window tokens
Inconsistent formats across different tool providers (e.g., RapidAPI vs. Hugging Face) make it hard for agents to parse functionality reliably
Lack of concrete examples leads to high parameter error rates when LLMs attempt to invoke tools

Concrete Example: A RapidAPI documentation entry might contain 2,530 tokens of metadata such as 'pricing: FREEMIUM' and server URLs, which obscure the core 'List Movies' function. Without a usage example, an LLM can fail to supply the required boolean parameter 'with_rt_ratings', and the call fails.

Key Novelty

Two-stage Tool Instruction Generation

Stage 1 (Description Generation): Uses an LLM to distill raw documentation into a standardized 'Tool Description' that strips irrelevant metadata (like IDs/URLs) and retains only functional purposes.
Stage 2 (Guideline Construction): Synthesizes 'Tool Functionality Guidelines' by generating concrete usage scenarios and example parameter payloads (e.g., JSON inputs) to ground the model's understanding.
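
The two stages above can be sketched as two chained LLM calls per tool. This is a minimal illustration, not the paper's implementation: the prompt wording, the function name build_tool_instruction, the output keys, and the stand-in fake_llm are all assumptions, and the raw documentation fields are invented apart from 'List Movies', 'pricing: FREEMIUM', and 'with_rt_ratings'.

```python
import json
from typing import Callable

# Hypothetical prompt wording; the paper's actual prompts are not reproduced here.
DESCRIPTION_PROMPT = (
    "Summarize what the following tool does in one or two sentences. "
    "Omit URLs, IDs, pricing, and other metadata.\n\n{doc}"
)
GUIDELINE_PROMPT = (
    "Given this tool description and its parameters, write one usage scenario "
    "and an example parameter payload as JSON with keys 'scenario' and "
    "'parameters'.\n\nDescription: {description}\nParameters: {parameters}"
)


def build_tool_instruction(raw_doc: dict, llm: Callable[[str], str]) -> dict:
    """Run both stages for one tool and return a unified instruction."""
    # Stage 1: distill the raw documentation into a short description.
    description = llm(DESCRIPTION_PROMPT.format(doc=json.dumps(raw_doc))).strip()
    # Stage 2: synthesize a usage scenario with an example payload.
    guideline_text = llm(
        GUIDELINE_PROMPT.format(
            description=description,
            parameters=json.dumps(raw_doc.get("parameters", [])),
        )
    )
    guideline = json.loads(guideline_text)
    return {
        "name": raw_doc["name"],
        "description": description,
        "guidelines": [guideline],
    }


def fake_llm(prompt: str) -> str:
    """Stand-in for a chat model so the sketch runs offline (canned answers)."""
    if prompt.startswith("Summarize"):
        return "Lists movies, optionally including extra rating fields."
    return json.dumps({
        "scenario": "List movies together with their ratings.",
        "parameters": {"with_rt_ratings": True},
    })


# Invented raw documentation entry with the kind of metadata Stage 1 drops.
raw = {
    "name": "List Movies",
    "pricing": "FREEMIUM",
    "url": "https://example.invalid/list_movies",
    "parameters": [{"name": "with_rt_ratings", "type": "BOOLEAN"}],
}
instruction = build_tool_instruction(raw, fake_llm)
print(json.dumps(instruction, indent=2))
```

In a real run, fake_llm would be replaced by a call to the preprocessing model, and the Stage 2 output would need validation because a model may not return well-formed JSON.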

Architecture

Comparison between raw tool documentation (messy, redundant) and EASYTOOL's unified tool instruction (concise, structured with examples).

Evaluation Highlights

Reduces token consumption by 70.43% on ToolBench and 97.35% on RestBench compared to original documentation
Achieves 72.8% Success Rate with GPT-4 + DFSDT on ToolBench, outperforming the baseline GPT-4 + DFSDT (64.3%)
Boosts ChatGPT's Correct Path Rate on RestBench-TMDB from ~45% (ReAct) to ~65% (EASYTOOL), significantly improving tool sequencing
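
For reference, the token-reduction figures above are relative reductions in length. A small sketch of that arithmetic follows; the function name is an assumption, and the compressed length of 748 tokens is a hypothetical value chosen only to illustrate a reduction of roughly 70% from a 2,530-token document.

```python
def token_reduction_percent(original_tokens: int, compressed_tokens: int) -> float:
    """Relative reduction in token count, expressed as a percentage."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 100.0 * (original_tokens - compressed_tokens) / original_tokens


# Hypothetical lengths: a 2,530-token raw document compressed to 748 tokens.
reduction = token_reduction_percent(2530, 748)
print(f"{reduction:.2f}% fewer tokens")
```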

Breakthrough Assessment

7/10

Simple yet effective preprocessing method that addresses a major practical bottleneck (token cost and documentation quality) for agentic systems. Strong empirical results across multiple benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Given a user request T and a set of raw tool documents, the goal is to select and execute tools {a_1, ..., a_K} to solve T.

Inputs: User request and raw, noisy tool documentation (JSON/text)

Outputs: Sequence of tool calls and final answer

Pipeline Flow

Preprocessing: Tool Description Generation (Raw Doc → Concise Description)
Preprocessing: Functionality Guideline Construction (Concise Description → Scenarios + Examples)
Inference: Tool Retrieval (Query → Candidate Tools)
Inference: Tool Execution (Selection → Parameter Generation → API Call)
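
The two inference steps above can be sketched as a small function that takes the model-dependent parts as callables. Everything here is illustrative: the ToolInstruction fields, the word-overlap retriever (a simple stand-in, not the retriever used in the paper), and the choose, fill_parameters, and call_api hooks are assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolInstruction:
    name: str
    description: str
    example_parameters: dict


def retrieve(query: str, tools: list, top_k: int = 2) -> list:
    """Rank tools by word overlap between the query and each description."""
    query_words = set(query.lower().split())

    def overlap(tool: ToolInstruction) -> int:
        return len(query_words & set(tool.description.lower().split()))

    return sorted(tools, key=overlap, reverse=True)[:top_k]


def run_request(
    query: str,
    tools: list,
    choose: Callable,
    fill_parameters: Callable,
    call_api: Callable,
):
    """Retrieval -> selection -> parameter generation -> API call."""
    candidates = retrieve(query, tools)
    tool = choose(query, candidates)
    parameters = fill_parameters(query, tool)
    return call_api(tool.name, parameters)


tools = [
    ToolInstruction("get_weather", "get the weather forecast for a city", {"city": "Paris"}),
    ToolInstruction("list_movies", "list movies with optional ratings", {"with_rt_ratings": True}),
]

# Stubs stand in for the LLM decisions and the real HTTP call.
result = run_request(
    "list movies with ratings",
    tools,
    choose=lambda query, candidates: candidates[0],
    fill_parameters=lambda query, tool: dict(tool.example_parameters),
    call_api=lambda name, parameters: {"tool": name, "parameters": parameters},
)
print(result)
```

Passing the LLM-driven steps in as callables keeps the sketch independent of any particular model or agent strategy such as ReAct or DFSDT.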

System Modules

Tool Description Generator (Preprocessing)

Distill raw documentation into functional summaries

Model or implementation: ChatGPT (gpt-3.5-turbo)

Guideline Constructor (Preprocessing)

Synthesize usage examples

Model or implementation: ChatGPT (gpt-3.5-turbo)

Agent Inference

Execute user request using processed instructions

Model or implementation: Evaluated on ChatGPT, GPT-4, Vicuna-7B, Mistral-Instruct-7B

Modeling

Base Model: Evaluated on multiple models: ChatGPT, GPT-4, Vicuna-7B, Mistral-Instruct-7B

Comparison to Prior Work

vs. ToolLLaMA: EASYTOOL is a plug-and-play preprocessing method that works with frozen LLMs (including closed-source) without fine-tuning.
vs. RestGPT: Reduces token cost significantly (97% on RestBench) and synthesizes better usage examples compared to retrieving raw docs.
vs. LLMLingua [not cited in paper]: Unlike general prompt compression, EASYTOOL creates structured, human-readable instructions specific to API semantics rather than just token pruning.

Limitations

Cannot handle tool documentation exceeding the context window of the preprocessing LLM (ChatGPT) without splitting.
Processes tools individually, ignoring inter-tool dependencies which might be crucial for complex workflows.
Relies on the instruction-following capability of the base model; less effective on weaker models.

Reproducibility

Code: https://github.com/microsoft/JARVIS/tree/main/easytool

The code is publicly available at the repository above. The evaluation uses standard benchmarks (ToolBench, RestBench, FuncQA). Preprocessing relies on ChatGPT, so exact reproduction depends on the OpenAI API version used.

📊 Experiments & Results

Evaluation Setup

Tool-use evaluation on real-world APIs and math reasoning tasks.

Benchmarks:

ToolBench: Real-world REST API calls (I2-Category, I3-Instruction subsets)
RestBench: Web service task planning (TMDB subset)
FuncQA: Numerical reasoning with tool use

Metrics:

Success Rate (GPT-4 judged)
Pass Rate
Win Rate (vs ChatGPT-ReAct)
Correct Path Rate (CP%)
NDCG (for retrieval)
Statistical methodology: Not explicitly reported in the paper
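
Of the metrics listed above, NDCG is the one with a closed-form definition that is easy to check by hand. The sketch below uses the standard formulation with a log2 discount; it assumes the ranked list already contains every relevant tool, so the ideal ordering is just the same relevance values sorted in descending order. How the paper's evaluation scripts compute it in detail is not stated here.

```python
import math


def dcg_at_k(relevances: list, k: int) -> float:
    """Discounted cumulative gain over the top-k positions."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: list, k: int) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# 1 marks a relevant tool, 0 an irrelevant one, in retrieved order.
print(ndcg_at_k([1, 0, 0], k=5))  # relevant tool ranked first
print(ndcg_at_k([0, 1, 0], k=5))  # relevant tool ranked second
```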

Key Results

Performance on ToolBench (Average of I2 and I3 subsets). EASYTOOL improves Success Rate across most models.
Benchmark	Metric	Baseline	This Paper	Δ
ToolBench	Success Rate	64.3	72.8	+8.5
ToolBench	Success Rate	62.3	69.8	+7.5
ToolBench	Success Rate	61.0	70.5	+9.5
ToolBench (I2 + I3)	NDCG@5	38.8	85.6	+46.8
RestBench (TMDB)	Correct Path Rate	45.0	65.0	+20.0

Experiment Figures

Error rates of tool calls (Tool Name Error vs Parameter Error) for different models with/without EASYTOOL.

Correct Path Rate (CP%) on RestBench for different methods.

Main Takeaways

Significant token reduction: Tool instructions are 70% (ToolBench) to 97% (RestBench) shorter than raw docs.
Plug-and-play generalization: Lets open-source models such as Mistral-Instruct-7B outperform specialized fine-tuned models such as ToolLLaMA.
Reduced error rates: Synthesized usage scenarios help models avoid common 'Parameter Error' and 'Tool Name Error' pitfalls.
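
To make the two error types concrete, the sketch below classifies a single tool call against a small schema. The classification rules are an assumption about what 'Tool Name Error' and 'Parameter Error' mean (unknown tool versus missing, unknown, or wrongly typed arguments); the paper's exact criteria may differ, and the registry layout and the 'limit' parameter are hypothetical.

```python
def classify_tool_call(call: dict, registry: dict) -> str:
    """Return 'ok', 'tool_name_error', or 'parameter_error' for one call."""
    spec = registry.get(call.get("name"))
    if spec is None:
        return "tool_name_error"  # the model named a tool that does not exist

    arguments = call.get("arguments", {})
    required = spec.get("required", {})
    allowed = {**required, **spec.get("optional", {})}

    # Every required parameter must be present.
    if any(name not in arguments for name in required):
        return "parameter_error"
    # No unknown parameters, and each value must have the declared type.
    for name, value in arguments.items():
        if name not in allowed or not isinstance(value, allowed[name]):
            return "parameter_error"
    return "ok"


# Hypothetical registry: parameter name -> expected Python type.
registry = {
    "list_movies": {
        "required": {"with_rt_ratings": bool},
        "optional": {"limit": int},
    }
}

calls = [
    {"name": "list_movies", "arguments": {"with_rt_ratings": True}},
    {"name": "listMovies", "arguments": {"with_rt_ratings": True}},
    {"name": "list_movies", "arguments": {}},
    {"name": "list_movies", "arguments": {"with_rt_ratings": "true"}},
]
labels = [classify_tool_call(call, registry) for call in calls]
print(labels)
```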

📚 Prerequisite Knowledge

Prerequisites

Familiarity with LLM-based agents (ReAct, DFSDT)
Understanding of API structures (RESTful APIs, parameters)
Basic prompt engineering concepts

Key Terms

DFSDT: Depth-First Search-based Decision Tree, an inference strategy where the LLM explores possible tool execution paths as a tree search

ReAct: Reasoning + Acting, a prompting paradigm where the model generates a thought trace before taking an action (tool call)

ToolBench: A benchmark dataset containing diverse user requests and real-world REST APIs from RapidAPI

RestBench: A benchmark for evaluating tool-use in real-world web service scenarios (e.g., TMDB, Spotify)

NDCG: Normalized Discounted Cumulative Gain, a measure of ranking quality used here to evaluate tool retrieval

CoT: Chain-of-Thought, prompting the model to generate intermediate reasoning steps

dense retrieval: Retrieving relevant items based on semantic vector similarity (embeddings) rather than keyword matching
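
As an illustration of the dense retrieval term above, the sketch below ranks tools by cosine similarity between a query vector and per-tool vectors. The three-dimensional vectors are toy values invented for the example; a real system would obtain them from an embedding model, and the function names are assumptions.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dense_retrieve(query_vector: list, tool_vectors: dict, top_k: int = 1) -> list:
    """Return the names of the top_k tools most similar to the query."""
    ranked = sorted(
        tool_vectors,
        key=lambda name: cosine_similarity(query_vector, tool_vectors[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy embeddings (hypothetical values, not produced by any real model).
tool_vectors = {
    "get_weather": [0.0, 0.2, 0.9],
    "list_movies": [0.9, 0.1, 0.0],
}
query_vector = [0.8, 0.2, 0.1]
top_tools = dense_retrieve(query_vector, tool_vectors, top_k=1)
print(top_tools)
```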