WTU-EVAL: A Whether-or-Not Tool Usage Evaluation Benchmark for Large Language Models

📝 Paper Summary

Multi-call tool use with flexible plan
Tool-use post-training

The paper introduces a benchmark that evaluates whether LLMs can correctly decide *when* to use tools, and shows that unnecessary tool usage hurts performance on general tasks.

Core Problem

Current tool-learning research assumes mandatory tool use, but real-world scenarios require models to discern whether tools are actually necessary.

Why it matters:

Unnecessary tool usage incurs redundant computational costs and latency.
Incorrect tool invocation parameters or misuse on general tasks can actively damage model accuracy compared to using internal knowledge.
Existing benchmarks focus on *how* to use tools, ignoring the critical decision of *whether* to use them.

Concrete Example: When asked about a Quarter Pounder's weight (113.4g vs 120.5g), ChatGPT invokes a calculator tool with incorrect parameters, resulting in a wrong answer, whereas its internal knowledge would likely have sufficed.

Key Novelty

WTU-Eval Benchmark & Decision-Aware Fine-Tuning

Establishes a four-region evaluation framework (R1-R4): tool-necessary tasks and general tasks, each evaluated with and without tool access.
Identifies 'Whether-or-Not' decision making as a distinct capability gap in current LLMs.
Proposes a fine-tuning strategy focusing on the 'Thought' and 'Action' steps of ReACT to teach models to refrain from using tools when internal knowledge is sufficient.

Architecture

The four evaluation regions (R1-R4) of the WTU-Eval benchmark, illustrating the relationship between Task Type (General vs. Tool) and Tool Availability (With vs. Without).

Evaluation Highlights

Fine-tuning Llama2-7B on the proposed dataset improves average performance by 14% across the benchmark.
Incorrect tool usage decreases by 16.8% after fine-tuning Llama2-7B.
On PIQA's Search Engine task, decision-aware fine-tuning improves accuracy by 40% while reducing calculator call rates by 74%.

Breakthrough Assessment

7/10

Highlights a critical but overlooked problem (tool overuse) and provides a solid benchmark/solution. The method is straightforward SFT, but the evaluation framework is the primary contribution.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of LLMs on a mixed set of tasks where T_tool requires tools and T_general is solvable by internal knowledge, with the model having optional access to a tool pool P.

Inputs: Natural language question q and a set of available tools P (descriptions and parameters).

Outputs: Final answer a, potentially derived via a ReACT trace (Thought, Action, Observation) or direct generation.

Pipeline Flow

Input Question
ReACT Loop (Thought -> Decision: Use Tool vs. Direct Answer)
If Tool: Action -> Observation -> Final Answer
If Direct: Final Answer
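
The flow above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `Step`, `react_answer`, `toy_policy`, and the one-entry tool registry are hypothetical names introduced here, and `toy_policy` is a hand-written stand-in for the LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One model turn: a Thought plus either a tool call or a final answer."""
    thought: str
    tool: Optional[str] = None        # tool name, or None for a direct answer
    tool_input: Optional[str] = None  # argument string passed to the tool
    answer: Optional[str] = None      # final answer when no tool is called


def react_answer(
    question: str,
    policy: Callable[[str, List[str]], Step],  # hypothetical stand-in for the LLM
    tools: Dict[str, Callable[[str], str]],
    max_calls: int = 3,
) -> Tuple[Optional[str], int]:
    """Run Thought -> (Action -> Observation)* -> Final Answer.

    Returns (final_answer, number_of_tool_attempts). The answer is None if
    the call budget runs out before the policy produces one.
    """
    observations: List[str] = []
    while True:
        step = policy(question, observations)
        if step.tool is None or len(observations) >= max_calls:
            return step.answer, len(observations)
        tool_fn = tools.get(step.tool)
        if tool_fn is None:
            observations.append(f"unknown tool: {step.tool}")
        else:
            observations.append(tool_fn(step.tool_input or ""))


def toy_policy(question: str, observations: List[str]) -> Step:
    """Toy decision rule: call the calculator only if the question has digits."""
    if observations:
        return Step(thought="I have the result.", answer=observations[-1])
    if any(ch.isdigit() for ch in question):
        return Step(thought="This needs arithmetic.",
                    tool="calculator", tool_input="2+3")
    return Step(thought="I already know this.", answer="yes")


# A toy calculator that only sums integers separated by '+'.
toy_tools = {"calculator": lambda expr: str(sum(int(x) for x in expr.split("+")))}
```

Calling `react_answer("What is 2+3?", toy_policy, toy_tools)` takes the tool branch once, while a question without digits is answered directly with zero tool calls. Returning the number of tool attempts alongside the answer makes it easy to measure how often the tool branch is taken.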

System Modules

Tool Decider (Implicit)

The LLM determines via the 'Thought' step whether the question requires external tools or internal knowledge.

Model or implementation: Various LLMs (Llama2, ChatGPT, etc.)

Tool Executor

Executes the API call generated by the LLM.

Model or implementation: External APIs (Baidu Translator, WolframAlpha, Bing Search, Wikipedia)
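
A tool executor of this kind can be sketched as a registry that maps tool names to callables. The sketch below is illustrative only: the `Name[argument]` action syntax, `parse_action`, `execute`, and the local `safe_calculator` stub are assumptions made here, and no real Baidu, WolframAlpha, Bing, or Wikipedia API is called.

```python
import ast
import operator
import re
from typing import Callable, Dict, Optional, Tuple

# Arithmetic operators the stub calculator is willing to evaluate.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def safe_calculator(expr: str) -> str:
    """Evaluate +, -, *, / arithmetic by walking the AST instead of eval()."""
    def ev(node: ast.AST):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return str(ev(ast.parse(expr, mode="eval").body))


# Assumed action format: ToolName[argument]
ACTION_RE = re.compile(r"^\s*(\w+)\[(.*)\]\s*$", re.DOTALL)


def parse_action(text: str) -> Optional[Tuple[str, str]]:
    """Split 'ToolName[argument]' into (name, argument); None if malformed."""
    match = ACTION_RE.match(text)
    return (match.group(1), match.group(2)) if match else None


def execute(action_text: str, tools: Dict[str, Callable[[str], str]]) -> str:
    """Run the requested tool and return an Observation string.

    Errors are returned as text so the model can see them in the next turn.
    """
    parsed = parse_action(action_text)
    if parsed is None:
        return "Error: could not parse action"
    name, arg = parsed
    if name not in tools:
        return f"Error: unknown tool {name}"
    try:
        return tools[name](arg)
    except Exception as exc:
        return f"Error: {exc}"
```

Returning errors as observations rather than raising keeps malformed calls visible, which matters here because incorrect invocation parameters are one of the failure modes the benchmark studies.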

Novel Architectural Elements

Evaluation framework specifically segmenting tasks into four regions (R1-R4) based on tool necessity and availability to isolate the 'whether-to-use' decision impact.

Modeling

Base Model: Llama2-7B (for main fine-tuning experiments)

Training Method: Supervised Fine-Tuning (SFT) focusing on ReACT traces

Objective Functions:

Purpose: Optimize the generation of the correct Thought and Action steps.

Formally: Standard cross-entropy loss on the decision tokens.
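
One way to write this objective is given below. The summary does not state the exact formula, so the mask $m_t$ and the notation are assumptions: $y_1, \dots, y_T$ are the tokens of the target ReACT trace, $q$ and $P$ are the question and tool pool from the Problem Definition, and $m_t = 1$ only for tokens inside the Thought and Action steps.

```latex
\mathcal{L}(\theta) = -\frac{1}{\sum_{t=1}^{T} m_t} \sum_{t=1}^{T} m_t \, \log p_\theta\!\left(y_t \mid q, P, y_{<t}\right), \qquad m_t \in \{0, 1\}
```

With $m_t = 1$ for every token this reduces to ordinary next-token cross-entropy over the whole trace.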

Adaptation: Full fine-tuning

Training Data:

Curated dataset of 4000 examples from WTU-Eval training sets
GPT-4 used to generate the 'Thought' step for supervision
Correct actions selected for general questions (i.e., refrain from tool use)
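
To make the supervision target concrete, the sketch below builds one training example in which only the Thought and Action tokens carry labels. Everything here is hypothetical: the prompt template, the `Finish[...]` action string, the whitespace tokenizer, and `build_example` are illustrations, not the paper's data format. The value -100 is used because it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
from typing import Dict, List, Tuple

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def build_example(
    question: str,
    thought: str,
    action: str,
    vocab: Dict[str, int],
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) with the loss restricted to Thought + Action.

    Uses a toy whitespace tokenizer; a real setup would use the model's own
    tokenizer and prompt template.
    """
    prompt = f"Question: {question}".split()
    target = f"Thought: {thought} Action: {action}".split()
    # Assign each new token the next free id in the toy vocabulary.
    input_ids = [vocab.setdefault(tok, len(vocab)) for tok in prompt + target]
    # Mask the prompt; supervise only the decision tokens.
    labels = [IGNORE_INDEX] * len(prompt) + input_ids[len(prompt):]
    return input_ids, labels


# A general question whose correct action is to refrain from tool use.
vocab: Dict[str, int] = {}
input_ids, labels = build_example(
    question="Is ice cold?",
    thought="No tool is needed.",
    action="Finish[yes]",
    vocab=vocab,
)
```

In this example the four prompt tokens are masked and the seven Thought and Action tokens are supervised, so the gradient comes only from the decision the model is being taught to make.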

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper
epochs: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolBench: WTU-Eval explicitly evaluates performance on *general* tasks when tools are available, measuring degradation due to misuse.
vs. Toolformer: WTU-Eval focuses on the decision boundary (whether-or-not) rather than just generation of calls.
vs. MetaTool: MetaTool assesses tool selection but ignores the *negative impact* (damage) of incorrect/unnecessary tool usage on general capabilities.

Limitations

Evaluation is limited to specific tools (Calculator, Search, Translator, Wiki) and may not generalize to all tool types.
Fine-tuning details (hyperparameters) are sparse, hindering exact reproduction.
Reliance on proprietary LLMs (GPT-4) for generating training data for the 'Thought' step.

Reproducibility

Code and benchmark not yet released (paper states 'We will release the WTU-Eval benchmark'). Prompts for zero-shot and few-shot are described in Appendix D. Specific hyperparameters for fine-tuning are missing.

📊 Experiments & Results

Evaluation Setup

ReACT framework used for all models. Comparison of 4 settings: R1 (Tool-Task, No Tool), R2 (Tool-Task, With Tool), R3 (General-Task, No Tool), R4 (General-Task, With Tool).
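
The four settings are the cross product of task type and tool availability. A minimal sketch of that mapping follows; the `Region` enum and `region` function are names introduced here for illustration.

```python
from enum import Enum


class Region(Enum):
    R1 = "tool task, no tool access"
    R2 = "tool task, with tool access"
    R3 = "general task, no tool access"
    R4 = "general task, with tool access"


def region(task_needs_tool: bool, tools_available: bool) -> Region:
    """Map (task type, tool availability) to its evaluation region."""
    if task_needs_tool:
        return Region.R2 if tools_available else Region.R1
    return Region.R4 if tools_available else Region.R3
```

Comparing R3 with R4 isolates the cost of offering tools on tasks that do not need them, and comparing R1 with R2 isolates the benefit on tasks that do.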

Benchmarks:

Tool-Usage Datasets (Tasks requiring external info/compute)
General Datasets (Commonsense/Reasoning solvable by LLM)

Metrics:

Accuracy
Call Rate (frequency of tool invocation)
Incorrect Tool Usage Rate
Statistical methodology: Not explicitly reported in the paper
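
The three metrics can be computed from per-question records as sketched below. The `Record` fields and function names are hypothetical. Incorrect tool usage is taken here to mean a tool call on a question that did not need one, which is an assumption; the paper's exact definition may differ (for example, it may also count malformed calls).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    correct: bool      # final answer matched the reference
    called_tool: bool  # the model invoked at least one tool
    tool_needed: bool  # the question comes from a tool-usage dataset


def accuracy(records: List[Record]) -> float:
    """Fraction of questions answered correctly."""
    return sum(r.correct for r in records) / len(records)


def call_rate(records: List[Record]) -> float:
    """Fraction of questions on which the model invoked a tool."""
    return sum(r.called_tool for r in records) / len(records)


def incorrect_tool_usage_rate(records: List[Record]) -> float:
    """Fraction of questions with a tool call that was not needed (assumed definition)."""
    return sum(r.called_tool and not r.tool_needed for r in records) / len(records)


records = [
    Record(correct=True, called_tool=False, tool_needed=False),
    Record(correct=False, called_tool=True, tool_needed=False),
    Record(correct=True, called_tool=True, tool_needed=True),
    Record(correct=True, called_tool=False, tool_needed=False),
]
```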

Key Results

Impact of tool availability on General Datasets (comparing R3 vs R4). Demonstrates that tool access often hurts performance on tasks that don't need tools.
Benchmark	Metric	Baseline	This Paper	Δ
BoolQ (General)	Accuracy (Zero-shot)	83.6	59.3	-24.3
PIQA (General)	Accuracy (Zero-shot)	78.4	46.1	-32.3

Impact of decision-aware fine-tuning (SFT) on Llama2-7B performance.
Benchmark	Metric	Baseline	This Paper	Δ
Average General Datasets	Performance Improvement	0	14	+14
General Datasets	Incorrect Tool Usage Rate	16.8	0	-16.8
PIQA (Search Engine)	Accuracy	Not reported in the paper	Not reported in the paper	Not reported in the paper
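
For the BoolQ and PIQA rows, Δ is the second accuracy minus the first, in percentage points. The short check below recomputes it and also reports the change relative to the baseline. It assumes the Baseline column is the no-tool setting (R3) and the next column is the with-tool setting (R4), which is how the R3 vs R4 comparison reads; the helper names are introduced here.

```python
# (benchmark, accuracy without tools, accuracy with tools), copied from the table
rows = [("BoolQ", 83.6, 59.3), ("PIQA", 78.4, 46.1)]


def delta_points(no_tool: float, with_tool: float) -> float:
    """Absolute change in percentage points."""
    return round(with_tool - no_tool, 1)


def relative_change_pct(no_tool: float, with_tool: float) -> float:
    """Change relative to the no-tool accuracy, in percent."""
    return round(100.0 * (with_tool - no_tool) / no_tool, 1)


for name, no_tool, with_tool in rows:
    print(name, delta_points(no_tool, with_tool),
          relative_change_pct(no_tool, with_tool))
```

The distinction matters when reading the summary: a 32.3-point drop on PIQA is a relative drop of roughly 41% from the no-tool accuracy.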

Experiment Figures

Distribution of error types for Llama2-7B in math (Tool) vs commonsense (General) tasks.

Main Takeaways

Access to tools is not always beneficial; for general tasks, it often degrades performance significantly (up to ~83% drop in worst cases) due to misuse.
Larger models (ChatGPT) benefit more from tool access in tool-heavy tasks than smaller models (Llama2), which struggle with complex tool prompts.
Ranked by negative impact on general datasets, the tools fall into two groups: (Wikipedia Search, Search Engine) > (Translator, Calculator).
Fine-tuning specifically on the decision boundary ('Thought' step) effectively reduces unnecessary calls and restores performance on general tasks.

📚 Prerequisite Knowledge

Prerequisites

Understanding of ReACT (Reasoning and Acting) prompting
Familiarity with tool-use in LLMs (API calls)
Basic knowledge of standard NLP benchmarks (GSM8K, BoolQ, etc.)

Key Terms

WTU-Eval: Whether-or-not Tool Usage Evaluation—the proposed benchmark containing both tool-obligatory and tool-unnecessary datasets.

ReACT: Reasoning and Acting—a prompting paradigm where models generate a Thought, Action, and Observation loop to solve tasks.

Chain of Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer.

SFT: Supervised Fine-Tuning—training a pre-trained model on a smaller, specific dataset to adapt its behavior.

Zero-shot: Evaluating a model without providing any example input-output pairs in the prompt.

Few-shot: Evaluating a model by providing a small number of example input-output pairs in the prompt.

Agent Tuning: Fine-tuning specifically designed to improve an LLM's ability to act as an agent (using tools, planning).

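R1/R2/R3/R4: The four evaluation regions in WTU-Eval. R1 and R2 cover tool-requiring tasks; R3 and R4 cover general tasks. R1 and R3 are evaluated without tools (baselines), while R2 and R4 are evaluated with tool access (testing flexibility).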