
WTU-EVAL: A Whether-or-Not Tool Usage Evaluation Benchmark for Large Language Models

Kangyun Ning, Yisong Su, Xueqiang Lv, Yuanzhe Zhang, Jian Liu, Kang Liu, Jinan Xu
Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, China; College of Computer and Data Science, Fuzhou University; The Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, CAS
arXiv.org (2024)
Tags: Agent, Benchmark, Reasoning

📝 Paper Summary

Multi-call tool use with flexible plan; Tool-use post-training
The paper introduces a benchmark that evaluates whether LLMs can correctly decide *when* to use tools, and shows that unnecessary tool usage hurts performance on general tasks.
Core Problem
Current tool-learning research assumes mandatory tool use, but real-world scenarios require models to discern whether tools are actually necessary.
Why it matters:
  • Unnecessary tool usage incurs redundant computational costs and latency.
  • Incorrect invocation parameters, or tool use on general tasks that do not need it, can lower accuracy below what the model achieves from internal knowledge alone.
  • Existing benchmarks focus on *how* to use tools, ignoring the critical decision of *whether* to use them.
Concrete Example: When asked about a Quarter Pounder's weight (113.4g vs 120.5g), ChatGPT invokes a calculator tool with incorrect parameters, resulting in a wrong answer, whereas its internal knowledge would likely have sufficed.
Key Novelty
WTU-Eval Benchmark & Decision-Aware Fine-Tuning
  • Establishes a two-by-two evaluation framework: performance on tool-necessary tasks and on general tasks is compared both with and without tool access.
  • Identifies 'Whether-or-Not' decision making as a distinct capability gap in current LLMs.
  • Proposes a fine-tuning strategy focusing on the 'Thought' and 'Action' steps of ReAct to teach models to refrain from using tools when internal knowledge is sufficient.
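The whether-or-not decision can be read off a ReAct-style trace by checking which action the model emits. The sketch below is only illustrative: the `Action: Name[argument]` line format, the `Finish` answer action, and the helper name `first_tool_called` are assumptions for this example, not the benchmark's confirmed format.

```python
import re
from typing import Optional

# Assumed ReAct-style action line: "Action: ToolName[argument]".
ACTION_RE = re.compile(r"^Action:\s*(\w+)\[(.*)\]\s*$", re.MULTILINE)

# Assumption: "Finish" is the action that answers directly, without a tool.
NO_TOOL_ACTIONS = {"Finish"}


def first_tool_called(trace: str) -> Optional[str]:
    """Return the first tool named in the trace, or None if no tool was used."""
    for match in ACTION_RE.finditer(trace):
        name = match.group(1)
        if name not in NO_TOOL_ACTIONS:
            return name
    return None


# A trace that answers from internal knowledge (hypothetical content).
direct = "Thought: I already know this.\nAction: Finish[answer]"

# A trace that calls a tool before answering (hypothetical content).
with_tool = (
    "Thought: I should compute this.\n"
    "Action: Calculator[2 + 2]\n"
    "Observation: 4\n"
    "Thought: Done.\n"
    "Action: Finish[4]"
)

print(first_tool_called(direct))
print(first_tool_called(with_tool))
```

Under these assumptions, a model that has learned to refrain from tool use on a general task produces a trace for which the helper returns `None`.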
Architecture
Figure 2
The four evaluation regions (R1-R4) of the WTU-Eval benchmark, illustrating the relationship between Task Type (General vs. Tool) and Tool Availability (With vs. Without).
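As a rough illustration of how results could be grouped along these two axes, the sketch below buckets evaluation records by task type and tool availability, then reports accuracy and tool call rate per cell. The `Record` fields and the `summarize` helper are hypothetical names for this example. Because the mapping of R1-R4 to specific cells is not given here, the cells are keyed descriptively rather than by region label.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    task_type: str         # "general" or "tool" (assumed labels)
    tools_available: bool  # whether tools were offered to the model
    used_tool: bool        # whether the model actually invoked a tool
    correct: bool          # whether the final answer was right


def summarize(records):
    """Group records into (task_type, tools_available) cells and compute rates."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r.task_type, r.tools_available)].append(r)

    summary = {}
    for key, rs in buckets.items():
        n = len(rs)
        summary[key] = {
            "n": n,
            "accuracy": sum(r.correct for r in rs) / n,
            "tool_call_rate": sum(r.used_tool for r in rs) / n,
        }
    return summary


# Toy records, invented purely to exercise the function.
records = [
    Record("general", True, True, False),
    Record("general", True, False, True),
    Record("general", False, False, True),
    Record("tool", True, True, True),
]

for cell, stats in sorted(summary_items := summarize(records).items()):
    print(cell, stats)
```

Comparing the general-task cell with tools against the one without tools is what exposes the cost of unnecessary tool calls.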
Evaluation Highlights
  • Fine-tuning Llama2-7B on the proposed dataset improves average performance by 14% across the benchmark.
  • Incorrect tool usage decreases by 16.8% after fine-tuning Llama2-7B.
  • On PIQA's Search Engine task, decision-aware fine-tuning improves accuracy by 40% while reducing calculator call rates by 74%.
Breakthrough Assessment
7/10
Highlights a critical but overlooked problem (tool overuse) and provides a solid benchmark and solution. The method is straightforward supervised fine-tuning (SFT); the evaluation framework is the primary contribution.