ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models

📝 Paper Summary

Hallucination suppression, Agentic AI, Benchmarks and evaluation

ToolBH is a diagnostic benchmark assessing how tool-augmented LLMs handle unsolvable tasks, analyzing hallucinations through a three-level depth framework and three breadth scenarios.

Core Problem

Existing tool-use benchmarks assume that every necessary tool is provided and that every task is solvable, so they fail to evaluate how LLMs handle 'unsolvable' scenarios where tools are missing, irrelevant, or limited.

Why it matters:

Real-world tool libraries are often incomplete or mismatched for specific user queries, leading models to hallucinate non-existent tools or misuse existing ones.
Current benchmarks (AgentBench, ToolBench) focus on successful completion rather than failure handling, masking critical reliability issues in AGI development.

Concrete Example: A user asks for a video download, but the toolset only contains a weather API. Instead of stating the task is unsolvable, the LLM hallucinates a 'VideoDownloader' tool or misuses the weather API to attempt the task.
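
The first failure in this example (inventing a tool) can be checked mechanically. The following is a minimal sketch, assuming the model's tool calls are available as a list of names; `find_nonexistent_tools` and the tool names are illustrative and are not taken from the ToolBH code.

```python
from typing import Iterable


def find_nonexistent_tools(called: Iterable[str], available: Iterable[str]) -> list[str]:
    """Return the tool names the model called that are not in the provided toolset."""
    allowed = set(available)
    return [name for name in called if name not in allowed]


# Hypothetical toolset and model output mirroring the example above.
toolset = ["WeatherAPI"]
model_calls = ["VideoDownloader"]

hallucinated = find_nonexistent_tools(model_calls, toolset)
print(hallucinated)
```

A name check like this only catches fabricated tools. Misusing the weather API for a download task would pass it, which is why the benchmark also examines planning.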

Key Novelty

Multi-level Hallucination Diagnostic Benchmark (ToolBH)

Decomposes evaluation into 'depth' (diagnosing where the error occurs: solvability detection, planning, or specific tool analysis) and 'breadth' (different failure scenarios like missing tools or limited functionality).
Introduces an 'UnsolvableQuery' tool concept that models must actively select when a sub-goal cannot be met, rather than just outputting a generic error.

Architecture

The three-level diagnostic framework of ToolBH.

Evaluation Highlights

Open-weight model Llama-3-70B achieves only 32% of Gemini-1.5-Pro's score and 40% of GPT-4o's score, showing a massive gap in unsolvable task handling.
Gemini-1.5-Pro and GPT-4o achieve total scores of only 45.3 and 37.0 out of 100, indicating significant difficulty even for SOTA models.
Proprietary models generally struggle with instrumental reasoning errors, while open-weight models suffer more from solvability hallucinations (misjudging task feasibility).
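
As a sanity check on the first two highlights, the two ratios can be converted back into an approximate absolute score for Llama-3-70B. This is a small sketch using only the numbers quoted above; the resulting value is inferred, not a figure reported in this summary.

```python
# Total scores quoted in the highlights (out of 100).
gemini_total = 45.3
gpt4o_total = 37.0

# Llama-3-70B is described as 32% of Gemini-1.5-Pro and 40% of GPT-4o.
llama_from_gemini = 0.32 * gemini_total
llama_from_gpt4o = 0.40 * gpt4o_total

# Both routes should land on roughly the same score if the figures are consistent.
print(round(llama_from_gemini, 1), round(llama_from_gpt4o, 1))
```

Both estimates fall between 14 and 15 points, so the two percentages are mutually consistent up to rounding.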

Breakthrough Assessment

8/10

Significant contribution by shifting focus from 'can it solve X' to 'does it know it can't solve X'. The multi-level diagnostic approach provides granular insights into hallucination mechanics.

⚙️ Technical Details

Problem Definition

Setting: Evaluating tool-augmented LLMs on tasks that are partially or fully unsolvable due to tool constraints.

Inputs: User query q and a set of available tools T.

Outputs: A determination of solvability, a plan of tool calls (if solvable parts exist), or a description of missing tools (if unsolvable).

Pipeline Flow

Level-1: Solvability Detection (Can this be solved?)
Level-2: Solution Planning (Decompose into sub-goals and map to tools)
Level-3: Missing-Tool Analysis (Describe what is missing for unsolvable steps)
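
One way to picture the three levels is as fields of a single record that should agree with each other. The sketch below is an assumption about how such a record could look; the `Diagnosis` class and `is_consistent` helper are hypothetical, and only the `UnsolvableQuery` name comes from the paper summary.

```python
from dataclasses import dataclass, field

UNSOLVABLE = "UnsolvableQuery"  # pseudo-tool name from the summary; the rest is hypothetical


@dataclass
class Diagnosis:
    solvable: bool                                          # Level-1: solvability detection
    plan: list[str] = field(default_factory=list)           # Level-2: one tool name per sub-goal
    missing: dict[int, str] = field(default_factory=dict)   # Level-3: step index -> description


def is_consistent(d: Diagnosis) -> bool:
    """Check that the three levels of one answer agree with each other."""
    unsolvable_steps = {i for i, tool in enumerate(d.plan) if tool == UNSOLVABLE}
    # A task judged solvable should contain no UnsolvableQuery steps, and vice versa.
    level1_matches_level2 = d.solvable == (not unsolvable_steps)
    # Every unsolvable step should carry a missing-tool description, and no other step should.
    level2_matches_level3 = set(d.missing) == unsolvable_steps
    return level1_matches_level2 and level2_matches_level3


example = Diagnosis(
    solvable=False,
    plan=["WeatherAPI", UNSOLVABLE],
    missing={1: "A tool that downloads a video from a URL"},
)
print(is_consistent(example))
```

The benchmark scores each level separately. The consistency check here is only an illustration of how the levels relate.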

System Modules

Solvability Detector

Determine if the task is solvable given the tools

Model or implementation: Evaluated LLM (e.g., GPT-4o, Llama-3)

Planner

Generate a sequence of tool calls for sub-goals

Model or implementation: Evaluated LLM

Analyst

Explain why a step is unsolvable by describing the missing tool

Model or implementation: Evaluated LLM

Novel Architectural Elements

Three-level diagnostic pipeline (Detection -> Planning -> Analysis) specifically designed for failure modes rather than success modes.
Integration of an 'UnsolvableQuery' pseudo-tool into the action space to make failure detection explicit.
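
The effect of adding the pseudo-tool to the action space can be sketched as follows. This is an illustrative toy planner, assuming sub-goals are matched to tools through a simple lookup table; `build_action_space`, `plan_subgoals`, and the capability names are hypothetical and do not come from the paper's code.

```python
UNSOLVABLE_QUERY = "UnsolvableQuery"  # name from the summary


def build_action_space(tools: list[str]) -> list[str]:
    """Append the pseudo-tool so that 'this cannot be done' is a selectable action."""
    return list(tools) + [UNSOLVABLE_QUERY]


def plan_subgoals(subgoals: list[str], tool_for_goal: dict[str, str]) -> list[str]:
    """Assign each sub-goal a tool, falling back to UnsolvableQuery when none fits."""
    return [tool_for_goal.get(goal, UNSOLVABLE_QUERY) for goal in subgoals]


# Hypothetical toolset: only a weather tool is available.
actions = build_action_space(["WeatherAPI"])
plan = plan_subgoals(
    ["get_weather", "download_video"],
    {"get_weather": "WeatherAPI"},
)
print(actions)
print(plan)
```

Because the failure is expressed as a tool choice, it can be scored with the same sequence-matching logic as any other step of the plan.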

Modeling

Base Model: The benchmark is applied to 14 models, including GPT-4o, Gemini-1.5-Pro, and Llama-3-70B.

Comparison to Prior Work

vs. AgentBench/ToolBench: ToolBH specifically targets 'unsolvable' scenarios and diagnoses *why* models fail on them (depth), rather than reporting only success rates.
vs. MetaTool: ToolBH goes beyond binary solvability to require planning (identifying *which* step fails) and analysis (describing *what* is missing).
vs. T-Eval [not cited in paper]: T-Eval assesses tool utilization but focuses on step-wise correctness in solvable paths, whereas ToolBH focuses on detection and explanation of unsolvability.

Limitations

Benchmark relies on model-based evaluation (embedding similarity) for Level-3 metrics, which may introduce noise.
Focus is primarily on textual tool use; does not cover multi-modal tool scenarios.
The definition of 'unsolvable' relies heavily on the specific constraints of the provided tool descriptions, which might be ambiguous in edge cases.

Reproducibility

Code: https://github.com/ToolBeHonest/ToolBeHonest

📊 Experiments & Results

Evaluation Setup

700 samples: 7 tasks, each with 50 solvable and 50 unsolvable samples. Models are evaluated on their ability to detect unsolvability, plan partial solutions, and describe missing tools.

Benchmarks:

ToolBH (Hallucination Diagnosis in Tool Use) [New]

Metrics:

Exact Match (EM) for Solvability Detection
Progress Rate (PR) for Solution Planning
Matching Score (MS) for Missing-Tool Analysis
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Reference model	Compared model	Δ
ToolBH	Total Score (0-100)	37.0 (GPT-4o)	45.3 (Gemini-1.5-Pro)	+8.3
ToolBH	Percentage of Best Proprietary Score	100 (Gemini-1.5-Pro)	32 (Llama-3-70B)	-68
The first row compares the two strongest proprietary models; the second compares the best model (Gemini-1.5-Pro) with a leading open-weight model (Llama-3-70B) and shows the large gap in capability.

Experiment Figures

Conceptual illustration of hallucination types in tool use.

Main Takeaways

A larger parameter count does not guarantee better performance in tool hallucination diagnosis; training data and response strategies (e.g., verbosity) play crucial roles.
Primary error source is Solvability Detection (Level-1); models frequently fail to recognize valid constraints or missing tools before even attempting planning.
Open-weight models suffer performance drops with verbose replies, while proprietary models (Gemini/GPT-4) handle longer reasoning chains better.
Proprietary models make fewer 'Non-existent Tool' errors but struggle more with 'Instrumental Reasoning' (logic of tool application).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Tool-Augmented LLMs (agents that call APIs)
Familiarity with Hallucination in LLMs (fabricating facts or tools)
Knowledge of ReAct (Reasoning + Acting) prompting
Basic precision/recall metrics

Key Terms

UnsolvableQuery: A special virtual tool introduced by the authors that the model must call when it determines a specific sub-goal cannot be achieved with available tools.

Solvability Detection: Level-1 diagnostic task where the model determines if a user query is addressable with the given toolset (binary classification).

Solution Planning: Level-2 diagnostic task requiring the model to decompose queries into sub-goals and assign tools (or UnsolvableQuery) to them.

Missing-Tool Analysis: Level-3 diagnostic task where the model must describe the functionality of the missing tool required for an unsolvable sub-goal.

EM: Exact Match—a metric checking if the model's binary solvability prediction matches the ground truth.
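
A minimal sketch of EM as described here, assuming predictions and labels are aligned lists of booleans (`True` meaning "solvable"); the function name and the sample data are illustrative.

```python
def exact_match(predictions: list[bool], labels: list[bool]) -> float:
    """Fraction of samples whose predicted solvability equals the ground truth."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical batch: the third prediction is wrong.
em = exact_match([True, False, True, True], [True, False, False, True])
print(em)
```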

PR: Progress Rate—a metric inspired by Precision@k that measures the accuracy of the predicted tool sequence up to the first mismatch.
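
The "up to the first mismatch" behavior can be illustrated with a short sketch. It assumes the score is the length of the matching prefix divided by the length of the reference plan; that normalization is an assumption made for illustration, and the paper's exact formula may differ.

```python
def progress_rate(predicted: list[str], reference: list[str]) -> float:
    """Share of the reference plan reproduced before the first wrong step."""
    if not reference:
        return 0.0
    matched = 0
    for pred_tool, ref_tool in zip(predicted, reference):
        if pred_tool != ref_tool:
            break  # everything after the first mismatch is ignored
        matched += 1
    return matched / len(reference)


# Hypothetical plans: the first two steps agree, the third does not.
pr = progress_rate(
    ["Search", "Translate", "VideoDownloader"],
    ["Search", "Translate", "UnsolvableQuery"],
)
print(pr)
```

Under this reading, an early mistake costs more than a late one, because the steps after it earn no credit.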

MS: Matching Score—a metric measuring the semantic similarity (via embedding cosine similarity) between the model's description of a missing tool and the ground truth description.
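
The cosine-similarity step can be sketched without any external library. The embedding below is a toy bag-of-words count, used only as a stand-in for the sentence-embedding model the benchmark would use; `toy_embed` and the example descriptions are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: word counts. A real setup would use a learned embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


reference = toy_embed("downloads a video from a url")
same = cosine_similarity(reference, toy_embed("downloads a video from a url"))
unrelated = cosine_similarity(reference, toy_embed("fetches current weather"))
print(same, unrelated)
```

A word-count embedding gives no credit to paraphrases, which is exactly the case a learned embedding is meant to handle.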

MNT: Missing Necessary Tools—a scenario where a required tool is removed from the set to induce unsolvability.

LFT: Limited Functionality Tools—a scenario where tools exist but lack specific features (e.g., wrong language support) needed for the query.

PT: Potential Tools—a scenario where the environment (OS, Web) implies tools exist (e.g., 'rm' command) that are not in the provided safe list.
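
The MNT and PT scenarios can be illustrated with two small helpers. This is a sketch under simple assumptions (tools are plain names, and the safe list is a list of command names); the helper names and sample tools are hypothetical and do not reflect how the authors built the dataset.

```python
def make_mnt_toolset(tools: list[str], required: str) -> list[str]:
    """MNT: drop a tool the query needs so the task becomes unsolvable."""
    if required not in tools:
        raise ValueError(f"{required!r} is not in the toolset")
    return [tool for tool in tools if tool != required]


def is_potential_tool(command: str, safe_list: list[str]) -> bool:
    """PT: the environment may offer this command, but it is not on the provided list."""
    return command not in safe_list


# Hypothetical toolset and safe list.
mnt_tools = make_mnt_toolset(["Search", "Translate", "Download"], "Download")
rm_is_potential = is_potential_tool("rm", ["ls", "cat"])
print(mnt_tools, rm_is_potential)
```

In both cases the expected behavior is the same: the model should select UnsolvableQuery for the affected step instead of calling a tool that was not provided.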