UnsolvableQuery: A special virtual tool introduced by the authors that the model must call when it determines a specific sub-goal cannot be achieved with available tools.
Solvability Detection: Level-1 diagnostic task where the model determines whether a user query is addressable with the given toolset (a binary classification).
Solution Planning: Level-2 diagnostic task requiring the model to decompose queries into sub-goals and assign tools (or UnsolvableQuery) to them.
Missing-Tool Analysis: Level-3 diagnostic task where the model must describe the functionality of the missing tool required for an unsolvable sub-goal.
EM: Exact Match—a metric checking if the model's binary solvability prediction matches the ground truth.
PR: Progress Rate—a metric inspired by Precision@k that measures the accuracy of the predicted tool sequence up to the first mismatch.
MS: Matching Score—a metric measuring the semantic similarity (via embedding cosine similarity) between the model's description of a missing tool and the ground truth description.
MNT: Missing Necessary Tools—a scenario where a required tool is removed from the set to induce unsolvability.
LFT: Limited Functionality Tools—a scenario where tools exist but lack specific features (e.g., wrong language support) needed for the query.
PT: Potential Tools—a scenario where the environment (OS, Web) implies tools exist (e.g., 'rm' command) that are not in the provided safe list.
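
The three metrics defined above (EM, PR, MS) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function names are hypothetical. Normalizing PR by the length of the ground-truth tool sequence is an assumption, since the glossary only says the score covers the sequence up to the first mismatch. The embedding model that would produce the vectors for MS is not specified here, so the sketch takes precomputed vectors as input.

```python
import math
from typing import Sequence


def exact_match(pred_solvable: bool, gold_solvable: bool) -> float:
    """EM: 1.0 if the binary solvability prediction equals the ground truth."""
    return 1.0 if pred_solvable == gold_solvable else 0.0


def progress_rate(pred_tools: Sequence[str], gold_tools: Sequence[str]) -> float:
    """PR: fraction of the gold tool sequence matched before the first mismatch.

    Assumption: the matched prefix length is divided by len(gold_tools).
    """
    if not gold_tools:
        # Edge case chosen for this sketch: an empty plan is only right if predicted empty.
        return 1.0 if not pred_tools else 0.0
    matched = 0
    for pred, gold in zip(pred_tools, gold_tools):
        if pred != gold:
            break  # stop counting at the first mismatch
        matched += 1
    return matched / len(gold_tools)


def matching_score(pred_vec: Sequence[float], gold_vec: Sequence[float]) -> float:
    """MS: cosine similarity between two description embeddings (precomputed)."""
    dot = sum(a * b for a, b in zip(pred_vec, gold_vec))
    norm_pred = math.sqrt(sum(a * a for a in pred_vec))
    norm_gold = math.sqrt(sum(b * b for b in gold_vec))
    if norm_pred == 0.0 or norm_gold == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_pred * norm_gold)


# Hypothetical plans: the third sub-goal is assigned the wrong tool.
gold_plan = ["search_flights", "UnsolvableQuery", "send_email"]
pred_plan = ["search_flights", "UnsolvableQuery", "translate_text"]
pr = progress_rate(pred_plan, gold_plan)  # two of three steps match
```

In this sketch, a plan that correctly assigns UnsolvableQuery to the unsolvable sub-goal earns credit for that step exactly as it would for any real tool, which matches the glossary's description of UnsolvableQuery as a callable virtual tool.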