WTU-Eval: Whether-or-not Tool Usage Evaluation—the proposed benchmark containing both tool-obligatory and tool-unnecessary datasets.
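ReAct: Reasoning and Acting, a prompting paradigm in which the model alternates between a Thought and an Action, receives an Observation from the tool or environment after each Action, and repeats this loop until the task is solved.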
Chain of Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer.
SFT: Supervised Fine-Tuning, further training of a pre-trained model on a smaller, task-specific labeled dataset to adapt its behavior.
Zero-shot: Evaluating a model without providing any example input-output pairs in the prompt.
Few-shot: Evaluating a model by providing a small number of example input-output pairs in the prompt.
Agent Tuning: Fine-tuning specifically designed to improve an LLM's ability to act as an agent (using tools, planning).
R1/R2/R3/R4: The four evaluation regions in WTU-Eval. R1 and R3 are evaluated without tools (baselines); R2 and R4 are evaluated with tool access (testing flexibility).
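
To make the ReAct entry above concrete, the Thought/Action/Observation loop can be sketched in Python. This is a minimal illustration, not the WTU-Eval implementation: the `scripted_model` stub, the `calculator` tool, the `Action: tool[input]` syntax, and the `max_steps` limit are all assumptions introduced here. A real setup would replace the stub with a call to an LLM.

```python
import operator
import re
from typing import Callable, Dict, Optional

# Assumed text conventions for this sketch (not taken from the paper).
ACTION_PATTERN = re.compile(r"Action:\s*(\w+)\[(.*)\]")
FINAL_PATTERN = re.compile(r"Final Answer:\s*(.*)")
NUMBER = r"(-?\d+(?:\.\d+)?)"
EXPRESSION_PATTERN = re.compile(rf"\s*{NUMBER}\s*([+\-*/])\s*{NUMBER}\s*")
OPERATORS = {"+": operator.add, "-": operator.sub,
             "*": operator.mul, "/": operator.truediv}


def calculator(expression: str) -> str:
    """Hypothetical tool: evaluates a single 'a op b' arithmetic expression."""
    match = EXPRESSION_PATTERN.fullmatch(expression)
    if match is None:
        return "error: unsupported expression"
    left, op, right = float(match.group(1)), match.group(2), float(match.group(3))
    if op == "/" and right == 0:
        return "error: division by zero"
    return f"{OPERATORS[op](left, right):g}"


def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM: acts once, then answers with the observation."""
    if "Observation:" not in transcript:
        return "Thought: I should use the calculator.\nAction: calculator[12 * 7]"
    observation = transcript.rsplit("Observation:", 1)[1].strip()
    return f"Thought: The tool returned the result.\nFinal Answer: {observation}"


def react_loop(question: str,
               model: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 5) -> Optional[str]:
    """Run Thought/Action/Observation steps until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)          # model emits Thought + Action (or answer)
        transcript += step + "\n"
        final = FINAL_PATTERN.search(step)
        if final:
            return final.group(1).strip()
        action = ACTION_PATTERN.search(step)
        if action is None:
            observation = "error: no action found"
        else:
            name, argument = action.groups()
            tool = tools.get(name)
            observation = tool(argument) if tool else f"error: unknown tool {name}"
        # The Observation comes from the tool, not from the model.
        transcript += f"Observation: {observation}\n"
    return None  # no final answer within max_steps


if __name__ == "__main__":
    print(react_loop("What is 12 * 7?", scripted_model, {"calculator": calculator}))
```

Passing an empty `tools` dictionary gives the model no usable tool, which loosely mirrors the contrast between the tool-free and tool-access regions, though the actual WTU-Eval prompts and region definitions are not reproduced here.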