← Back to Paper List

When2Call: When (not) to Call Tools

Hayley Ross, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara
Harvard University, NVIDIA
arXiv (2025)
Agent Benchmark Factuality RL

📝 Paper Summary

Tool-use post-training Benchmark datasets
When2Call is a benchmark and training regime that evaluates and improves an LLM's ability to decide *whether* to call a tool, ask for clarification, or admit inability, rather than just evaluating tool-call correctness.
Core Problem
Existing benchmarks (like BFCL) focus on whether a model calls the *correct* tool with correct parameters, ignoring scenarios where the model should *not* call a tool due to missing tools or missing parameters.
Why it matters:
  • Models often hallucinate tool calls when appropriate tools are unavailable (e.g., retrieving grades from a student record database that doesn't contain grades)
  • Models fail to ask follow-up questions when user prompts lack required parameters, instead hallucinating values
  • Current benchmarks only check if a call is generated or not in irrelevant cases, without distinguishing between correct refusal and hallucinated answers
Concrete Example: A user asks an LLM with access only to a 'Student Records' database (containing names/IDs) to 'Get student grades'. Current models might hallucinate a 'get_grades' tool or hallucinate the grades directly, rather than stating the tool cannot answer the question.
Key Novelty
Tool-Calling Decision Benchmark & Preference Optimization
  • Reformulates tool-calling evaluation as a multiple-choice task with four distinct behaviors: Direct Answer, Tool Call, Follow-up Question, and Unable to Answer
  • Introduces Reward-Aware Preference Optimization (RPO) training using negative examples (incorrect behaviors) to teach models when *not* to call tools without degrading standard tool-calling performance
Architecture
Architecture Figure Figure 1
An example scenario illustrating the four types of responses in When2Call: Direct Answer (hallucinated), Tool Call (hallucinated due to tool mismatch), Follow-up Question (asking for missing param), and Unable to Answer (correct behavior when tools don't match).
Evaluation Highlights
  • Mistral-NeMo-Minitron-8B trained with RPO improves +8.6% on When2Call accuracy compared to standard SFT
  • RPO training achieves 87.1% on BFCL Irrelevance (detecting when no tool applies), significantly outperforming Llama 3.1 8B Instruct (56.0%)
  • Community models like Llama 3.1 8B struggle with 'Unable to Answer' scenarios, often scoring near 0% accuracy in that specific category
Breakthrough Assessment
7/10
Addresses a critical, overlooked gap in agentic AI (abstention/clarification). The multiple-choice formulation simplifies evaluation, and the RPO results show strong improvements. Usefulness depends on community adoption over existing standards like BFCL.
×