When2Call: When (not) to Call Tools

📝 Paper Summary

Tool-use post-training, Benchmark datasets

When2Call is a benchmark and training regime that evaluates and improves an LLM's ability to decide *whether* to call a tool, ask for clarification, or admit inability, rather than just evaluating tool-call correctness.

Core Problem

Existing benchmarks (like BFCL) focus on whether a model calls the *correct* tool with correct parameters, ignoring scenarios where the model should *not* call a tool due to missing tools or missing parameters.

Why it matters:

Models often hallucinate tool calls when appropriate tools are unavailable (e.g., retrieving grades from a student record database that doesn't contain grades)
Models fail to ask follow-up questions when user prompts lack required parameters, instead hallucinating values
In irrelevant-tool cases, current benchmarks only check whether a tool call was generated, without distinguishing a correct refusal from a hallucinated answer

Concrete Example: A user asks an LLM with access only to a 'Student Records' database (containing names/IDs) to 'Get student grades'. Current models might hallucinate a 'get_grades' tool or hallucinate the grades directly, rather than stating the tool cannot answer the question.

Key Novelty

Tool-Calling Decision Benchmark & Preference Optimization

Reformulates tool-calling evaluation as a multiple-choice task with four distinct behaviors: Direct Answer, Tool Call, Follow-up Question, and Unable to Answer
Introduces Reward-Aware Preference Optimization (RPO) training using negative examples (incorrect behaviors) to teach models when *not* to call tools without degrading standard tool-calling performance

Architecture

An example scenario illustrating the four types of responses in When2Call: Direct Answer (hallucinated), Tool Call (hallucinated due to tool mismatch), Follow-up Question (asking for missing param), and Unable to Answer (correct behavior when tools don't match).

Evaluation Highlights

Mistral-NeMo-Minitron-8B trained with RPO improves When2Call accuracy by 8.6 percentage points over standard SFT
RPO training achieves 87.1% on BFCL Irrelevance (detecting when no tool applies), significantly outperforming Llama 3.1 8B Instruct (56.0%)
Community models like Llama 3.1 8B struggle with 'Unable to Answer' scenarios, often scoring near 0% accuracy in that specific category

Breakthrough Assessment

7/10

Addresses a critical, overlooked gap in agentic AI (abstention/clarification). The multiple-choice formulation simplifies evaluation, and the RPO results show strong improvements. Usefulness depends on community adoption over existing standards like BFCL.

⚙️ Technical Details

Problem Definition

Setting: Tool-use decision making where the agent must select the appropriate action given a user query and a set of tool definitions

Inputs: User query q, List of tool specifications T (JSON schemas)

Outputs: Action selection: (a) Direct Answer, (b) Tool Call (JSON), (c) Follow-up Question, or (d) Unable to Answer
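
As a rough illustration of this setting, one decision problem could be represented as below. The class, field, and tool names are hypothetical and chosen for readability; they are not the benchmark's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class Action(Enum):
    """The four behaviors the model must choose between."""
    DIRECT_ANSWER = "direct_answer"
    TOOL_CALL = "tool_call"
    FOLLOW_UP = "follow_up_question"
    UNABLE = "unable_to_answer"


@dataclass
class DecisionExample:
    """One item: a query q, tool specifications T, and one candidate per behavior."""
    query: str
    tools: list[dict[str, Any]]      # JSON-schema style tool specifications
    candidates: dict[Action, str]    # candidate response text for each behavior
    correct: Action                  # the behavior that should be selected


example = DecisionExample(
    query="Get student grades",
    tools=[{
        "name": "student_records_lookup",  # hypothetical tool: names and IDs only
        "description": "Look up a student's name and ID.",
        "parameters": {
            "type": "object",
            "properties": {"student_id": {"type": "string"}},
            "required": ["student_id"],
        },
    }],
    candidates={
        Action.DIRECT_ANSWER: "The student has an A in every class.",
        Action.TOOL_CALL: '{"name": "get_grades", "arguments": {}}',
        Action.FOLLOW_UP: "Which student are you asking about?",
        Action.UNABLE: "I cannot retrieve grades with the tools available.",
    },
    correct=Action.UNABLE,
)
```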

Pipeline Flow

Input Processing: User query + Tool Definitions
Model Inference: Generate log-probabilities for 4 options
Decision: Select option with highest probability
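
The three steps above can be sketched as a single function. This is a minimal sketch, assuming a scoring callable `score_fn(prompt, candidate)` that returns the log-probability of a candidate response; `choose_action`, `toy_score_fn`, and the toy scores are hypothetical stand-ins for a real model call.

```python
from typing import Callable


def choose_action(
    prompt: str,
    candidates: dict[str, str],
    score_fn: Callable[[str, str], float],
) -> str:
    """Score every candidate response and return the highest-scoring behavior."""
    scores = {name: score_fn(prompt, text) for name, text in candidates.items()}
    return max(scores, key=scores.get)


candidates = {
    "direct_answer": "Alice has an A in math.",
    "tool_call": '{"name": "get_grades", "arguments": {"student": "Alice"}}',
    "follow_up_question": "Which student do you mean?",
    "unable_to_answer": "I cannot retrieve grades with the tools I have.",
}

# Toy log-probabilities standing in for a real model (assumed values).
toy_log_probs = {
    candidates["direct_answer"]: -9.0,
    candidates["tool_call"]: -7.5,
    candidates["follow_up_question"]: -8.2,
    candidates["unable_to_answer"]: -6.1,
}


def toy_score_fn(prompt: str, candidate: str) -> float:
    """Hypothetical scorer: looks up a fixed log-probability, ignoring the prompt."""
    return toy_log_probs[candidate]


decision = choose_action("Get student grades", candidates, toy_score_fn)
```

With these toy scores the selected behavior is `unable_to_answer`, since it has the highest log-probability.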

System Modules

Input Formatter

Formats tool specifications and query into the model-specific system prompt structure

Model or implementation: N/A (Deterministic)
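
A deterministic formatter of this kind might look like the sketch below. The prompt template is an assumption made for illustration; each evaluated model would use its own system prompt and chat template, which this summary does not specify.

```python
import json


def format_prompt(query: str, tools: list[dict]) -> str:
    """Serialize tool specifications and the user query into one prompt string.

    Keys are sorted so the same inputs always produce the same prompt.
    """
    tool_block = "\n".join(json.dumps(tool, sort_keys=True) for tool in tools)
    return (
        "You have access to the following tools:\n"
        f"{tool_block}\n\n"
        f"User: {query}\n"
        "Assistant:"
    )


tools = [{
    "name": "student_records_lookup",  # hypothetical tool
    "description": "Look up a student's name and ID.",
    "parameters": {"type": "object", "properties": {"student_id": {"type": "string"}}},
}]
prompt = format_prompt("Get student grades", tools)
```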

Choice Selector

Calculates log-probabilities for the four pre-defined answer strings

Model or implementation: Evaluated Model (e.g., Llama 3.1, Mistral-NeMo-Minitron)
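
To make the scoring step concrete, the sketch below computes the log-probability of an answer string from per-position logits. It assumes the logits for each answer token have already been obtained from a forward pass of the evaluated model; the function names and toy inputs are hypothetical.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_log_prob(step_logits: list[list[float]], token_ids: list[int]) -> float:
    """Sum of log P(token_t | prompt, earlier answer tokens) over the answer."""
    return sum(
        log_softmax(logits)[token]
        for logits, token in zip(step_logits, token_ids)
    )


# Toy case: a 4-token vocabulary with uniform logits, and a 2-token answer.
uniform = [0.0, 0.0, 0.0, 0.0]
score = sequence_log_prob([uniform, uniform], [1, 3])
```

In the toy case each token has probability 0.25, so the score is `2 * log(0.25)`. Repeating this for each of the four answer strings gives the scores that the decision step compares.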

Novel Architectural Elements

Formulating the tool-calling decision process explicitly as a multiple-choice classification task (Direct vs. Tool vs. Follow-up vs. Refusal) during evaluation and preference training

Modeling

Base Model: Mistral-NeMo-Minitron-4B-Base and 8B-Base

Training Method: Supervised Fine-Tuning (SFT) followed by Reward-Aware Preference Optimization (RPO)

Objective Functions:

Purpose: Maximize likelihood of correct tool use behavior.

Formally: Standard Cross-Entropy Loss (SFT)

Purpose: Align model to prefer correct behavior (e.g., asking follow-up) over incorrect behavior (e.g., hallucinating params).

Formally: RPO Loss (uses chosen/rejected pairs)
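
The summary does not spell out either loss, so the following is only a hedged sketch. The SFT term is the usual mean negative log-likelihood. For RPO, the sketch assumes one possible formulation in which the policy's implicit reward margin (relative to a reference model) is pulled toward a target reward margin using a squared distance; the distance choice, the symbols `beta` and `eta`, and the reward values are all assumptions, and the default `beta=0.05` simply mirrors the listed `rpo_kl_penalty` for illustration.

```python
import math


def sft_loss(token_log_probs: list[float]) -> float:
    """Cross-entropy for one target sequence: mean negative log-likelihood."""
    return -sum(token_log_probs) / len(token_log_probs)


def implicit_margin(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float,
) -> float:
    """Implicit reward of chosen minus rejected, measured against the reference."""
    return beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))


def rpo_loss_squared(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    reward_chosen: float,
    reward_rejected: float,
    beta: float = 0.05,  # assumed to correspond to the KL penalty
    eta: float = 1.0,    # assumed scale on the target reward gap
) -> float:
    """Squared distance between the implicit margin and the target reward gap."""
    margin = implicit_margin(
        logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta
    )
    target = eta * (reward_chosen - reward_rejected)
    return (margin - target) ** 2


nll = sft_loss([math.log(0.5), math.log(0.5)])
# Policy identical to the reference gives a zero margin, so the loss is the squared target gap.
pair_loss = rpo_loss_squared(-5.0, -6.0, -5.0, -6.0, reward_chosen=1.0, reward_rejected=0.0)
```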

Adaptation: Full fine-tuning

Trainable Parameters: All parameters

Training Data:

SFT Blend: Public tool-calling datasets (xLAM, Glaive) + Synthesized When2Call data (2:1 ratio of tool-calling to non-tool-calling)
RPO Data: Pairs where 'Chosen' is the correct action and 'Rejected' is a sampled incorrect action (e.g., hallucinating a tool)
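
The pair construction described above can be sketched as follows. This assumes the rejected response is sampled uniformly from the three incorrect behaviors, which the summary does not confirm; the function and key names are hypothetical.

```python
import random


def build_preference_pair(
    prompt: str,
    candidates: dict[str, str],
    correct: str,
    rng: random.Random,
) -> dict[str, str]:
    """Pair the correct response with one sampled incorrect response."""
    wrong_behaviors = sorted(name for name in candidates if name != correct)
    rejected_behavior = rng.choice(wrong_behaviors)  # assumed: uniform sampling
    return {
        "prompt": prompt,
        "chosen": candidates[correct],
        "rejected": candidates[rejected_behavior],
    }


candidates = {
    "direct_answer": "The student has an A.",
    "tool_call": '{"name": "get_grades", "arguments": {}}',
    "follow_up_question": "Which student do you mean?",
    "unable_to_answer": "I cannot retrieve grades with the tools available.",
}
pair = build_preference_pair(
    "Get student grades", candidates, "unable_to_answer", random.Random(0)
)
```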

Key Hyperparameters:

learning_rate_sft_4b: 5e-6
learning_rate_sft_8b: 4e-6
learning_rate_rpo_4b: 9e-7
learning_rate_rpo_8b: 7e-7
rpo_kl_penalty: 0.05
rpo_warmup_steps: 10
sft_warmup: No warm-up

Compute: 8x NVIDIA H100 GPU nodes, ~3-4 hours per model

Comparison to Prior Work

vs. BFCL: Explicitly evaluates 'when NOT to call' and 'ask for info' via multiple choice, whereas BFCL mostly checks JSON correctness
vs. ToolSandbox: When2Call uses multiple-choice to decouple decision-making from generation quality, avoiding parsing errors
vs. Refusal Benchmarks (e.g., R-Bench) [not cited in paper]: Focuses specifically on tool-related refusals (missing params/tools) rather than safety refusals

Limitations

Multiple-choice format might not perfectly reflect open-ended generation performance in the wild
Evaluation relies on log-probabilities, which requires access to model logits (harder for black-box APIs)
Synthetic data generation for 'unable to answer' scenarios relies on Mixtral 8x22B, which may introduce biases
Performance on the 'Unable to Answer' category remains low for many community models even after training

Reproducibility

Code: https://github.com/NVIDIA/When2Call

The code is publicly available and includes the benchmark dataset, training data, and evaluation scripts. The paper text does not explicitly link trained Mistral-NeMo-Minitron model weights.

📊 Experiments & Results

Evaluation Setup

Multiple-choice classification of tool-use behavior

Benchmarks:

When2Call (Multiple-choice decision making for tool use) [New]
BFCL v2 Live (Berkeley Function Calling Leaderboard) (Tool call generation accuracy)

Metrics:

Accuracy
Length-normalized Accuracy
F1 Score
Statistical methodology: Not explicitly reported in the paper
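
The metrics above can be sketched in a few lines. Two assumptions are made here: length normalization is taken to mean dividing each option's log-probability by its length in tokens before picking the maximum, and F1 is computed as a macro average over the behavior classes. The summary confirms neither detail, and the function names are hypothetical.

```python
from typing import Optional


def pick_option(
    log_probs: dict[str, float],
    lengths: Optional[dict[str, int]] = None,
) -> str:
    """Pick the best option, optionally normalizing each score by option length."""
    if lengths is None:
        scores = log_probs
    else:
        scores = {name: lp / lengths[name] for name, lp in log_probs.items()}
    return max(scores, key=scores.get)


def accuracy(preds: list[str], golds: list[str]) -> float:
    """Fraction of items where the predicted behavior matches the gold behavior."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def macro_f1(preds: list[str], golds: list[str], labels: list[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        tp = sum(p == label and g == label for p, g in zip(preds, golds))
        fp = sum(p == label and g != label for p, g in zip(preds, golds))
        fn = sum(p != label and g == label for p, g in zip(preds, golds))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(labels)


# A long option can lose on raw log-probability but win once normalized.
raw_choice = pick_option({"short": -10.0, "long": -12.0})
norm_choice = pick_option({"short": -10.0, "long": -12.0}, {"short": 2, "long": 4})
```

In the toy case the raw choice is `short` (-10 beats -12), while the normalized choice is `long` (-3 per token beats -5 per token).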

Key Results

Results on the new When2Call benchmark show that RPO training significantly improves decision-making accuracy compared to baselines and standard SFT. Performance on BFCL shows that When2Call training improves irrelevance detection without destroying tool-calling ability.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pts)
When2Call	Accuracy	83.1	91.7	+8.6
When2Call	Accuracy	57.7	91.7	+34.0
BFCL Live (Irrelevance)	Accuracy	46.2	87.1	+40.9
BFCL Live (AST)	Accuracy	84.9	85.9	+1.0
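
The Δ column is an absolute difference in accuracy, which a short check confirms. The values are copied from the rows above; the variable names are illustrative.

```python
# (benchmark, baseline accuracy, accuracy reported for this paper)
rows = [
    ("When2Call", 83.1, 91.7),
    ("When2Call", 57.7, 91.7),
    ("BFCL Live (Irrelevance)", 46.2, 87.1),
    ("BFCL Live (AST)", 84.9, 85.9),
]

# Round to one decimal place to avoid floating-point noise in the subtraction.
deltas = [round(ours - baseline, 1) for _, baseline, ours in rows]
```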

Main Takeaways

Community models (Llama 3, Qwen 2.5) are 'trigger-happy'—they often hallucinate tool calls when they should refuse or ask for info.
RPO training is superior to SFT for this task; SFT alone improved When2Call scores but degraded standard tool-calling accuracy (BFCL AST) by making the model too conservative.
The 'Unable to Answer' category is the hardest for current models, often showing near-zero accuracy in off-the-shelf models.
Performance does not scale linearly with model size; Qwen 2.5 72B did not strictly outperform smaller variants on all splits.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM tool calling / function calling mechanisms
Familiarity with preference optimization (DPO/RPO)
Basic knowledge of evaluation metrics (Accuracy, F1)

Key Terms

BFCL: Berkeley Function Calling Leaderboard—a standard benchmark for evaluating the accuracy of LLM tool calls

RPO: Reward-Aware Preference Optimization—an alignment technique that optimizes models based on preference pairs (chosen vs. rejected) to maximize a reward signal

SFT: Supervised Fine-Tuning—training a model on a dataset of input-output pairs to teach it specific behaviors or formats

Tool Hallucination: When a model generates a call for a tool that was not provided in the system prompt

NeMo-Aligner: A scalable toolkit by NVIDIA for model alignment using techniques like SFT, RLHF, and DPO

Log-probability: The logarithm of the probability assigned by the model to a token; used here to score multiple-choice answers without requiring generation

APIGen: A dataset of synthetic API calls and corresponding natural language queries used for training tool-using models