AskToAct: Enhancing LLMs Tool Use via Self-Correcting Clarification

📝 Paper Summary

Multi-turn w. user interactions Tool-use post-training

AskToAct improves tool-using agents by automatically generating clarification data from tool parameters and training models to self-correct errors during clarification dialogues.

Core Problem

Real-world user queries for tools are often ambiguous or incomplete, but current LLMs struggle to clarify intent effectively and lack mechanisms to recover from errors during multi-turn interactions.

Why it matters:

Current approaches rely on small, manually annotated datasets that fail to capture the diversity of real-world ambiguity.
Without error recovery, models accumulate mistakes (e.g., asking redundant questions) in multi-turn dialogues, degrading user experience and tool invocation accuracy.
Unspecified queries pose safety risks if models hallucinate parameters instead of asking for clarification.

Concrete Example: A user asks 'Book a hotel.' A standard model might hallucinate a location or date. A clarification model might ask 'Where?', but if it then forgets the answer and asks 'Where?' again (redundant clarification), the interaction fails. AskToAct detects this redundancy and corrects it.

Key Novelty

Self-Correcting Clarification via Reverse-Engineering Tool Parameters

Reverse-engineers ambiguous queries by removing parameters from valid tool calls, using the removed parameters as ground truth for what needs clarification.
Injects synthetic errors (e.g., redundant questions) into training dialogues and trains the model to detect and correct them, enabling robustness.
Uses selective masking during training to learn from error-correction pairs without reinforcing the erroneous behavior itself.

Architecture

The overall AskToAct framework, illustrating the data construction pipeline (left) and the self-correction training paradigm (right).

Evaluation Highlights

Recovers 57.08% of unspecified intents (Intent Recovery Rate), significantly outperforming baselines like GPT-4 (54.79%) and ToolLLaMA (35.84%).
Improves clarification efficiency by reducing the average number of turns by 10.46% compared to the base model while maintaining high accuracy.
Achieves 81.65% tool selection accuracy and 68.31% parameter resolution accuracy, surpassing previous clarification-focused methods.

Breakthrough Assessment

8/10

Offers a scalable, automated solution to the data scarcity problem in clarification research and introduces a novel self-correction mechanism that directly addresses error accumulation.

⚙️ Technical Details

Problem Definition

Setting: Tool-use dialogue where the agent must identify missing parameters in a user query q', ask clarification questions, and formulate a final tool call S.

Inputs: Unspecified or ambiguous user query q'

Outputs: Clarification questions qc or final tool invocation solution S = {(fi, Pi)}

Pipeline Flow

Input Processing: User Query Analysis
Clarification Loop: Error Detection -> Question Generation -> User Response
Tool Execution: Final Parameter Generation

System Modules

Clarification Agent

Analyzes query, detects missing parameters, and generates clarification questions while monitoring for errors.

Model or implementation: Llama-3-8B-Instruct (fine-tuned)

Novel Architectural Elements

Self-correction training paradigm: The model is trained to recognize specific error tokens (<SOE>, <EOE>) and generate corrections within the dialogue flow, effectively acting as its own critic during inference.

Modeling

Base Model: Llama-3-8B-Instruct

Training Method: Supervised Fine-Tuning (SFT) with Selective Masking

Objective Functions:

Purpose: Minimize negative log-likelihood on target tokens while ignoring masked error segments.

Formally: Standard cross-entropy loss applied only to non-masked tokens (valid responses and corrections).

Adaptation: Full fine-tuning

Training Data:

35,261 unspecified queries generated via parameter removal from ToolBench/API-Bank data.
15,156 augmented dialogues containing error-correction pairs (redundant, irrelevant, imprecise, etc.).
Train/Test split not explicitly detailed as a ratio, but evaluation uses separate sets.

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 128
epochs: 2
+ 3 more
max_length: 4096
warmup_ratio: 0.03
weight_decay: 0.0

Compute: Training performed on 8 NVIDIA A800 GPUs. Training time not reported.

Comparison to Prior Work

vs. ToolLLaMA: AskToAct explicitly models the clarification process and error correction, whereas ToolLLaMA assumes specified queries or relies on implicit capability.
vs. GPT-4: AskToAct achieves comparable or better performance (57.08% vs 54.79% IRR) with a much smaller 8B model via specialized training.
vs. Existing Clarification Methods (e.g., Qian et al. 2024): AskToAct automates data generation (vs. manual) and introduces error correction to handle multi-turn drift.

Limitations

Relies on the quality of the underlying tool datasets (ToolBench, API-Bank) for ground truth.
Error correction types are predefined (5 types), which may not cover all possible hallucination modes.
Evaluated primarily on English language queries.
Performance depends on the accuracy of the automated user simulator during training data generation.

Reproducibility

Code: https://github.com/zjunlp/AskToAct

Code is publicly available at https://github.com/zjunlp/AskToAct. Data construction pipeline is detailed in the paper. Base models (Llama-3) are open weights.

📊 Experiments & Results

Evaluation Setup

Tool use with ambiguous queries requiring multi-turn clarification.

Benchmarks:

AskToAct-Eval (Clarification and Tool Use) [New]

Metrics:

Intent Recovery Rate (IRR)
Tool Selection Accuracy (TSA)
Parameter Resolution Accuracy (PRA)
Average Turn Number (Avg Turn)
Win Rate (via GPT-4 judge)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison on AskToAct-Eval benchmark showing superiority over base Llama-3 and ToolLLaMA.
AskToAct-Eval	Intent Recovery Rate (IRR)	45.00	57.08	+12.08
AskToAct-Eval	Intent Recovery Rate (IRR)	35.84	57.08	+21.24
AskToAct-Eval	Intent Recovery Rate (IRR)	54.79	57.08	+2.29
AskToAct-Eval	Average Turn Number	4.11	3.68	-0.43
Generalization to Unseen APIs (Level-3) demonstrates robustness.
AskToAct-Eval (Level-3)	Intent Recovery Rate (IRR)	41.67	51.19	+9.52

Experiment Figures

Win rate of AskToAct versus baselines (Llama-3-8B-Instruct, ToolLLaMA) as judged by GPT-4 across different capability dimensions.

Main Takeaways

AskToAct significantly outperforms general-purpose models (Llama-3, ToolLLaMA) in handling ambiguous queries, recovering more intent with fewer turns.
The self-correction mechanism is effective; removing error-correction data in ablations leads to lower Intent Recovery Rates.
The method generalizes well to unseen APIs and more complex multi-turn scenarios (Level-3 complexity).
Automated data construction via reverse-engineering is a viable path to creating large-scale, high-quality clarification datasets without human annotation.

📚 Prerequisite Knowledge

Prerequisites

Tool learning / API calling with LLMs
Supervised Fine-Tuning (SFT)
Dialogue state tracking concepts

Key Terms

Clarification: The process where an agent asks the user questions to resolve ambiguity or missing information needed for a tool call.

Intent Recovery Rate (IRR): A metric measuring the percentage of initially missing parameters that are successfully recovered through the clarification dialogue.

Tool Selection Accuracy (TSA): The accuracy with which the model selects the correct API tool to address the user's intent.

Parameter Resolution Accuracy (PRA): The accuracy with which the model extracts and fills the correct values for API parameters.

Selective Masking: A training technique where specific tokens (like error segments) are masked out during loss computation so the model learns to correct them without learning to generate them.

Unspecified Query: A user query that lacks necessary parameters (e.g., location, date) required to successfully execute a specific tool or API.