Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs

📝 Paper Summary

Agentic AI, Tool Learning, Long-context reasoning

Tool-DC improves LLM tool-calling by splitting candidate tools into smaller groups for parallel inference, verifying results against schema constraints, and aggregating valid outputs for a final refined decision.

Core Problem

LLMs struggle to select correct tools from massive, noisy candidate lists because long contexts dilute reasoning signals and semantically similar tools cause confusion.

Why it matters:

Current methods rely on retrievers; if the retriever misses the 'golden' tool, the LLM fails immediately
Scaling candidate tools (e.g., from <10 to 20) causes significant performance degradation in existing models due to context length and noise
Existing error-checking methods rely on rigid, manually defined checklists that lack flexibility

Concrete Example: When the number of candidate tools scales from <10 to 20, the performance of standard models drops significantly (e.g., Qwen2.5-1.5B drops by over 25 points) because the model cannot effectively filter irrelevant tools or distinguish between tools with similar semantics but different arguments.

Key Novelty

Divide-and-Conquer with Try-Check-Retry

Decomposes the search space into smaller 'anchor groups' (Try) to reduce reasoning difficulty and allow parallel processing
Filters hallucinations using a rule-based validator (Check) that enforces strict schema compliance (function names, argument keys, data types)
Aggregates only the validated candidates for a final, self-reflected decision (Retry), effectively removing noise before the final generation

Architecture

The overall framework of Tool-DC showing the two variants: Training-Free (TF) and Training-Based (TB). It depicts the flow from input query to final tool call.

Evaluation Highlights

Tool-DC (TF) achieves a +25.10-point average gain (29.48 to 54.58) over the 'All Functions' baseline on the extended setting (20 candidate tools) using Qwen2.5-1.5B
Tool-DC (TB) enables Qwen2.5-7B-Instruct to achieve an 83.16% overall score on the BFCL benchmark, outperforming proprietary models like OpenAI o3 and Claude-Haiku-4.5
In the standard setting, Tool-DC (TF) improves Qwen2.5-1.5B's average score by +4.61 points (55.20 to 59.81) compared to using all functions

Breakthrough Assessment

8/10

Significant performance jumps on noisy, long-context settings and impressive distillation results where a 7B model beats proprietary giants. The method is logically sound and addresses a key bottleneck in agentic systems.

⚙️ Technical Details

Problem Definition

Setting: Given a query q and a large library of N candidate tools T, generate a sequence of valid tool invocations y = (t, alpha)

Inputs: User query q, Set of N candidate tools T

Outputs: Tool invocation y* consisting of selected tool t and arguments alpha

Pipeline Flow

Preprocessing: Retrieval & Grouping (Try)
Local Inference: Parallel Generation (Try)
Validation: Schema Checking (Check)
Global Decision: Aggregation & Final Call (Retry)
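
The four stages above can be sketched as a single function. This is a minimal illustration, not the paper's implementation: the names `try_check_retry`, `propose`, and `is_valid` are hypothetical, the LLM call and the validator are passed in as plain callables, and the choice to carry forward only the tool named by each validated draft is an assumption about how aggregation works.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

ToolCall = dict  # hypothetical shape: {"name": str, "arguments": dict}


def try_check_retry(
    query: str,
    groups: list[list[dict]],
    propose: Callable[[str, list[dict]], Optional[ToolCall]],
    is_valid: Callable[[ToolCall, list[dict]], bool],
) -> Optional[ToolCall]:
    """Run local inference per group, validate the drafts, then decide globally."""
    # Try: one local inference per group, executed in parallel.
    with ThreadPoolExecutor() as pool:
        drafts = list(pool.map(lambda group: propose(query, group), groups))

    # Check: keep only the tools whose draft call passed validation.
    survivors: list[dict] = []
    for group, call in zip(groups, drafts):
        if call is not None and is_valid(call, group):
            survivors.extend(t for t in group if t["name"] == call["name"])

    if not survivors:
        return None  # nothing validated, so there is nothing to refine

    # Retry: a final call that sees the validated candidates only.
    return propose(query, survivors)
```

Passing the model and validator as callables keeps the control flow testable with stubs; in practice `propose` would wrap an LLM request and `is_valid` a schema checker.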

System Modules

Retriever (Preprocessing)

Select top-K relevant tools to seed the grouping process

Model or implementation: BM25
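
As an illustration of how BM25 could rank tool descriptions against a query, here is a small self-contained scorer. It is a sketch under stated assumptions: whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and the function name `bm25_rank` are choices made here, not details taken from the paper.

```python
import math
from collections import Counter


def bm25_rank(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Return document indices sorted by descending BM25 score."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / max(n_docs, 1)
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in tokenized for term in set(d))

    def score(doc: list[str]) -> float:
        tf = Counter(doc)
        total = 0.0
        for term in set(query.lower().split()):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes long descriptions.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    scores = [score(d) for d in tokenized]
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
```

Taking the first K indices of the returned ranking gives the seed tools for grouping.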

Anchor Grouper (Preprocessing)

Divide total tools into manageable subspaces (groups) to ensure coverage and reduce noise

Model or implementation: Algorithmic splitting
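
One plausible reading of the grouping step, given a relevance-ranked tool list: the top K tools each anchor one group, and the remaining tools are spread across the groups. The round-robin assignment and the name `anchor_groups` are assumptions for illustration; the summary does not specify how non-anchor tools are assigned.

```python
def anchor_groups(ranked_tools: list[str], max_groups: int = 5) -> list[list[str]]:
    """Split a relevance-ranked tool list into K = min(max_groups, N) groups."""
    k = min(max_groups, len(ranked_tools))
    if k == 0:
        return []
    # Each of the top-k tools becomes the anchor of its own group.
    groups = [[anchor] for anchor in ranked_tools[:k]]
    # Assumption: the lower-ranked tools are dealt round-robin across groups,
    # so every tool in the pool is covered exactly once.
    for i, tool in enumerate(ranked_tools[k:]):
        groups[i % k].append(tool)
    return groups
```

With 20 candidate tools and K = 5, every group holds four tools, which keeps each local context short while still covering the whole pool.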

Local Inference

Generate initial tool calls for each subspace in parallel

Model or implementation: LLM (e.g., Qwen2.5)

Consistency Validator

Filter invalid hallucinations using schema constraints

Model or implementation: Rule-based engine
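
A rule-based check of this kind can be sketched as follows. The tool format (a `name` plus JSON-Schema-style `parameters` with `properties` and `required`) and the function name `is_valid_call` are assumptions made for the example; the paper's validator may apply additional rules.

```python
from typing import Any

# JSON-Schema type names mapped to Python types (assumed tool format).
_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def is_valid_call(call: dict[str, Any], tools: list[dict[str, Any]]) -> bool:
    """Check function name, argument keys, and argument types against the schema."""
    schema = next((t for t in tools if t["name"] == call.get("name")), None)
    if schema is None:
        return False  # hallucinated function name
    params = schema.get("parameters", {})
    props = params.get("properties", {})
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return False
    if any(key not in props for key in args):
        return False  # argument key not defined by the tool
    if any(key not in args for key in params.get("required", [])):
        return False  # required argument missing
    for key, value in args.items():
        expected = _TYPES.get(props[key].get("type"))
        if expected is None:
            continue  # unknown or missing type: not checked in this sketch
        if isinstance(value, bool) and expected is not bool:
            return False  # bool is a subclass of int in Python, so reject it explicitly
        if not isinstance(value, expected):
            return False
    return True
```

Because the check only reads the schema, it runs before any tool is executed and costs no extra model calls.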

Global Refinement

Make the final tool call using only the validated candidates

Model or implementation: LLM (e.g., Qwen2.5)

Novel Architectural Elements

Strategic Anchor Grouping: A specific method of creating parallel inference batches where 'anchor' (high relevance) tools are paired with 'distractor' (low relevance) tools to ensure the model isn't overwhelmed by similar high-relevance tools in one context
Try-Check-Retry Feedback Loop: A formalized inference pipeline that explicitly validates tool calls against a schema before feeding them back into the context for a final decision

Modeling

Base Model: Qwen2.5 family (1.5B, 3B, 7B), Llama-3 series, Qwen3-4B

Training Method: Supervised Fine-Tuning (SFT) on synthesized Chain-of-Thought (CoT) data

Objective Functions:

Purpose: Optimize model to generate correct reasoning traces and tool calls.

Formally: Minimize negative log-likelihood of reasoning traces (r) and final invocation (y*) given query (q) and tools (T).
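
Written out, the objective described above is the standard SFT loss. The notation below is an assumption for illustration (the summary does not give the paper's exact symbols): theta denotes the model parameters, D the synthesized training set, and s the concatenation of the reasoning trace r and the final invocation y*.

```latex
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{(q,\,T,\,r,\,y^{*}) \sim \mathcal{D}}
    \left[ \log p_{\theta}\!\left(r,\, y^{*} \mid q,\, T\right) \right]
  = -\,\mathbb{E}_{(q,\,T,\,r,\,y^{*}) \sim \mathcal{D}}
    \left[ \sum_{i=1}^{|s|} \log p_{\theta}\!\left(s_{i} \mid s_{<i},\, q,\, T\right) \right],
\qquad s = r \oplus y^{*}
```

The second form is the usual token-level factorization: the loss is summed over the tokens of the trace and the call, conditioned on the query and the tool list.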

Training Data:

Source: xlam-function-calling-60k dataset
Process: Apply Tool-DC (TF) pipeline (Try-Check-Retry) to the raw data to collect correct reasoning trajectories and successful tool invocations as CoT examples
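
The collection step amounts to rejection filtering: run the pipeline on each raw sample and keep the trajectory only when the final call matches the reference. The sketch below assumes hypothetical field names (`query`, `tools`, `answer`) and a hypothetical `run_pipeline` callable that returns a reasoning trace and a tool call; none of these names come from the paper or the dataset.

```python
from typing import Any, Callable


def collect_sft_examples(
    samples: list[dict[str, Any]],
    run_pipeline: Callable[[str, list], tuple[str, dict]],
) -> list[dict[str, Any]]:
    """Keep only trajectories whose final tool call equals the reference answer."""
    examples = []
    for sample in samples:
        trace, call = run_pipeline(sample["query"], sample["tools"])
        if call == sample["answer"]:  # discard unsuccessful trajectories
            examples.append(
                {
                    "query": sample["query"],
                    "tools": sample["tools"],
                    "reasoning": trace,
                    "call": call,
                }
            )
    return examples
```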

Key Hyperparameters:

K (number of groups): min(5, N)
retriever: BM25

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. HiTEC-ICL: Tool-DC uses dynamic schema validation instead of static manual checklists and handles long contexts via grouping
vs. ToolGT: Tool-DC explicitly divides the search space and validates candidates, whereas ToolGT focuses on prompting strategies within a single context
vs. Top-K Retrieval: Tool-DC includes 'distractor' tools in groups to ensure recall (via anchor grouping) and uses a retry mechanism, avoiding failure when the retriever misses the golden tool
vs. Rest-bench [not cited in paper]: Rest-bench uses retry logic based on execution feedback, while Tool-DC validates based on schema constraints before execution

Limitations

Tool-DC (TF) increases inference latency due to multiple forward passes (parallel local inference + global refinement)
Performance depends on the quality of the consistency validator; if schema checks are too loose, hallucinations persist
The approach assumes that the correct tool is present in the initial large pool and can be retrieved or covered by the grouping strategy

Reproducibility

Code availability is not explicitly provided in the text. The method uses standard open-source models (Qwen, Llama) and standard libraries (BM25, LLaMA-Factory). Detailed prompt templates are referenced in Appendix A.7.

📊 Experiments & Results

Evaluation Setup

Tool-calling evaluation on synthetic and hand-crafted datasets, measuring strict code match.

Benchmarks:

BFCL (Berkeley Function-Calling Leaderboard): tool/function calling, Live and Non-Live subsets
ACEBench: tool calling, Normal split

Metrics:

AST Exact-Match Accuracy
Statistical methodology: Not explicitly reported in the paper
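
To make the AST exact-match idea concrete, the sketch below parses two Python-style call strings and compares function name and keyword arguments, ignoring argument order. It is a simplified stand-in rather than the benchmark's actual checker, which applies further rules; the helper names are hypothetical, positional arguments are rejected for brevity, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


def parse_call(source: str) -> tuple[str, dict]:
    """Parse 'func(a=1, b="x")' into (function name, keyword-argument dict)."""
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    if node.args:
        raise ValueError("positional arguments are not supported in this sketch")
    name = ast.unparse(node.func)
    # literal_eval accepts AST nodes and rejects anything that is not a literal.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def ast_match(predicted: str, reference: str) -> bool:
    """True when both strings parse to the same call; malformed input never matches."""
    try:
        return parse_call(predicted) == parse_call(reference)
    except (SyntaxError, ValueError, TypeError):
        return False


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match their reference call."""
    hits = sum(ast_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Comparing parsed structures instead of raw strings means differences in whitespace, quoting style, or keyword order do not count as errors, while a wrong name or value does.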

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Standard Setting Comparison: Evaluating models on the original benchmark tool lists (fewer tools).
BFCL & ACEBench (Standard)	Average Score	55.20	59.81	+4.61
Extended Setting Comparison: Evaluating robustness by scaling candidate tools to 20 (adding noise).
BFCL & ACEBench (Extended)	Average Score	29.48	54.58	+25.10
BFCL (Extended)	Accuracy	See Figure 6 (approx 35-50 depending on K)	See Figure 6 (consistently higher)	Positive gap
Training-Based (TB) Results: Fine-tuning models using the Tool-DC paradigm.
BFCL	Overall Score	82.52	83.16	+0.64
ACEBench	Accuracy	Not explicitly reported as single number	High gain implied	+24.95
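Ablation Studies: Validating the Try-Check-Retry components. In the two rows below, the Baseline column holds the full pipeline's score and the This Paper column holds the score with one component removed.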
BFCL (Extended)	Average Accuracy	64.77	36.79	-27.98
BFCL (Extended)	Average Accuracy	64.77	5.26	-59.51
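
The Δ column reports absolute score differences in points, which is not the same as a relative percentage gain. The short check below recomputes both from the Baseline and This Paper values in the table; the dictionary keys and the `gains` helper are illustrative names only.

```python
# (baseline, this paper) pairs copied from the Key Results table.
rows = {
    "standard_avg": (55.20, 59.81),
    "extended_avg": (29.48, 54.58),
    "bfcl_tb_overall": (82.52, 83.16),
}


def gains(baseline: float, score: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent of baseline)."""
    absolute = round(score - baseline, 2)
    relative = round(100 * (score - baseline) / baseline, 1)
    return absolute, relative


for name, (baseline, score) in rows.items():
    absolute, relative = gains(baseline, score)
    print(f"{name}: +{absolute:.2f} points, +{relative:.1f}% relative")
```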

Experiment Figures

Performance comparison on BFCL across various LLMs (Llama, Gemma, GPT-4o, DeepSeek) using Tool-DC (TF).

Impact of group count K on performance in the Extended Setting.

Main Takeaways

Small models suffer disproportionately from long contexts and noisy tools; Tool-DC (TF) narrows this gap substantially (+25.10 points on the extended setting for Qwen2.5-1.5B).
The 'Retry' stage is the most critical component; without aggregating and re-evaluating, the system collapses (performance drops to ~5%).
Tool-DC (TB) allows open-weights models (e.g., Qwen2.5-7B-Instruct) to match or exceed the performance of closed-source SOTA models (OpenAI o3, Claude-Haiku-4.5) on BFCL by internalizing the verification logic.

📚 Prerequisite Knowledge

Prerequisites

Function Calling / Tool Use in LLMs
Retrieval-Augmented Generation (RAG)
Chain-of-Thought (CoT) prompting
Supervised Fine-Tuning (SFT)

Key Terms

BFCL: Berkeley Function-Calling Leaderboard—a benchmark dataset for evaluating the ability of LLMs to invoke external tools

AST: Abstract Syntax Tree—a tree representation of code structure used here to strictly evaluate if the predicted tool call matches the ground truth syntactically

Schema Constraints: Rules defining valid tool usage, including function names, required argument keys, and data types (string, integer, etc.)

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer

SFT: Supervised Fine-Tuning—training a pre-trained model on a specific labeled dataset to adapt it for a particular task

BM25: Best Matching 25—a probabilistic information retrieval algorithm used to rank documents (or tools) based on keyword matching

Anchor Grouping: A strategy to split a large list of tools into smaller subsets, ensuring the most relevant tools are distributed across groups to avoid missing them

Consistency Validator: A module that checks if a generated tool call strictly adheres to the defined API schema (name, arguments, types)