ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"

📝 Paper Summary

Synthetic data generation · Tool-use post-training · Agentic framework

ToolGrad generates high-quality tool-use datasets by iteratively constructing valid tool execution chains using textual feedback before synthesizing the corresponding user queries, achieving higher efficiency than query-first approaches.

Core Problem

Existing tool-use dataset generation methods first synthesize a user query and then run an agent (for example, one driven by depth-first search, DFS) to search for a solution, which is inefficient and error-prone.

Why it matters:

The standard 'query-first' approach relies on expensive agent exploration (trial and error) to find valid tool paths, wasting computational resources
Agent exploration has no guarantee of success, leading to low pass rates and discarded data during the generation process
Current methods struggle to scale because high-quality human annotation is impractical and agent-based search is computationally costly
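
To make the cost and failure mode of query-first exploration concrete, here is a minimal sketch of a depth-first search over tool sequences under a call budget. It is an illustration only, not the ToolBench implementation: the tool names, the `is_solution` check, and the budget accounting are all hypothetical stand-ins for real LLM and API calls.

```python
from typing import Callable, Optional


def dfs_search(
    path: list[str],
    tools: list[str],
    is_solution: Callable[[list[str]], bool],
    max_depth: int,
    budget: dict[str, int],
) -> Optional[list[str]]:
    """Depth-first search over tool sequences; returns None on failure."""
    if is_solution(path):
        return path
    if len(path) >= max_depth:
        return None  # dead end: backtrack
    for tool in tools:
        if budget["calls_left"] <= 0:
            return None  # budget exhausted before a valid path was found
        budget["calls_left"] -= 1  # each expansion costs one simulated call
        found = dfs_search(path + [tool], tools, is_solution, max_depth, budget)
        if found is not None:
            return found
    return None


# Hypothetical tool set and target chain, for illustration only.
TOOLS = ["search_movie", "get_rating", "get_weather"]
TARGET = ["search_movie", "get_rating"]


def solves(path: list[str]) -> bool:
    return path == TARGET


generous = {"calls_left": 50}
tight = {"calls_left": 1}
result_ok = dfs_search([], TOOLS, solves, max_depth=2, budget=generous)
result_fail = dfs_search([], TOOLS, solves, max_depth=2, budget=tight)
calls_used = 50 - generous["calls_left"]
print(result_ok, result_fail, calls_used)
```

With a tight budget the search returns `None`, and the calls already spent produce no training sample. That is the discarded-data problem described above.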

Concrete Example: In ToolBench, a hypothetical user instruction is generated first, then a DFS agent tries to solve it. This often fails or requires many steps. ToolGrad inverts this: it builds a valid chain of API calls (e.g., searching for a movie, then getting its rating) first, ensuring validity, and *then* writes the user query 'What is the rating of movie X?'.

Key Novelty

Answer-First Iterative Tool-Chain Construction via Textual Gradients

Inverts the standard data generation paradigm by building the 'answer' (valid tool-use chain) first and synthesizing the 'question' (user query) last
Uses an iterative agentic loop inspired by ML optimization: selects APIs to augment the current workflow based on execution feedback (acting as textual 'gradients') rather than random exploration

Architecture

The iterative pipeline of ToolGrad for one generation step.

Evaluation Highlights

Achieved 100% data generation pass rate compared to 63.8% for the DFS-based baseline in ToolBench
Reduced generation cost significantly: 45.9 LLM calls per sample vs. 64.5 for baseline, and <30 tool calls vs. 34.3
Fine-tuned 1B model achieves ~99% tool recall, outperforming proprietary models like GPT-4.1 (~85%) on the test set

Breakthrough Assessment

8/10

Significantly improves data generation efficiency (100% pass rate) and quality. The 'answer-first' paradigm combined with 'textual gradients' is a clever structural innovation for synthetic data.

⚙️ Technical Details

Problem Definition

Setting: Synthetic dataset generation for tool-use LLMs

Inputs: A large-scale API database (approx 8.7k APIs)

Outputs: A dataset D = {(q, W, r)} containing user query q, tool-use workflow W, and final response r
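
One way to picture a single element of D is as a small record type. The sketch below is an assumption about a reasonable in-memory layout, not the released dataset schema: the class names, field names, and example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:
    """One step of the workflow W (hypothetical layout)."""
    api_name: str
    arguments: dict[str, Any]
    output: str


@dataclass
class Sample:
    """One (q, W, r) triple."""
    query: str                 # q: the user query
    workflow: list[ToolCall]   # W: the ordered tool-use chain
    response: str              # r: the final response


sample = Sample(
    query="What is the rating of movie X?",
    workflow=[
        ToolCall("search_movie", {"title": "X"}, "movie_id=123"),
        ToolCall("get_rating", {"movie_id": 123}, "rating=8.1"),
    ],
    response="Movie X has a rating of 8.1.",
)
dataset = [sample]  # D is a collection of such triples
```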

Pipeline Flow

Group: Iterative Chain Construction: API Proposer → API Executor → API Selector → Workflow Updater
Group: Sample Synthesis: Final Workflow → Query/Response Synthesizer → Negative Sampling

System Modules

API Proposer (Iterative Chain Construction)

Filters a mini-batch of APIs to suggest the top-m candidates worth executing

Model or implementation: gpt-4.1-mini

API Executor (Iterative Chain Construction)

Executes the proposed APIs to generate execution reports

Model or implementation: gpt-4.1-mini (Tool-calling agent)

API Selector (Iterative Chain Construction)

Selects the single best API execution to augment the workflow, acting as the 'gradient' step

Model or implementation: gpt-4.1-mini

Workflow Updater (Iterative Chain Construction)

Updates the workflow and synthesizes the corresponding user query/response

Model or implementation: gpt-4.1-mini
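
The four modules above can be read as one loop that grows a workflow by one API per iteration. The following is a minimal sketch under strong assumptions: every function is a deterministic stand-in for what the summary describes as a gpt-4.1-mini call, random mini-batch sampling is assumed, and all names (`propose`, `execute`, `select`, `update`, `build_sample`) are hypothetical rather than taken from the ToolGrad code.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Workflow:
    steps: list[str] = field(default_factory=list)
    query: str = ""
    response: str = ""


def propose(workflow: Workflow, mini_batch: list[str], m: int) -> list[str]:
    """API Proposer stand-in: keep up to m APIs not yet in the workflow."""
    return [api for api in mini_batch if api not in workflow.steps][:m]


def execute(apis: list[str]) -> dict[str, str]:
    """API Executor stand-in: produce a fake execution report per API."""
    return {api: f"{api} executed successfully" for api in apis}


def select(reports: dict[str, str]) -> Optional[str]:
    """API Selector stand-in: pick one execution (here, first by name)."""
    return min(reports) if reports else None


def update(workflow: Workflow, api: str, report: str) -> Workflow:
    """Workflow Updater stand-in: extend the chain, refresh query/response."""
    steps = workflow.steps + [api]
    return Workflow(steps, f"Query that needs: {', '.join(steps)}", report)


def build_sample(api_db: list[str], iterations: int, batch_size: int,
                 m: int, seed: int = 0) -> Workflow:
    rng = random.Random(seed)
    workflow = Workflow()
    for _ in range(iterations):
        mini_batch = rng.sample(api_db, batch_size)  # assumed sampling scheme
        reports = execute(propose(workflow, mini_batch, m))
        choice = select(reports)
        if choice is not None:  # skip the step if nothing usable was proposed
            workflow = update(workflow, choice, reports[choice])
    return workflow


api_db = ["search_movie", "get_rating", "get_cast",
          "get_weather", "convert_currency"]
wf = build_sample(api_db, iterations=3, batch_size=3, m=2)
print(wf.steps)
print(wf.query)
```

Because every step appended to the workflow has already been executed, the resulting chain is valid by construction, which is the property the answer-first design relies on.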

Novel Architectural Elements

Inverted 'Answer-First' pipeline: Iteratively builds tool chains first, then generates queries
Gradient-inspired selection: Uses an 'API Selector' to choose the best API execution from a batch to update the state, analogous to a gradient step in optimization

Modeling

Base Model: Gemma-3 (1B, 4B, 12B)

Training Method: Supervised Fine-Tuning (SFT)

Trainable Parameters: Full fine-tuning

Training Data:

ToolGrad-5K dataset (5,000 samples)
90% training / 10% testing split
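
For reference, a 90/10 split of 5,000 samples yields 4,500 training and 500 test samples. The sketch below shows one common way to produce such a split. The shuffling and seed are assumptions, since the summary does not say how the split was drawn, and `split_dataset` is a hypothetical helper.

```python
import random


def split_dataset(samples, train_fraction=0.9, seed=0):
    """Shuffle with a fixed seed, then cut into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Integer ids stand in for the 5,000 ToolGrad-5K samples.
train, test = split_dataset(range(5000))
print(len(train), len(test))
```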

Key Hyperparameters:

epochs: 3

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolBench: Inverts generation to 'answer-first', achieving 100% pass rate vs 63.8% and lower cost
vs. TextGrad: Optimizes dataset generation (discrete tool choices) rather than prompt engineering
vs. AnyTool [not cited in paper]: Focuses on efficient generation via gradients rather than hierarchical API selection

Limitations

Relies on proprietary models (GPT-4.1-mini) for the data generation pipeline
API library limited to ToolBench subset (approx 8.7k APIs)
Did not investigate the tool-use capability of reasoning models in depth
Evaluation focuses primarily on API-call accuracy and covers complex multi-turn dialogue dynamics less thoroughly than some other benchmarks

Reproducibility

Code: https://zhongyi-zhou.github.io/toolgrad/

The authors state that the code and the ToolGrad-5K dataset will be open-sourced. The data generation process uses a proprietary model (gpt-4.1-mini). The API library is based on ToolBench, filtered to approximately 8.7k APIs.

📊 Experiments & Results

Evaluation Setup

Tool-use capability is evaluated on a held-out test set from ToolGrad-5K, with out-of-distribution (OOD) evaluation on ToolBench

Benchmarks:

ToolGrad-5K Test Set (Tool-use query execution) [New]
ToolBench (Out-of-distribution tool use)

Metrics:

Pass rate (Data Generation)
Number of ground-truth tool uses (Data Complexity)
Tool Recall
Success Rate (of tool calls)
Quality of Response (QoR - LLM judge)

Statistical methodology: A paired t-test is used when comparing base models with reasoning models.
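
Two of the metrics listed above can be written down directly. The definitions below are assumptions based on the usual meaning of the terms, since the summary does not spell them out: tool recall is taken as the fraction of ground-truth tools that appear among the predicted calls, and success rate as the fraction of tool calls that execute without error. Both function names are hypothetical.

```python
def tool_recall(predicted: list[str], ground_truth: list[str]) -> float:
    """Fraction of ground-truth tools that the model actually called."""
    truth = set(ground_truth)
    if not truth:
        return 1.0  # nothing to recall
    return len(truth & set(predicted)) / len(truth)


def success_rate(call_succeeded: list[bool]) -> float:
    """Fraction of issued tool calls that executed without error."""
    if not call_succeeded:
        return 0.0
    return sum(call_succeeded) / len(call_succeeded)


recall = tool_recall(
    predicted=["search_movie", "get_weather"],
    ground_truth=["search_movie", "get_rating"],
)
rate = success_rate([True, True, False, True])
print(recall, rate)
```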

Key Results

Benchmark	Metric	Baseline	This Paper	Δ (This Paper − Baseline)
Data Generation Efficiency: ToolGrad is compared against the DFS baseline from ToolBench on generation metrics.
Generation Statistics	Pass rate (%)	63.8	100.0	+36.2
Generation Statistics	Avg. Ground-truth Tool Uses	3.3	6.1	+2.8
Generation Statistics	LLM Calls per Sample	64.5	45.9	-18.6
Model Performance: Fine-tuned ToolGrad models are compared against proprietary baselines on the ToolGrad-5K test set.
ToolGrad-5K Test Set	Tool Recall (%)	85.2	99.2	+14.0
Out-of-Distribution (OOD) Performance: Models trained on ToolGrad-5K are evaluated on the ToolBench test set against models trained on ToolBench.
ToolBench (OOD)	Win Rate (%)	50.0	55.8	+5.8
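
The Δ column in the table above is the difference between the two reported values. The short script below recomputes it from those values and adds the relative change, which the table does not report. Only the numbers come from the table. The helper names are hypothetical.

```python
# (baseline, this paper) pairs copied from the table above.
rows = {
    "Pass rate": (63.8, 100.0),
    "Avg. ground-truth tool uses": (3.3, 6.1),
    "LLM calls per sample": (64.5, 45.9),
    "Tool recall": (85.2, 99.2),
    "Win rate (OOD)": (50.0, 55.8),
}


def delta(baseline: float, ours: float) -> float:
    """Absolute difference, rounded to one decimal like the table."""
    return round(ours - baseline, 1)


def relative_change(baseline: float, ours: float) -> float:
    """Change as a percentage of the baseline value."""
    return round((ours - baseline) / baseline * 100, 1)


for name, (baseline, ours) in rows.items():
    print(f"{name}: delta={delta(baseline, ours):+.1f}, "
          f"relative={relative_change(baseline, ours):+.1f}%")
```

Read this way, the drop from 64.5 to 45.9 LLM calls per sample is a reduction of roughly 29% relative to the baseline.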

Experiment Figures

Comparison of ToolGrad models (1B, 4B, 12B) against proprietary baselines (GPT-4, Claude, Gemini) on Tool Recall, Success Rate, and Quality of Response.

Performance comparison between base models and their 'reasoning' counterparts (e.g., GPT-4.1-mini vs o4-mini).

Main Takeaways

ToolGrad framework significantly reduces data generation costs (fewer LLM and tool calls) while increasing chain complexity and achieving a 100% pass rate.
Small models (1B parameters) fine-tuned on ToolGrad-5K outperform much larger proprietary models (GPT-4.1, Claude 3.7) on in-distribution tool-use tasks.
Models trained on ToolGrad-5K show strong OOD generalization, outperforming models trained on the original ToolBench dataset when evaluated on ToolBench.
Reasoning models (e.g., o4-mini) were found to underperform their base counterparts (e.g., GPT-4.1-mini) on tool-use tasks in this setup.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Tool-use/Function Calling in LLMs
Familiarity with Agentic workflows (ReAct, DFS)
Basic knowledge of synthetic data generation

Key Terms

textual gradients: Feedback from an LLM used to guide the optimization or generation process, analogous to numerical gradients in mathematical optimization

DFS: Depth-First Search—an algorithm used in baselines to explore possible tool-use paths by trying one branch deeply before backtracking

SFT: Supervised Fine-Tuning—training a model on a labeled dataset to adapt it to a specific task

OOD: Out-Of-Distribution—evaluating a model on data different from what it was trained on

RAG: Retrieval-Augmented Generation—enhancing LLM responses by retrieving relevant information from external sources

pass rate: The percentage of data generation attempts that successfully result in a valid training sample

ToolBench: A baseline dataset and framework for tool-use LLMs that uses a query-first generation approach

API: Application Programming Interface—a set of rules allowing different software entities to communicate