ToolMind Technical Report: A Large-Scale, Reasoning-Enhanced Tool-Use Dataset

📝 Paper Summary

Multi-call tool use with flexible plan Multi-turn w. user interactions Benchmark datasets

ToolMind is a large-scale tool-use dataset created via multi-agent simulation and a novel function graph that improves LLM performance on complex, multi-turn benchmarks by enforcing rigorous turn-level reasoning quality.

Core Problem

Existing tool-use datasets suffer from limited scale, lack of explicit reasoning traces, insufficient multi-turn dynamics (like clarification questions), and rely on coarse trajectory-level validation that misses intermediate errors.

Why it matters:

Real-world user requests are often under-specified, requiring agents to proactively ask for clarification rather than hallucinating parameters
Turn-level errors in training data (even in successful trajectories) propagate during training, degrading the model's ability to reason correctly step-by-step
Current open-source datasets lack the diversity and dynamic user-assistant interactions needed to train robust generalist agents

Concrete Example: A user might ask 'What's the weather in Beijing?' without specifying a time. A standard dataset might force an immediate API call with a hallucinated date. ToolMind captures the dynamic where the agent asks 'For which date?' and the user clarifies, preserving the reasoning chain.

Key Novelty

Graph-Guided Multi-Agent Simulation with Fine-Grained Filtering

Constructs a 'Function Graph' where edges represent semantic compatibility between one tool's output and another's input, enabling the sampling of complex, realistic function chains via random walks
Simulates interactions using three distinct agents (User, Assistant, Tool) to generate dynamic multi-turn conversations, including clarification requests and tool execution feedback
Applies a two-stage filtering process that validates not just the final outcome (trajectory-level) but also scrubs individual erroneous steps (turn-level) to ensure high-quality reasoning traces

Architecture

The Data Synthesis Pipeline: From function graph construction to multi-agent simulation and two-stage filtering.

Evaluation Highlights

+13.6% improvement on Tau-bench (Retail) for Qwen3-14B after fine-tuning on ToolMind compared to base model
Surpasses GPT-4o on BFCL-v4 (Multi-Turn) accuracy with Qwen3-14B (79.24% vs 72.82%)
+5.4% improvement on Tau-2-bench (Retail) for Qwen3-8B after fine-tuning compared to base model

Breakthrough Assessment

8/10

Significant because it addresses the data bottleneck for complex tool use. The method of using a function graph for chain sampling and rigorous turn-level filtering yields SOTA results on difficult benchmarks like Tau-bench.

⚙️ Technical Details

Problem Definition

Setting: Supervised Fine-Tuning (SFT) of Large Language Models for multi-turn tool use and function calling

Inputs: Multi-turn dialogue history including user queries, assistant reasoning (thoughts), and tool execution results

Outputs: Assistant response containing either natural language (to user) or structured function calls (to environment)

Pipeline Flow

Function Collection & Graph Construction (Standardize & Connect)
Intent Synthesis (Random Walk on Graph)
Multi-Agent Simulation (User + Assistant + Tool)
Quality Filtering (Trajectory + Turn Level)

System Modules

Function Graph Constructor

Builds a directed graph of 20k+ tools where edges indicate input-output parameter compatibility

Model or implementation: Embedding model + LLM validator

User Agent (Simulation)

Initiates interaction based on synthesized intent and drives conversation

Model or implementation: LLM (User Simulator)

Assistant Agent (Simulation)

Responds to user, generates reasoning traces, and calls functions

Model or implementation: LLM (Agent)

Turn-Level Filter

Scans generated trajectories to mask invalid or suboptimal steps

Model or implementation: LLM-based Judge

Novel Architectural Elements

Function Graph-based intent sampling: Generates training scenarios by walking a graph of parameter compatibilities rather than just random selection
Dual-layer quality filtering: Combines trajectory-level coherence checks with fine-grained turn-level validation to remove specific erroneous steps within valid trajectories

Modeling

Base Model: Qwen3-8B and Qwen3-14B

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Minimize prediction error on assistant responses.

Formally: Standard Cross-Entropy Loss on tokens of the anchored assistant response (think, content, tool calls).

Training Data:

160k synthesized instances (ToolMind)
200k augmented open-source instances (ToolACE, Glaive, etc.)
Total 360k samples

Key Hyperparameters:

learning_rate: 5e-6
batch_size: 64 (global)
warmup_ratio: 0.03
+ 1 more
sequence_length: 64k

Comparison to Prior Work

vs. ToolACE: ToolMind introduces a 'Function Graph' for chain sampling to ensure better parameter connectivity and complexity
vs. APIGen: ToolMind adds explicit 'turn-level' filtering to remove intermediate errors, not just trajectory-level success/failure
vs. General Synthetic Data: ToolMind explicitly simulates 'under-specified' user requests to train clarification capabilities
+ 1 more
vs. Glaive [not cited in paper]: Glaive focuses on single-turn or simple multi-turn; ToolMind emphasizes complex multi-turn reasoning with self-correction and clarification [not cited in paper]

Limitations

Dependency on the quality of the base LLMs used for simulation and filtering
Computational cost of graph construction and multi-agent simulation is likely high
Synthetic environment tools are not actually executable, relying on a Tool Agent simulator

Reproducibility

Code: https://huggingface.co/datasets/Nanbeige/ToolMind

Dataset available at https://huggingface.co/datasets/Nanbeige/ToolMind. Training framework uses OpenRLHF. Exact prompts for synthesis/filtering provided in Appendix A.

📊 Experiments & Results

Evaluation Setup

Tool-use evaluation across diverse benchmarks (Multi-turn, Agentic, Single-turn)

Benchmarks:

BFCL-v4 (Function Calling (Single-turn, Multi-turn, Agentic))
Tau-bench (Complex User-Agent Dialogue (Retail/Airline domains))
Tau-2-bench (Agentic Dialogue where user also has tool access)

Metrics:

Accuracy (Acc)
Pass Rate

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on BFCL-v4 shows Qwen3 models trained on ToolMind outperforming significantly larger baselines, particularly in multi-turn scenarios.
BFCL-v4 (Multi-Turn)	Accuracy	72.82	79.24	+6.42
BFCL-v4 (Overall)	Accuracy	83.65	86.13	+2.48
Results on Tau-bench (Retail) demonstrate strong improvements in handling complex domain-specific policies.
Tau-bench (Retail)	Pass Rate	57.8	71.4	+13.6
Ablation studies confirm the necessity of both synthetic data and rigorous quality filtering.
Tau-bench (Avg)	Pass Rate	46.1	52.8	+6.7
BFCL-v4 (Overall)	Accuracy	81.67	84.34	+2.67

Experiment Figures

Distribution analysis of data lengths and tool call counts

Main Takeaways

ToolMind significantly improves multi-turn and agentic capabilities (BFCL Multi-Turn, Tau-bench) compared to base models.
Turn-level quality filtering is critical; removing it leads to noticeable performance drops, proving that 'correct' trajectories can still contain harmful noise.
Combining synthesized graph-based data with augmented open-source data yields the best overall performance, suggesting complementary benefits.

📚 Prerequisite Knowledge

Prerequisites

Function Calling (FC) paradigm in LLMs
Supervised Fine-Tuning (SFT)
Multi-Agent interaction patterns

Key Terms

SFT: Supervised Fine-Tuning—training a pre-trained model on a specific labeled dataset to adapt it for a particular task

BFCL: Berkeley Function Calling Leaderboard—a benchmark evaluating LLMs' ability to invoke functions correctly across various scenarios

Tau-bench: A benchmark focusing on sustained dialogue and function-calling tasks in realistic user-agent scenarios

Function Graph: A directed graph where nodes are functions and edges represent semantic compatibility between one function's output and another's input

Turn-level filtering: Quality control process that validates individual steps within a conversation, rather than just the final outcome

Trajectory: The complete sequence of interactions (turns) between a user and an agent to solve a specific task