SynthTools: A Framework for Scaling Synthetic Tools for Agent Development

📝 Paper Summary

Multi-call tool use with flexible plan Benchmark

SynthTools is a scalable framework that generates, simulates, and audits thousands of diverse synthetic tools to create reliable environments for training and evaluating tool-use agents.

Core Problem

Training tool-use agents requires large-scale, diverse environments, but real-world APIs suffer from access limits, rate quotas, and instability, while existing hand-crafted benchmarks are too small.

Why it matters:

Real APIs (e.g., RapidAPI) are impractical for large-scale training due to cost, authentication requirements, and frequent deprecation
Current benchmarks such as ACEBench and τ-bench cover very few domains (e.g., 2 to 8), which limits how well agents trained on them generalize
Without a scalable source of reliable tools, researchers cannot rigorously test agents on long-horizon planning or complex compositional reasoning

Concrete Example: Directly prompting ChatGPT to generate tools yields trivial outputs like 'robotics.create_task'. In contrast, a real flight booking scenario needs complex state management (checking seat availability before booking), which static generation fails to simulate reliably.

Key Novelty

Hierarchical Domain Evolution for Synthetic Tools

Uses a structured top-down generation pipeline: Field → Subdomain → Task → Tool, ensuring tools are grounded in realistic workflows rather than being random functions
Decouples simulation into 'Parameter Validation' (gateway emulation) and 'Response Generation' (state-dependent logic) to ensure high reliability without real backend code
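
The top-down expansion described above can be pictured as a nested traversal. The following is a minimal sketch, not the authors' implementation: `expand` stands in for an LLM call, and the hard-coded `CATALOG`, the `Tool` record, and every name in it are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical stand-in for what an LLM might propose at each level.
CATALOG = {
    "healthcare": ["telemedicine"],
    "telemedicine": ["schedule a video consultation"],
    "schedule a video consultation": ["find_available_slots", "book_appointment"],
}


@dataclass(frozen=True)
class Tool:
    field: str
    subdomain: str
    task: str
    name: str


def expand(parent: str) -> list[str]:
    """Return the children of a node; a real pipeline would prompt an LLM here."""
    return CATALOG.get(parent, [])


def generate_tools(seed_fields: list[str]) -> list[Tool]:
    """Walk Field -> Subdomain -> Task -> Tool, recording each tool's lineage."""
    tools = []
    for field in seed_fields:
        for subdomain in expand(field):
            for task in expand(subdomain):
                for name in expand(task):
                    tools.append(Tool(field, subdomain, task, name))
    return tools


tools = generate_tools(["healthcare"])
for tool in tools:
    print(f"{tool.field} > {tool.subdomain} > {tool.task} > {tool.name}")
```

Because every tool carries its field, subdomain, and task, each one can be traced back to the workflow that motivated it, which is the property the hierarchy is meant to provide.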

Architecture

The hierarchical domain evolution procedure for generating tools

Evaluation Highlights

Generated ~6,000 synthetic tools spanning 100 distinct domains, exceeding prior work by >2× in both domains and tools per domain
Tool Simulation module achieves 94% accuracy in faithfully emulating tool responses across varied test cases (verified by human and LLM judges)
Tool Audit module achieves 99% accuracy in identifying incorrect simulator behaviors, ensuring the final toolset is highly reliable

Breakthrough Assessment

8/10

Significantly scales up the supply of diverse, reliable tools for agent training, addressing a major bottleneck (API scarcity and instability). The simulator's high reliability makes it a viable substitute for real APIs.

⚙️ Technical Details

Problem Definition

Setting: Generation of a synthetic tool ecosystem T comprising tool specifications, a response simulator S(t, args, metadata), and an auditor A to verify correctness

Inputs: Seed fields (e.g., 'healthcare', 'finance')

Outputs: A set of verified tools, their simulators, and generated tasks requiring tool use

Pipeline Flow

Tool Generation Module (Domain → Tools)
Tool Simulation Module (Calls → Responses)
Tool Audit Module (Verifies Responses)
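
One way to read the three-stage flow is as a filter: generate candidate tools, simulate a few probe calls against each, and keep only the tools whose responses all pass the audit. The sketch below assumes that reading. The function names, the dictionary-based tool format, and the toy stubs are hypothetical; in the framework itself each stage is LLM-driven.

```python
from typing import Callable

ToolSpec = dict  # hypothetical format: {"name": str, "required": [param names]}


def run_pipeline(
    seed_fields: list[str],
    generate: Callable[[str], list[ToolSpec]],
    simulate: Callable[[ToolSpec, dict], dict],
    audit: Callable[[ToolSpec, dict, dict], bool],
    probe_args: Callable[[ToolSpec], list[dict]],
) -> list[ToolSpec]:
    """Keep only tools whose simulated responses all pass the audit."""
    verified = []
    for field in seed_fields:
        for tool in generate(field):
            results = [(args, simulate(tool, args)) for args in probe_args(tool)]
            if all(audit(tool, args, resp) for args, resp in results):
                verified.append(tool)
    return verified


# Toy stand-ins for the three LLM-backed modules.
def toy_generate(field):
    return [
        {"name": f"{field}.lookup", "required": ["id"]},
        {"name": f"{field}.broken", "required": ["id"]},
    ]


def toy_simulate(tool, args):
    # The "broken" tool's simulator wrongly succeeds even when "id" is missing.
    if tool["name"].endswith("broken") or "id" in args:
        return {"status": "ok"}
    return {"status": "error"}


def toy_audit(tool, args, response):
    expected = "ok" if all(k in args for k in tool["required"]) else "error"
    return response["status"] == expected


def toy_probe_args(tool):
    return [{"id": 1}, {}]  # one valid call, one call missing a required field


verified = run_pipeline(["finance"], toy_generate, toy_simulate, toy_audit, toy_probe_args)
print([t["name"] for t in verified])
```

In this toy run the tool whose simulator accepts a call with a missing required parameter is dropped, and only the well-behaved tool survives.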

System Modules

Tool Generation

Synthesize diverse tool specifications from seed domains

Model or implementation: LLM (Specific model not stated, likely GPT-4 class based on context)

Tool Simulation

Emulate API behavior including parameter validation and state-dependent logic

Model or implementation: LLM (Prompted as simulator)
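
The split between parameter validation and state-dependent response generation can be illustrated with the flight-booking scenario. This is a sketch under stated assumptions: in the framework both stages are handled by a prompted LLM, whereas here hand-written rules stand in for it, and the spec format, status codes, and flight data are all hypothetical.

```python
def validate_params(spec: dict, args: dict) -> list[str]:
    """Stage 1 (gateway emulation): check required fields and types only."""
    errors = []
    for name, rule in spec["params"].items():
        if name not in args:
            if rule.get("required", False):
                errors.append(f"missing required parameter: {name}")
            continue
        if not isinstance(args[name], rule["type"]):
            errors.append(f"wrong type for parameter: {name}")
    return errors


def simulate(spec: dict, args: dict, metadata: dict) -> dict:
    """Run validation first; only valid calls reach the state-dependent stage."""
    errors = validate_params(spec, args)
    if errors:
        return {"status": 400, "errors": errors}

    # Stage 2 (response generation): the outcome depends only on metadata.
    seats_left = metadata["flights"].get(args["flight_id"])
    if seats_left is None:
        return {"status": 404, "error": "unknown flight"}
    if seats_left < args["seats"]:
        return {"status": 409, "error": "not enough seats"}
    return {"status": 200, "booked": args["seats"]}


# Hypothetical tool spec and world state.
BOOK_FLIGHT = {
    "name": "book_flight",
    "params": {
        "flight_id": {"type": str, "required": True},
        "seats": {"type": int, "required": True},
    },
}
METADATA = {"flights": {"SY100": 2}}

print(simulate(BOOK_FLIGHT, {"flight_id": "SY100"}, METADATA))
print(simulate(BOOK_FLIGHT, {"flight_id": "SY100", "seats": 3}, METADATA))
print(simulate(BOOK_FLIGHT, {"flight_id": "SY100", "seats": 1}, METADATA))
```

Keeping the two stages separate means a malformed call is rejected before any state is consulted, so validation errors and logic errors can be checked independently.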

Tool Audit

Verify the correctness of simulated responses to filter out unreliable tools

Model or implementation: LLM (Judge)
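
A judge-based audit step might look roughly like the following. The prompt wording, the PASS/FAIL protocol, and the `judge` callable are assumptions made for this sketch; the summary does not give the actual judge prompt. The `fake_judge` function is a trivial stand-in so the example runs without any model.

```python
import json
from typing import Callable

# Hypothetical prompt template; not taken from the paper.
JUDGE_TEMPLATE = """You are auditing a simulated tool.
Tool specification:
{spec}
Call arguments:
{args}
Simulated response:
{response}
Answer with exactly one word: PASS or FAIL."""


def audit_response(spec: dict, args: dict, response: dict,
                   judge: Callable[[str], str]) -> bool:
    """Ask the judge whether the response matches the spec.

    Anything other than a clear PASS is treated as a failure, so an
    unparseable verdict never lets an unreliable response through.
    """
    prompt = JUDGE_TEMPLATE.format(
        spec=json.dumps(spec, indent=2),
        args=json.dumps(args),
        response=json.dumps(response),
    )
    return judge(prompt).strip().upper() == "PASS"


def fake_judge(prompt: str) -> str:
    # Stand-in for an LLM: pass only if the response includes a booking id.
    return "pass\n" if "booking_id" in prompt else "fail"


spec = {"name": "book_flight", "description": "Books a seat and returns a confirmation."}
good = {"status": "ok", "booking_id": "B1"}
bad = {"status": "ok"}

print(audit_response(spec, {"flight_id": "SY100"}, good, fake_judge))
print(audit_response(spec, {"flight_id": "SY100"}, bad, fake_judge))
```

Passing the judge in as a callable keeps the audit logic testable with a deterministic stub and lets a real model be swapped in later.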

Novel Architectural Elements

Hierarchical domain evolution pipeline that derives tools from practitioner workflows rather than random generation
State-aware simulation mechanism that separates schema validation from logic deduction using explicit metadata

Modeling

Base Model: Not explicitly specified (generic LLM used for generation/simulation)

Reproducibility

Code: https://github.com/ny2336/SynthTools

📊 Experiments & Results

Evaluation Setup

Validation of the tool generation pipeline's scale/diversity and the simulator's reliability

Benchmarks:

ACEBench (tool-use simulation; used here to test simulator fidelity against ground truth)
SynthTools Internal Benchmark (Synthetic tool generation and simulation) [New]

Metrics:

Simulation Accuracy (Agreement with ground truth/rules)
Audit Accuracy (Ability to detect errors)
Diversity (Number of fields/tools)
Statistical methodology: Manual inspection and LLM-as-a-judge verification
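
Both accuracy metrics can be read as agreement rates: the fraction of cases where a prediction (a simulated response's correctness, or an auditor verdict) matches a reference label. The helper below is a sketch of that reading, and the labels in the example are made up for illustration and are not results from the paper.

```python
def agreement(predicted: list, reference: list) -> float:
    """Fraction of positions where the prediction equals the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot compute agreement on an empty set")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Made-up example: auditor verdicts vs. human labels on four simulated responses.
auditor_verdicts = [True, False, True, True]
human_labels = [True, False, False, True]

print(agreement(auditor_verdicts, human_labels))
```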

Key Results

Scale and diversity analysis showing SynthTools significantly exceeds prior hand-crafted or API-scraped baselines.
Benchmark	Metric	Baseline	This Paper	Δ
N/A (Dataset statistics)	Number of Fields	8	100	+92
N/A (Dataset statistics)	Tools per Field	500	1000	+500

Reliability experiments validating the Tool Simulation module against both SynthTools-generated tools and external benchmarks.
Benchmark	Metric	Baseline	This Paper	Δ
SynthTools Internal	Accuracy	N/A	97	N/A
ACEBench	Accuracy	100	94	-6
SynthTools Internal	Accuracy	N/A	99	N/A

Experiment Figures

Comparison of Scale and Diversity vs. Prior Work (Scatter plot)

t-SNE visualization of tool embeddings within the E-commerce domain

Main Takeaways

The hierarchical generation pipeline produces diverse tools: embedding-based deduplication removed only 9% of them, so 91% of the generated tools were retained as distinct
The LLM-based simulator is highly reliable (94% accuracy on ACEBench), making it a feasible replacement for hard-coded sandboxes
Even state-of-the-art models struggle with tasks generated from these tools, confirming they present a meaningful challenge (though exact agent performance numbers are not the primary focus)
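
The embedding-based deduplication mentioned in the first takeaway can be sketched as a greedy filter over cosine similarity. The summary does not state which embedding model or similarity threshold was used, so the 0.95 threshold and the three-dimensional toy vectors below are assumptions for illustration only.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def deduplicate(embeddings: dict[str, list[float]], threshold: float = 0.95) -> list[str]:
    """Greedily keep a tool only if it is not too similar to any tool already kept."""
    kept: list[str] = []
    for name, vec in embeddings.items():
        if all(cosine(vec, embeddings[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy embeddings; a real pipeline would embed each tool's name and description.
toy_embeddings = {
    "book_flight": [1.0, 0.0, 0.0],
    "reserve_flight": [0.99, 0.05, 0.0],  # near-duplicate of book_flight
    "get_weather": [0.0, 1.0, 0.0],
}

kept = deduplicate(toy_embeddings)
removed_fraction = 1 - len(kept) / len(toy_embeddings)
print(kept, round(removed_fraction, 2))
```

The greedy pass depends on iteration order: the first tool seen in a cluster of near-duplicates is the one that survives.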

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM tool use (function calling)
Basic knowledge of API structures (parameters, schemas, HTTP status codes)
Familiarity with agent evaluation benchmarks

Key Terms

Synthetic tools: AI-generated specifications of software functions (APIs) that do not exist in the real world but mimic realistic interfaces

Tool Simulation: The process of generating a valid response (output) for a tool call given specific input arguments and state metadata, without executing real code

Tool Audit: An automated verification step using a 'Judge LLM' to check if the simulated tool response matches the expected behavior defined in the tool specification

Metadata: Contextual information (e.g., a database of available flights) used by the simulator to determine state-dependent outcomes (e.g., booking success vs. failure)

Deduplication: Removing tools that are semantically too similar to ensure diversity in the generated dataset

Schema validation: Checking if input parameters match the required data types and constraints (e.g., string vs. integer, required fields present)