ToolACE: Winning the points of LLM function calling

📝 Paper Summary

Synthetic Data Generation for Agents
Tool-use post-training

ToolACE automates the creation of high-quality tool-learning data by evolving diverse synthetic APIs and generating dialogs where complexity is dynamically adjusted based on the target model's current capability.

Core Problem

Existing synthetic data pipelines for function calling lack API diversity and often generate samples that are either too simple or too complex for specific models, hindering effective generalization.

Why it matters:

Real-world APIs are vast and rapidly changing, requiring models to generalize zero-shot rather than memorizing a small set of public APIs
Models learn best when data complexity slightly exceeds their current capability; static datasets fail to provide this tailored curriculum
Inaccurate or inconsistent synthetic data causes models to hallucinate parameters or misunderstand API constraints

Concrete Example: A small 0.5B model may be overwhelmed by data requiring long dependencies between APIs, while a 70B model learns little from straightforward single-turn queries. Standard pipelines generate one fixed dataset regardless of the target model, so part of the training data is unproductive in both cases.

Key Novelty

Self-Evolving Tool Synthesis and Self-Guided Complexity

Tool Self-Evolution Synthesis (TSS): Synthesizes a massive pool of 26,507 diverse APIs by evolving them through a speciation-adaptation-evolution process rooted in an 'API context tree' derived from pre-training documents
Self-Guided Dialog Generation (SDG): Uses the target LLM itself as an evaluator to measure data complexity via loss; if a sample is too easy or too hard, the generation agents adjust the query complexity dynamically
Dual-Layer Verification (DLV): Combines rule-based checks (syntax, parameters) with model-based checks (hallucination, consistency) to ensure high data quality without expensive human annotation

Architecture

The overall ToolACE pipeline, illustrating the flow from API synthesis to dialog generation and verification.

Evaluation Highlights

ToolACE-8B achieves 84.67% accuracy on the Berkeley Function Calling Leaderboard (BFCL), outperforming GPT-4-1106-Preview (83.25%) and Llama-3-8B-Instruct (77.83%)
On APIBank (Level-1), ToolACE-8B reaches 76.51% accuracy, surpassing GPT-3.5-Turbo (72.24%) and far outperforming the base Llama-3-8B (51.86%)
In generalization tests (BFCL), ToolACE-8B outperforms the specialized ToolLLaMA-2-7B by 29.86 points (84.67% vs 54.81%), demonstrating the value of diverse synthetic APIs

Breakthrough Assessment

9/10

Significantly advances synthetic data generation for agents by introducing evolutionary API synthesis and model-aware complexity control. The performance of an 8B model beating GPT-4 on specific benchmarks is a strong validation.

⚙️ Technical Details

Problem Definition

Setting: Supervised Fine-Tuning (SFT) for Function Calling

Inputs: User query x and a list of available tool definitions

Outputs: Response y containing correct API calls [t_1, ..., t_{n_y}] or conversational text
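To make the input/output setting concrete, here is a minimal Python sketch of one training pair. The tool `get_weather`, its JSON-schema-like layout, and the helper `build_sft_example` are hypothetical illustrations, not the paper's actual data format.

```python
import json

# Hypothetical tool definition in a JSON-schema-like style (an assumption, not from the paper).
tools = [
    {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }
]

query = "What's the weather in Paris in celsius?"

# Case 1: the target y is a list of API calls.
response_calls = [
    {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
]

# Case 2: the target y is conversational text (no tool call is appropriate).
response_text = "Happy to help! Which city are you asking about?"


def build_sft_example(query, tools, response):
    """Pack one (input, target) pair; API calls are serialized as JSON text."""
    target = response if isinstance(response, str) else json.dumps(response)
    return {"tools": json.dumps(tools), "query": query, "target": target}


call_example = build_sft_example(query, tools, response_calls)
text_example = build_sft_example("What's the weather?", tools, response_text)
```

Both cases share the same input layout, so a single model learns when to call a tool and when to answer in plain text.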

Pipeline Flow

Tool Self-Evolution (Synthesize diverse APIs)
Self-Guided Dialog Generation (Generate conversation data)
Dual-Layer Verification (Filter bad data)

System Modules

Tool Self-Evolution Synthesis (TSS)

Generate a diverse pool of API definitions

Model or implementation: Frontier LLM (e.g., GPT-4 class) acting as generator
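The data flow of the speciation-adaptation-evolution process can be illustrated with a toy Python sketch. In the paper each stage is driven by an LLM and the context tree comes from pre-training documents; here the tree is hard-coded and random choices stand in for the LLM, so every name below (`CONTEXT_TREE`, `adapt`, `evolve`, `synthesize`) is a hypothetical stand-in rather than the authors' implementation.

```python
import random

# Speciation (toy): an API context tree, domain -> sub-domain -> functionalities.
# Hard-coded here; the paper derives it from pre-training documents.
CONTEXT_TREE = {
    "travel": {"flights": ["search", "book"], "hotels": ["search", "cancel"]},
    "finance": {"stocks": ["quote", "history"]},
}


def adapt(tree, rng):
    """Adaptation (toy): pick one path in the tree and assign it to a new API."""
    domain = rng.choice(sorted(tree))
    sub = rng.choice(sorted(tree[domain]))
    func = rng.choice(tree[domain][sub])
    return {"name": f"{domain}_{sub}_{func}", "parameters": {"query": "string"}}


def evolve(api, rng):
    """Evolution (toy): return a mutated copy with one extra parameter."""
    extra = rng.choice(["limit", "language", "sort_by"])
    mutated = {"name": api["name"], "parameters": dict(api["parameters"])}
    mutated["parameters"][extra] = "string"
    return mutated


def synthesize(tree, n, seed=0):
    """Build a pool of n distinct APIs, deduplicated on name and parameter names."""
    rng = random.Random(seed)
    pool = {}
    while len(pool) < n:
        api = evolve(adapt(tree, rng), rng)
        key = (api["name"], tuple(sorted(api["parameters"])))
        pool[key] = api
    return list(pool.values())


api_pool = synthesize(CONTEXT_TREE, n=5)
```

The toy tree admits only 18 distinct APIs (6 paths times 3 mutations), so `n` must stay at or below that for the loop to terminate.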

Complexity Evaluator (Dialog Generation)

Assess if a generated dialog is suitable for the target model

Model or implementation: Target LLM (the model to be finetuned, M)

Multi-Agent Generator (Dialog Generation)

Generate conversational data based on complexity feedback

Model or implementation: Three agents: User, Assistant, Tool (simulated by LLM)

Dual-Layer Verification (DLV)

Filter out invalid or low-quality data

Model or implementation: Rule Checker (Python scripts) + Model Checker (LLM agents)
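A rule-based layer of this kind could look like the following Python sketch. The specific checks (unknown API name, missing required parameter, unexpected parameter, type and enum mismatch) and the tool schema layout are assumptions for illustration; the summary only states that the rule checker covers syntax and parameters.

```python
# Map JSON-schema type names to Python types for a basic type check.
_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

# Hypothetical tool definition used for the demonstration below.
TOOLS = [
    {
        "name": "book_table",
        "parameters": {
            "properties": {
                "restaurant": {"type": "string"},
                "guests": {"type": "integer"},
                "area": {"type": "string", "enum": ["indoor", "outdoor"]},
            },
            "required": ["restaurant", "guests"],
        },
    }
]


def rule_check(call, tools):
    """Return a list of rule violations for one API call (empty list means pass)."""
    spec = next((t for t in tools if t["name"] == call.get("name")), None)
    if spec is None:
        return [f"unknown API: {call.get('name')!r}"]
    props = spec["parameters"]["properties"]
    required = spec["parameters"].get("required", [])
    args = call.get("arguments", {})
    errors = [f"missing required parameter: {p}" for p in required if p not in args]
    for name, value in args.items():
        if name not in props:
            errors.append(f"unexpected parameter: {name}")
            continue
        expected = _TYPES.get(props[name].get("type"))
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if expected and (not isinstance(value, expected) or is_stray_bool):
            errors.append(f"wrong type for {name}")
        if "enum" in props[name] and value not in props[name]["enum"]:
            errors.append(f"value of {name} not in enum")
    return errors


good = {"name": "book_table", "arguments": {"restaurant": "Luigi's", "guests": 2}}
bad = {"name": "book_table", "arguments": {"guests": "two", "table_id": 7}}
```

Samples that pass these cheap checks would then go to the model-based layer, which handles semantic issues such as hallucinated values that rules cannot detect.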

Novel Architectural Elements

Self-guided complexity loop: The target model's loss is directly used as a feedback signal to the data generation agents to adjust the difficulty of subsequent samples during the data creation phase
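The feedback loop can be sketched in Python as follows. This assumes the per-token log-probabilities of the response have already been obtained from the target model; the threshold values and the function names `mean_nll` and `complexity_feedback` are hypothetical, chosen only to show the control flow.

```python
def mean_nll(token_logprobs):
    """Average negative log-likelihood of the response tokens under the target model."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def complexity_feedback(loss, low, high):
    """Translate the target model's loss into an instruction for the generation agents."""
    if loss < low:
        return "increase"  # sample is too easy for this model
    if loss > high:
        return "decrease"  # sample is too hard for this model
    return "keep"          # sample falls inside the useful band


# Hypothetical band; real thresholds would be calibrated per target model.
LOW, HIGH = 0.5, 2.0

easy_sample = [-0.1, -0.2, -0.1]   # model is very confident on every token
hard_sample = [-3.0, -2.5, -4.0]   # model finds the response very unlikely
decisions = [
    complexity_feedback(mean_nll(easy_sample), LOW, HIGH),
    complexity_feedback(mean_nll(hard_sample), LOW, HIGH),
]
```

Because the signal comes from the model being fine-tuned, the same generation agents would produce different datasets for a small and a large target model.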

Modeling

Base Model: Llama-3-8B-Instruct

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Standard language modeling loss.

Formally: Minimize negative log-likelihood of the target tokens given the input.
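In standard notation this objective can be written as below. The symbols are assumptions for illustration: x is the user query, T the list of tool definitions, y the target response of length |y|, and θ the model parameters; the summary does not give the paper's exact notation.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid x, T, y_{<t}\right)
```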

Training Data:

Synthesized data using ToolACE pipeline
Includes single, parallel, dependent, and non-tool dialogs

Key Hyperparameters:

epochs: 2
batch_size: 128
learning_rate: 5e-6
max_length: 8192
warmup_ratio: 0.03
weight_decay: 0.001
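For reference, the reported values can be collected in a small Python config, with a helper that derives step counts from them. This is a sketch under assumptions: `batch_size` is treated as the global batch size, warmup steps are rounded up, and the dataset size passed to `schedule` is a made-up number, since the summary does not report it.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedHyperparams:
    """Hyperparameters as listed in the summary."""
    epochs: int = 2
    batch_size: int = 128        # assumed to be the global batch size
    learning_rate: float = 5e-6
    max_length: int = 8192
    warmup_ratio: float = 0.03
    weight_decay: float = 0.001


def schedule(cfg, num_samples):
    """Return (total optimizer steps, warmup steps) for a dataset of num_samples."""
    steps_per_epoch = math.ceil(num_samples / cfg.batch_size)
    total_steps = steps_per_epoch * cfg.epochs
    warmup_steps = math.ceil(total_steps * cfg.warmup_ratio)
    return total_steps, warmup_steps


cfg = ReportedHyperparams()
# 12,800 samples is a hypothetical dataset size used only for the demonstration.
total_steps, warmup_steps = schedule(cfg, num_samples=12_800)
```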

Compute: 8x H800 80G GPUs used for training

Comparison to Prior Work

vs. ToolLLaMA: ToolACE synthesizes its own APIs (TSS) rather than relying on existing RapidAPI data, leading to better zero-shot generalization and higher API diversity.
vs. Standard SFT (e.g., Glaive): ToolACE uses the target model's own loss to guide data complexity (SDG), ensuring the curriculum is neither too hard nor too easy.
vs. GPT-4: ToolACE-8B achieves comparable performance with far fewer parameters (8B vs. GPT-4's undisclosed but much larger size), which the summary attributes to higher-quality data.

Limitations

The paper focuses mainly on 8B parameter models; scaling laws for larger models are not fully explored.
Reliance on a 'frontier LLM' for the initial API synthesis and agent simulation means the pipeline costs could be high.
Evaluation is limited to BFCL and APIBank; real-world deployment performance (e.g., latency, robustness to adversarial inputs) is not extensively tested.

Reproducibility

Code: https://huggingface.co/Team-ACE

The model and a subset of the data are available at https://huggingface.co/Team-ACE. Code for the pipeline itself is not explicitly linked in the main text; only the HuggingFace repository is provided. Training used high-end hardware (H800s).

📊 Experiments & Results

Evaluation Setup

Evaluation on standardized function calling benchmarks measuring accuracy of API selection and parameter formatting.

Benchmarks:

Berkeley Function Calling Leaderboard (BFCL): diverse tool use (Java, JavaScript, Python, SQL, etc.)
APIBank: tool-augmented LLM evaluation (Level-1 and Level-2)

Metrics:

Accuracy (Acc)
Recall
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
BFCL Performance: ToolACE-8B demonstrates state-of-the-art performance among open-source models and rivals proprietary models.
BFCL (Overall)	Accuracy	77.83	84.67	+6.84
BFCL (Simple)	Accuracy	81.93	91.87	+9.94
BFCL (Parallel)	Accuracy	81.18	90.06	+8.88
APIBank Performance: Significant improvements on both Level-1 (basic) and Level-2 (complex) tasks.
APIBank (Level-1)	Accuracy	51.86	76.51	+24.65
APIBank (Level-2)	Accuracy	34.02	47.93	+13.91
Ablation Studies: Validating the contribution of pipeline components (TSS, SDG, DLV).
BFCL (Overall)	Accuracy	81.01	84.67	+3.66
BFCL (Overall)	Accuracy	80.45	84.67	+4.22
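The Δ column can be re-derived from the Baseline and This Paper columns with a short Python check. The numbers are copied from the table above; `Decimal` is used so that two-decimal values subtract exactly.

```python
from decimal import Decimal

# (benchmark, baseline, this paper, reported delta), copied from the table above.
rows = [
    ("BFCL (Overall)", "77.83", "84.67", "+6.84"),
    ("BFCL (Simple)", "81.93", "91.87", "+9.94"),
    ("BFCL (Parallel)", "81.18", "90.06", "+8.88"),
    ("APIBank (Level-1)", "51.86", "76.51", "+24.65"),
    ("APIBank (Level-2)", "34.02", "47.93", "+13.91"),
    ("BFCL (Overall), ablation row 1", "81.01", "84.67", "+3.66"),
    ("BFCL (Overall), ablation row 2", "80.45", "84.67", "+4.22"),
]


def delta(baseline, ours):
    """Signed difference in accuracy points, formatted like the table's delta column."""
    return f"{Decimal(ours) - Decimal(baseline):+}"


# Collect any row whose reported delta disagrees with the recomputed one.
mismatches = [name for name, b, o, reported in rows if delta(b, o) != reported]
```

An empty `mismatches` list means every reported delta is consistent with its two source columns.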

Experiment Figures

Correlation between data characteristics (number of APIs, candidate APIs, difference score) and model loss (complexity).

Comparison of pass rates on BFCL for different model sizes and methods.

Main Takeaways

Diverse synthetic APIs (TSS) are crucial for zero-shot generalization; training only on existing public APIs (like in ToolLLaMA) limits performance on unseen tools.
Complexity matching (SDG) works: training on data that is neither too easy nor too hard for the specific model yields better results than training on data of random complexity.
The 8B model trained with ToolACE is competitive with or superior to GPT-4-1106-Preview on function calling benchmarks, suggesting data quality is more critical than model size for this specific capability.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and Instruction Tuning
Function Calling / Tool Use in Agents
Synthetic Data Generation pipelines

Key Terms

BFCL: Berkeley Function Calling Leaderboard, a comprehensive evaluation set for assessing LLM tool-use capabilities across various coding languages and scenarios

APIBank: A benchmark for evaluating tool-augmented LLMs, divided into levels of difficulty (Level-1 for single/simple calls, Level-2 for multi-turn/complex)

Zero-shot generalization: The ability of the model to use tools/APIs it has never seen during training, relying only on the provided definitions

Hallucination: In this context, when a model invents parameter values not present in the user query or system prompt

Speciation: The initial step in ToolACE's API synthesis where an 'API context tree' is created to define possible domains and functionalities from raw documents

Adaptation: The step where specific functionalities from the context tree are assigned to individual synthetic APIs to ensure distinct capabilities

Evolution: The iterative process of refining and diversifying synthetic APIs (e.g., adding constraints, mutating parameters) based on feedback

Model-based Checker: Using an LLM agent to verify semantic correctness (e.g., consistency, absence of hallucinations) where rule-based checks fail