TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments

📝 Paper Summary

Tool-use post-training, Synthetic data generation

Toucan synthesizes a massive dataset of 1.5 million agent trajectories by interacting with nearly 500 real-world Model Context Protocol servers to train LLMs in complex, multi-turn tool use.

Core Problem

Open-source tool-agent development is hindered by a lack of high-quality, permissively licensed training data that captures the complexity of real-world tool interactions.

Why it matters:

Existing datasets lack tool diversity, authentic tool responses, or multi-turn complexity, limiting agent capability in production environments
Current approaches often rely on simulated toolsets or simple single-turn interactions, failing to prepare agents for edge cases and failures

Concrete Example: Previous datasets might simulate a weather API call with a hallucinated JSON response. Toucan connects to a real Weather MCP server, executes the actual tool, captures the real API output (including errors), and generates a trajectory based on that ground truth.
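To make the contrast concrete, here is a minimal sketch of what a real MCP tool call looks like on the wire and how its result, including a failure, could be recorded. MCP uses JSON-RPC 2.0 with a `tools/call` method, and tool-level failures are flagged by `isError` in the result. The tool name `get_current_weather`, its arguments, the helper names, and the two responses are hypothetical illustrations, not taken from the paper or from a real server.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for MCP's `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def record_observation(response: dict) -> dict:
    """Turn a server response into a trajectory observation, keeping errors."""
    if "error" in response:  # protocol-level JSON-RPC error
        return {"ok": False, "text": response["error"].get("message", "")}
    result = response.get("result", {})
    text = "\n".join(
        part.get("text", "")
        for part in result.get("content", [])
        if part.get("type") == "text"
    )
    # Tool-level failures arrive inside a normal result, flagged by `isError`.
    return {"ok": not result.get("isError", False), "text": text}


# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call(1, "get_current_weather", {"city": "Seattle"})
print(json.dumps(request, indent=2))

# Hand-written example responses (not real server output).
success = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "12 C, light rain"}], "isError": False},
}
failure = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "Unknown city"}], "isError": True},
}
print(record_observation(success))
print(record_observation(failure))
```

Keeping the failed observation rather than discarding it is what lets a trajectory show how the agent reacts to a real error.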

Key Novelty

Toucan Dataset & Pipeline

Leverages the Model Context Protocol (MCP) to standardize connections to diverse real-world tools (filesystems, databases, APIs) rather than custom implementations
Uses a 'self-simulation' pipeline where teacher models generate tasks based on real MCP specs, execute them against live servers, and filter results based on execution success
Introduces extension mechanisms for 'irrelevance' (rejecting unsolvable queries) and multi-turn dialogue to simulate realistic user-agent interactions

Architecture

Figure: The complete Toucan data construction pipeline, from MCP server onboarding to final trajectory filtering and extensions.

Evaluation Highlights

Toucan-tuned models achieve state-of-the-art performance on the MCP-Universe benchmark, consistently outperforming leading models of comparable size
Outperforms larger closed-source models on the BFCL V3 benchmark in function calling accuracy across single-turn and multi-turn scenarios
Demonstrates substantial improvements on τ-Bench and τ²-Bench in tool selection, execution fidelity, and multi-turn reasoning

Breakthrough Assessment

8/10

Significantly scales up open-source agent training data using real-world protocols (MCP) rather than simulations. The reliance on live execution for ground truth is a strong differentiator.

⚙️ Technical Details

Problem Definition

Setting: Supervised fine-tuning of LLMs for agentic tool use (function calling and reasoning)

Inputs: User query plus a set of available tools defined by MCP specifications

Outputs: A trajectory of thought (reasoning), tool calls, and final response
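The input/output setting above can be sketched as a small data structure. This is an illustrative layout only; the class and field names (`ToolSpec`, `Step`, `Trajectory`, and so on) are assumptions and do not reflect the released dataset schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolSpec:
    """One tool as advertised by an MCP server (name, description, JSON schema)."""
    name: str
    description: str
    input_schema: dict[str, Any]


@dataclass
class Step:
    """One reason-act-observe step inside a trajectory."""
    thought: str
    tool_name: str
    arguments: dict[str, Any]
    observation: str


@dataclass
class Trajectory:
    """Input (query + tools) and output (steps + final response) of one example."""
    query: str
    tools: list[ToolSpec]
    steps: list[Step] = field(default_factory=list)
    final_response: str = ""

    def called_tools(self) -> list[str]:
        """Names of the tools invoked, in call order."""
        return [step.tool_name for step in self.steps]


# Hypothetical example values.
weather = ToolSpec(
    name="get_current_weather",
    description="Return current weather for a city.",
    input_schema={"type": "object", "properties": {"city": {"type": "string"}}},
)
example = Trajectory(query="Do I need an umbrella in Seattle?", tools=[weather])
example.steps.append(
    Step(
        thought="I should look up the current weather.",
        tool_name="get_current_weather",
        arguments={"city": "Seattle"},
        observation="12 C, light rain",
    )
)
example.final_response = "Yes, light rain is reported in Seattle."
print(example.called_tools())
```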

Pipeline Flow

MCP Server Onboarding (Filter & Test)
Task Synthesis (Generate Questions)
Task Filtering (Quality Check)
Trajectory Generation (Agent Execution)
Post-Filtering (Validation)
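The five stages above compose into a simple filter-and-generate loop. The sketch below only shows the control flow; every stage is passed in as a callable, and the toy stand-ins at the bottom are hypothetical placeholders for the real LLM-driven and server-driven steps. It also simplifies by synthesizing tasks one server at a time.

```python
from typing import Any, Callable, Iterable


def run_pipeline(
    servers: Iterable[Any],
    onboard: Callable[[Any], bool],
    synthesize: Callable[[Any], Iterable[Any]],
    keep_task: Callable[[Any], bool],
    execute: Callable[[Any, Any], Any],
    keep_trajectory: Callable[[Any], bool],
) -> list[Any]:
    """Chain the five stages; each stage can drop items before the next runs."""
    trajectories = []
    for server in servers:
        if not onboard(server):                # 1. MCP server onboarding
            continue
        for task in synthesize(server):        # 2. task synthesis
            if not keep_task(task):            # 3. task filtering
                continue
            trajectory = execute(server, task)  # 4. trajectory generation
            if keep_trajectory(trajectory):    # 5. post-filtering
                trajectories.append(trajectory)
    return trajectories


# Toy stand-ins so the control flow can be run end to end.
result = run_pipeline(
    servers=["weather", "broken"],
    onboard=lambda s: s != "broken",
    synthesize=lambda s: [f"{s}: easy question", f"{s}: vague"],
    keep_task=lambda t: "vague" not in t,
    execute=lambda s, t: {"task": t, "tool_calls": 1},
    keep_trajectory=lambda tr: tr["tool_calls"] > 0,
)
print(result)
```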

System Modules

MCP Server Onboarding

Filter raw MCP servers for streamable HTTP support and no-auth requirements; validate with test questions

Model or implementation: None (Rule-based + Test Execution)
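A minimal sketch of this onboarding step, assuming each candidate server is described by a small metadata record. The `ServerInfo` fields, the transport label `"streamable-http"`, and the `smoke_test` callback are hypothetical; the paper's actual metadata format is not given in this summary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ServerInfo:
    name: str
    transport: str        # assumed labels, e.g. "streamable-http" or "stdio"
    requires_auth: bool


def passes_static_filter(server: ServerInfo) -> bool:
    """Rule-based check: streamable HTTP transport and no credentials needed."""
    return server.transport == "streamable-http" and not server.requires_auth


def onboard(servers: list[ServerInfo], smoke_test: Callable[[ServerInfo], bool]) -> list[ServerInfo]:
    """Keep servers that pass the static rules and then answer a test question."""
    return [s for s in servers if passes_static_filter(s) and smoke_test(s)]


candidates = [
    ServerInfo("weather", "streamable-http", requires_auth=False),
    ServerInfo("crm", "streamable-http", requires_auth=True),
    ServerInfo("local-files", "stdio", requires_auth=False),
    ServerInfo("flaky", "streamable-http", requires_auth=False),
]
# Stand-in for real test execution: pretend the "flaky" server fails its test question.
kept = onboard(candidates, smoke_test=lambda s: s.name != "flaky")
print([s.name for s in kept])
```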

Task Synthesis (Generation)

Generate diverse user queries (tasks) based on server capabilities (single-server, multi-server, and featured-server strategies)

Model or implementation: Mistral-Small, DevStral-Small, GPT-OSS, Kimi-K2, Qwen3-32B

Task Filtering (Filtering)

Annotate and filter tasks based on difficulty, realism, and clarity

Model or implementation: Kimi-K2 (selected for correlation/cost balance)
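One way to picture this stage: the annotator model assigns each task a score per dimension, and a threshold rule decides what survives. The 1 to 5 scale and the minimum values below are assumptions for illustration, not the thresholds used in the paper.

```python
# Hypothetical minimum score per annotated dimension (1-5 scale assumed).
DEFAULT_MINIMUMS = {"difficulty": 2, "realism": 3, "clarity": 3}


def keep_task(scores: dict[str, int], minimums: dict[str, int] = DEFAULT_MINIMUMS) -> bool:
    """Keep a task only if every annotated dimension meets its minimum."""
    return all(scores.get(dim, 0) >= low for dim, low in minimums.items())


annotated = [
    {"task": "Compare tomorrow's forecast for two cities", "scores": {"difficulty": 3, "realism": 4, "clarity": 5}},
    {"task": "Do the thing with the data", "scores": {"difficulty": 2, "realism": 2, "clarity": 1}},
]
filtered = [item["task"] for item in annotated if keep_task(item["scores"])]
print(filtered)
```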

Trajectory Generation (Generation)

Execute tasks using teacher agents against live MCP servers to record full interaction traces

Model or implementation: GPT-OSS-120B, Kimi-K2, Qwen3-32B (via Qwen-agent and OpenAI-agent frameworks)

Post-Filtering (Filtering)

Validate tool execution success, sequence correctness, and answer quality

Model or implementation: Rule-based scripts + GPT-OSS-120B (Judge)
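The rule-based half of this stage can be sketched as a few structural checks; the LLM-judge half is omitted here. The specific rules (every called tool must exist, at least one call must succeed, the final response must be non-empty) and the dictionary keys are assumptions chosen for illustration.

```python
def passes_rule_checks(trajectory: dict, available_tools: set[str]) -> bool:
    """Structural checks applied before any LLM judge sees the trajectory."""
    steps = trajectory.get("steps", [])
    if not steps:
        return False  # no tool was ever called
    if any(step["tool"] not in available_tools for step in steps):
        return False  # the agent called a tool the server does not offer
    if not any(not step.get("is_error", False) for step in steps):
        return False  # every call failed
    return bool(trajectory.get("final_response", "").strip())


tools = {"get_current_weather", "get_forecast"}
good = {
    "steps": [
        {"tool": "get_current_weather", "is_error": True},   # failed first attempt
        {"tool": "get_current_weather", "is_error": False},  # successful retry
    ],
    "final_response": "Light rain, 12 C.",
}
hallucinated = {"steps": [{"tool": "get_air_quality", "is_error": False}], "final_response": "AQI is 40."}
print(passes_rule_checks(good, tools), passes_rule_checks(hallucinated, tools))
```

Under these assumed rules a trajectory that recovers from an error is kept, which preserves the failure-handling behavior the dataset is meant to teach.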

Novel Architectural Elements

Integration of live MCP (Model Context Protocol) servers directly into the data generation loop to provide real execution feedback
Three-stage extension mechanism (Irrelevance, Persona-based, Multi-turn self-simulation) to diversify core trajectories
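As an illustration of the irrelevance extension, one plausible construction is to pair a query with a tool set that no longer contains what it needs, so the correct behavior is to decline. This is a sketch under that assumption; the function name, the `expected_behavior` label, and the example tools are hypothetical, and the paper's exact procedure may differ.

```python
def make_irrelevance_sample(query: str, tools: list[dict], required: set[str]) -> dict:
    """Strip the tools a query depends on so that it becomes unsolvable."""
    remaining = [tool for tool in tools if tool["name"] not in required]
    return {
        "query": query,
        "tools": remaining,
        # Training target: the agent should say it cannot help with these tools.
        "expected_behavior": "decline",
    }


all_tools = [
    {"name": "get_current_weather"},
    {"name": "search_papers"},
    {"name": "read_file"},
]
sample = make_irrelevance_sample(
    "Do I need an umbrella in Seattle?",
    all_tools,
    required={"get_current_weather"},
)
print([tool["name"] for tool in sample["tools"]], sample["expected_behavior"])
```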

Modeling

Base Model: Evaluated on multiple architectures (e.g., Llama, Qwen); paper focuses on data generation pipeline rather than a specific model architecture

Training Method: Supervised Fine-Tuning (SFT) on Toucan dataset

Training Data:

1.5 million trajectories total
Generated from ~500 real-world MCP servers
Includes Single-turn, Multi-turn, and Edge-case subsets

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolLLM: Toucan uses real MCP server execution for ground truth responses vs. LLM simulation
vs. APIGen: Toucan scales to 1.5M trajectories with multi-turn interactions vs. APIGen's focus on single-turn diversity
vs. Glaive [not cited in paper]: Toucan focuses on general tool use via MCP vs. Glaive's focus on specific function calling tasks

Limitations

Dependency on the availability and stability of remote MCP servers during generation
Potential bias from the specific teacher models (GPT-OSS, Kimi, Qwen) used for synthesis
Requires filtering of servers requiring credentials, which may exclude high-value enterprise tools
No specific training compute or cost details reported for the fine-tuning experiments

Reproducibility

The paper describes the pipeline in detail. The dataset is named 'Toucan' and described as publicly available, though the text summarized here does not give a URL. MCP server sources are public (Smithery, GitHub). Prompts are provided in Appendix D.

📊 Experiments & Results

Evaluation Setup

Fine-tuning models on Toucan data and evaluating on agentic benchmarks

Benchmarks:

BFCL V3 (Function Calling Accuracy)
MCP-Universe (Realistic MCP Tool Execution)
τ-Bench (Tau-Bench) (User-Agent-Tool Interaction)
τ²-Bench (Complex Agentic Reasoning)

Metrics:

Function Calling Accuracy
Tool Selection Efficiency
Execution Fidelity
Multi-turn Reasoning Success

Statistical methodology: Not explicitly reported in the paper

Main Takeaways

Training on real execution data (MCP) significantly improves robustness compared to simulated data
Multi-turn data generation is critical for performance on conversational benchmarks like τ-Bench
Toucan-trained models push the Pareto frontier on MCP-Universe, balancing accuracy and efficiency better than baselines

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM tool use / function calling
Familiarity with agentic workflows (Reason-Act loops)
Basic knowledge of synthetic data generation pipelines

Key Terms

MCP: Model Context Protocol, an open standard that enables LLMs to connect to external data and tools (servers) via a standardized interface

Trajectory: A sequence of steps taken by an agent, typically including reasoning (thought), a tool call, the tool's output, and a final answer

SFT: Supervised Fine-Tuning, training a pre-trained model on a specific dataset to improve its performance on a target task

BFCL: Berkeley Function Calling Leaderboard, a benchmark for evaluating the ability of LLMs to invoke software functions correctly

Edge case: Rare or difficult scenarios, such as when a requested tool is unavailable or returns an error

Hallucination: When an LLM generates incorrect or fabricated information, such as inventing a tool that doesn't exist

Pareto frontier: The set of optimal solutions where no improvement can be made to one objective without degrading another; used here to describe trade-offs in benchmark performance
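The Pareto frontier definition can be made concrete with a short sketch. It assumes two objectives, accuracy (higher is better) and cost (lower is better); the model names and numbers are made up for illustration and are not results from the paper.

```python
def pareto_frontier(points: list[tuple[str, float, float]]) -> list[str]:
    """Return names of points not dominated on (accuracy up, cost down)."""
    frontier = []
    for name, acc, cost in points:
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for _, a, c in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# (name, accuracy, cost) with invented values.
models = [
    ("A", 0.60, 10.0),
    ("B", 0.55, 12.0),  # worse than A on both axes, so dominated
    ("C", 0.70, 20.0),
    ("D", 0.50, 5.0),
]
print(pareto_frontier(models))  # ['A', 'C', 'D']
```

A model "pushes the frontier" when adding it to such a set makes it non-dominated and knocks at least one previous frontier point off.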