ToolACE-MT: Non-Autoregressive Generation for Agentic Multi-Turn Interaction

📝 Paper Summary

Synthetic data generation for agents Multi-turn agentic interaction

ToolACE-MT generates high-quality multi-turn agentic dialogue data efficiently by first creating a coarse trajectory skeleton and then iteratively refining it with complexity injections, avoiding costly autoregressive multi-agent simulations.

Core Problem

Existing methods for generating multi-turn agentic data rely on autoregressive multi-agent simulations (MAS), which are computationally expensive, hard to control for complexity, and prone to error accumulation due to lack of global context.

Why it matters:

High-quality multi-turn data is essential for training agents to handle complex real-world tasks involving partial observability and dependent tool calls
Autoregressive generation is slow and costly because every turn requires a new inference step based on growing context
Assistants in standard simulations lack holistic awareness of the full task plan, leading to inconsistencies and factual errors in long horizons

Concrete Example: In a standard multi-agent simulation, an assistant might call a tool to book a flight without realizing the return date in a later subtask makes the itinerary impossible, because it generates one step at a time. ToolACE-MT plans the full skeleton first, ensuring the dates align before filling in the dialogue.

Key Novelty

Non-Autoregressive Iterative Generation for Agentic Data

Decouples structure from content: Generates a complete dialogue skeleton (user tasks + tool actions) first, then fills in natural language details, unlike standard methods that generate them simultaneously turn-by-turn
Iterative Refinement via Mask-and-Fill: Systematically injects complexity (e.g., user errors, clarifications) into the skeleton by masking specific turns and regenerating them, similar to non-autoregressive translation methods
Global planning consistency: By generating the action trajectory upfront based on a plan, the assistant's behavior remains consistent across long horizons

Architecture

The overall workflow of ToolACE-MT, illustrating the three stages: Initialization, Iterative Refinement, and Offline Verification.

Evaluation Highlights

Models trained on ToolACE-MT data outperform those trained on autoregressive MAS data on benchmarks like BFCL-v3, τ-bench, and ACEBench
Efficient scaling: The iterative refinement process allows flexible complexity scaling without the linear cost increase of full autoregressive regeneration
Data analysis confirms the generation pipeline produces diverse and valid agentic trajectories suitable for training tool-use LLMs

Breakthrough Assessment

8/10

Offers a significant paradigm shift from expensive autoregressive simulation to efficient non-autoregressive generation for agentic data, addressing key bottlenecks in cost and controllability.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) where an assistant interacts with a user/environment to solve multi-step tasks

Inputs: A set of user tasks U and a tool library

Outputs: A multi-turn conversational trajectory C consisting of alternating user observations/queries and assistant actions/responses

Pipeline Flow

Task Initialization (Tool sampling & Subtask planning)
Trajectory Skeleton Generation (Action sequence creation)
Iterative Refinement (Complexity injection & Reasonability checks)
Offline Verification (Rule & Model-based filtering)

System Modules

Task Initializer (Initialization)

Samples tools and generates a high-level plan consisting of subtasks, required tools, and step counts

Model or implementation: GPT-4o-2024-11-20

Skeleton Generator (Initialization)

Generates the initial coarse conversational trajectory by composing subtask trajectories sequentially

Model or implementation: GPT-4o-2024-11-20

Refiner (Complexity Injection) (Iterative Refinement)

Injects realistic difficulties (clarifications, tool errors, unsupported tasks) via mask-and-extend operations

Model or implementation: GPT-4o-2024-11-20

Refiner (Reasonability) (Iterative Refinement)

Enhances logical consistency and coherence by randomly masking and regenerating turns

Model or implementation: GPT-4o-2024-11-20 (Generator & Judger)

Verifier

Filters invalid trajectories using rules and model-based checks

Model or implementation: Hybrid (Rule-based + LLM checking experts)

Novel Architectural Elements

Non-autoregressive turn-level generation pipeline: Separating trajectory planning (skeleton) from content generation (refinement)
Iterative Mask-and-Extend refinement strategy applied to agentic dialogue trajectories

Modeling

Base Model: LLaMA3.1-8B-Instruct (base model for training experiments)

Training Method: Supervised Fine-Tuning (SFT) on synthesized data

Training Data:

8000 training instances constructed via ToolACE-MT
8000 training instances constructed via MAS (baseline)

Compute: Not reported in the paper

Comparison to Prior Work

vs. MAS: ToolACE-MT generates the whole skeleton first then refines, whereas MAS generates turn-by-turn autoregressively
vs. MAS: ToolACE-MT allows explicit control over task complexity via injection modules, whereas MAS complexity is implicit
vs. Standard SFT: ToolACE-MT synthesizes complex multi-turn data rather than using existing static datasets
+ 1 more
vs. Prabhakar et al. (2025): ToolACE-MT uses non-autoregressive refinement for the full trajectory, while Prabhakar et al. use MAS in their second stage [not cited in paper as direct baseline, but discussed in related work]

Limitations

Depends on a strong proprietary model (GPT-4o) for the generation and refinement stages
Skeleton initialization may still miss subtle dependencies that only emerge during full dialogue generation
Offline verification cannot catch all semantic inconsistencies, especially in very long contexts

Reproducibility

The paper does not explicitly provide a code URL or repository. The generation model used is GPT-4o-2024-11-20. The base model for training is LLaMA3.1-8B-Instruct. Benchmarks used (BFCL-v3, τ-Bench, ACEBench) are public.

📊 Experiments & Results

Evaluation Setup

Fine-tuning LLaMA3.1-8B-Instruct on synthesized data and evaluating on agentic benchmarks

Benchmarks:

BFCL-v3 (Function Calling Leaderboard)
τ-Bench (Agentic Multi-turn Interaction)
ACEBench (Agentic Capability Evaluation)

Metrics:

Performance metrics specific to each benchmark (likely Accuracy/Success Rate, though explicit metric names for results are general 'outperform')
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
The paper states that models trained with ToolACE-MT data outperform those trained with MAS data, but does not provide specific numeric tables in the provided text snippet. The snippet claims 'Experiments demonstrate that ToolACE-MT enables efficient, effective and generalizable agentic data generation' and mentions specific benchmarks, but the actual result numbers are not in the provided content.

Experiment Figures

Comparison between Autoregressive Multi-Agent Simulation (MAS) and ToolACE-MT's Non-Autoregressive approach.

Illustration of the Iterative Refinement process using Mask-and-Fill.

Main Takeaways

Models trained on ToolACE-MT data outperform baselines trained on autoregressive MAS data across multiple benchmarks (BFCL-v3, τ-Bench, ACEBench).
The iterative refinement strategy effectively increases data complexity and quality, contributing to better downstream model performance.
The method is generalizable across different backbone models (e.g., LLaMA, Qwen - mentioned as 'more experiments show generalizability').
ToolACE-MT provides a more controllable generation process compared to implicit complexity in MAS.

📚 Prerequisite Knowledge

Prerequisites

Agentic AI and Tool Use (Function Calling)
Non-Autoregressive Generation (NAT)
Multi-Agent Simulation (MAS) for data synthesis

Key Terms

POMDP: Partially Observable Markov Decision Process—a mathematical framework for modeling decision-making where the agent cannot directly observe the full state of the environment

Non-Autoregressive Generation: Generating a sequence (like a sentence or dialogue) in parallel or iteratively, rather than one token/turn strictly after another

Mask-and-Fill: A technique where parts of a sequence are hidden (masked) and a model predicts the missing content, used here to refine dialogue turns

MAS: Multi-Agent Simulation—using multiple LLMs playing different roles (user, assistant, tool) to generate conversation data by interacting with each other

Trajectory Skeleton: A structural outline of a dialogue containing the sequence of actions and observations but lacking full natural language detail

BFCL: Berkeley Function Calling Leaderboard—a benchmark for evaluating the ability of LLMs to call functions correctly

ACEBench: A benchmark for evaluating agentic capabilities

Autoregressive: Generating output one step at a time, where each step depends on the previous ones