Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation

📝 Paper Summary

Multi-turn with user interactions, Tool-use post-training

Magnet synthesizes high-quality multi-turn tool-use training data by traversing function dependency graphs and applying node operations (Insert, Merge, Split) to simulate complex conversational challenges.

Core Problem

Current LLMs struggle with complex multi-turn function calling interactions, specifically handling nested calls, long-term dependencies, and missing information, due to a lack of high-quality training trajectories.

Why it matters:

Existing public models achieve only ~10% success rates on complex multi-turn benchmarks compared to nearly 50% for proprietary models.
Simple back-translation methods for data synthesis fail to capture the structural complexity of real-world multi-turn dialogues (e.g., clarification questions, nested dependencies).
Training data scarcity limits the ability of open-weights models to bridge the gap with frontier models in agentic tasks.

Concrete Example: A user asks to 'check distance from SF to San Mateo in km'. A standard model might call 'get_distance' returning miles but fail to call 'convert_unit'. Magnet introduces 'Insert' operations to force the generation of nested calls where the second function (conversion) is implicit.
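The nested-call scenario can be made concrete with two toy tools. This is an illustrative sketch only: `get_distance` and `convert_unit` are the names used in the example above, but their signatures, return format, and the hard-coded distance are assumptions.

```python
# Toy stand-ins for the two tools; signatures and values are assumptions.
MILES_TO_KM = 1.609344

def get_distance(origin: str, destination: str) -> dict:
    """Pretend tool that always answers in miles."""
    fake_table = {("SF", "San Mateo"): 20.0}  # made-up distance
    return {"value": fake_table[(origin, destination)], "unit": "mi"}

def convert_unit(value: float, from_unit: str, to_unit: str) -> dict:
    """Pretend converter that supports only miles -> kilometres."""
    if (from_unit, to_unit) != ("mi", "km"):
        raise ValueError("unsupported conversion")
    return {"value": value * MILES_TO_KM, "unit": to_unit}

# Failure mode: stop after the first call and answer in the wrong unit.
incomplete = get_distance("SF", "San Mateo")

# Nested call: the output of the first tool feeds the second, implicit one.
d = get_distance("SF", "San Mateo")
nested = convert_unit(d["value"], d["unit"], "km")
```

The user never mentions a conversion tool, so the second call has to be inferred from the requested unit. That implicit step is what the Insert operation is described as targeting.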

Key Novelty

Graph-based Multi-turn Data Synthesis with Context Distillation

Models function interactions as a 'local dependency graph' where edges represent input/output dependencies, allowing random walks to form realistic multi-turn function sequences.
Applies graph node operations (Insert, Merge, Split) to explicitly manufacture difficult scenarios like nested calls, parallel execution, and missing parameters.
Uses a teacher model to distill reasoning into positive trajectories (via correct hints) and negative trajectories (via misleading hints based on known error patterns) for preference optimization.
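As a rough illustration of the first point, the sketch below runs a random walk over a tiny hand-written dependency graph to obtain a function sequence. The graph contents, the function names, and the `random_walk` helper are hypothetical; the paper's actual sampling procedure may differ.

```python
import random

# Hypothetical toy graph: an edge u -> v means u's output can feed v's input.
GRAPH = {
    "search_flights": ["book_flight"],
    "book_flight": ["get_invoice", "cancel_booking"],
    "get_invoice": [],
    "cancel_booking": [],
}

def random_walk(graph, start, max_len, rng):
    """Follow outgoing edges from `start` until a dead end or `max_len` nodes."""
    path = [start]
    while len(path) < max_len and graph[path[-1]]:
        path.append(rng.choice(graph[path[-1]]))
    return path

rng = random.Random(0)  # seeded so the walk is reproducible
path = random_walk(GRAPH, "search_flights", max_len=3, rng=rng)
```

Because every step follows an edge, each function in the sampled path can consume something produced by its predecessor, which is what makes the resulting multi-turn sequence plausible.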

Architecture

The complete Magnet pipeline: from Graph Construction and Random Walk to Node Operations, Back-and-Forth Translation, and final Trajectory Synthesis/Distillation.

Evaluation Highlights

Magnet-14B-mDPO achieves 68.01% success rate on the Berkeley Function Calling Leaderboard (BFCL-v3), surpassing its teacher model Gemini-1.5-pro-002 (66.09%).
+32.5 point improvement over the base Qwen2.5-Coder-14B-Instruct model on BFCL-v3 multi-turn test cases.
Achieves 73.30% on the ToolQuery benchmark, outperforming the teacher model (71.70%), which suggests the gains generalize beyond BFCL-v3.

Breakthrough Assessment

8/10

Significant methodology for synthetic data generation that pushes open models past proprietary teachers on complex benchmarks. The graph-based construction addresses structural reasoning gaps efficiently.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn function calling (tool use) in conversational agents

Inputs: User query sequence Q and available Function Library F

Outputs: Sequence of model actions A (reasoning and function calls) and tool outputs T

Pipeline Flow

Function Graph Construction: Build local dependency graphs connecting relevant APIs
Path Sampling & Augmentation: Random walk to get initial path -> Apply Node Ops (Insert/Merge/Split)
Back-and-Forth Translation: Convert Function Signature Paths (FSP) to Queries and Executable Calls
Trajectory Synthesis: Teacher generates positive/negative conversation traces using hints
Training: SFT followed by mDPO

System Modules

Graph Constructor (Data Synthesis)

Organize 5,011 APIs into dependency graphs where edges represent input-output compatibility

Model or implementation: Gemini-1.5-pro-002 (Assistant)
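To show what "input-output compatibility" could mean in code, here is a minimal sketch that adds an edge whenever one function's output type tag appears among another's input type tags. This exact-match rule is a simplifying assumption standing in for the LLM-assisted step listed above; `FunctionSpec`, the type tags, and `set_navigation` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FunctionSpec:
    name: str
    inputs: frozenset  # type tags the function consumes (assumed labels)
    output: str        # type tag of the value it returns (assumed label)

def build_dependency_graph(specs):
    """Add edge u -> v when u's output tag is among v's input tags."""
    graph = {s.name: [] for s in specs}
    for u in specs:
        for v in specs:
            if u.name != v.name and u.output in v.inputs:
                graph[u.name].append(v.name)
    return graph

specs = [
    FunctionSpec("get_distance", frozenset({"place"}), "miles"),
    FunctionSpec("convert_unit", frozenset({"miles"}), "kilometres"),
    FunctionSpec("set_navigation", frozenset({"place"}), "route"),
]
graph = build_dependency_graph(specs)
```

The pairwise loop is quadratic in the number of functions, which is one way to read the graph-construction overhead mentioned under Limitations.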

Node Operator (Data Synthesis)

Modify function paths to inject complexity: Insert (nested/implicit calls), Merge (parallel calls), Split (missing info)

Model or implementation: Algorithmic / Heuristic
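The three operations can be sketched on a simple path representation. Here a function path is modeled as a list of turns, each turn listing the functions to call in that turn. This representation, the helper signatures, and the function names are assumptions for illustration, not the paper's data format.

```python
# A path is a list of turns; each turn lists the functions called in it.

def insert(fsp, turn_idx, new_fn):
    """Insert: add an implicit follow-up call to the same turn (nested call)."""
    out = [list(t) for t in fsp]
    out[turn_idx].append(new_fn)
    return out

def merge(fsp, turn_idx):
    """Merge: fold the next turn into this one so one query needs several calls."""
    out = [list(t) for t in fsp]
    following = out.pop(turn_idx + 1)
    out[turn_idx].extend(following)
    return out

def split(fsp, turn_idx):
    """Split: precede a turn with an empty one, where information is missing
    and the assistant should ask for clarification instead of calling."""
    out = [list(t) for t in fsp]
    out.insert(turn_idx, [])
    return out

fsp = [["get_distance"], ["set_navigation"], ["book_hotel"]]
nested = insert(fsp, 0, "convert_unit")
merged = merge(fsp, 1)
clarified = split(fsp, 0)
```

Each helper copies the path before editing it, so several augmented variants can be derived from the same sampled walk.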

Trajectory Generator (Data Synthesis)

Generate full conversation history (User Query, Model Action, Tool Output)

Model or implementation: Gemini-1.5-pro-002 (Teacher)
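One way to picture hint-conditioned generation is a prompt builder that appends either a correct or a misleading hint before the teacher responds. The template, the `[hint]` marker, and the hint wording below are hypothetical and are not the prompts from the paper's appendix. In context distillation the hint is typically shown to the teacher only and left out of the student's training input.

```python
def build_teacher_prompt(history, query, hint):
    """Assemble a teacher prompt from prior turns, the new query, and a hint."""
    lines = [f"{role}: {text}" for role, text in history]
    lines.append(f"user: {query}")
    lines.append(f"[hint] {hint}")  # visible to the teacher only (assumption)
    return "\n".join(lines)

query = "Check the distance from SF to San Mateo in km."

# A correct hint steers the teacher toward a positive trajectory.
positive_prompt = build_teacher_prompt(
    [], query, "Call get_distance, then convert_unit on its result."
)

# A misleading hint mimics a known error pattern to yield a negative trajectory.
negative_prompt = build_teacher_prompt(
    [], query, "Call get_distance only and report its output directly."
)
```

Pairing the two teacher responses for the same query would give one preference pair of the kind used for mDPO.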

Student Agent

Execute function calling in multi-turn conversations

Model or implementation: Qwen2.5-Coder-14B / 32B

Novel Architectural Elements

Graph-based FSP construction: structuring function calls as graph traversals to model dependencies explicitly
Node operations (Insert, Merge, Split) acting as semantic modifiers to create specific conversational challenges (nested, parallel, ambiguity)

Modeling

Base Model: Qwen2.5-Coder-14B-Instruct (and 32B variant)

Training Method: Supervised Fine-Tuning (SFT) followed by Multi-turn Direct Preference Optimization (mDPO)

Objective Functions:

SFT loss

Purpose: Maximize likelihood of correct actions in positive trajectories.

Formally: $\mathcal{L}_{\text{SFT}}(x; \tau_w)$

mDPO loss

Purpose: Optimize preference for positive over negative trajectories while staying close to reference.

Formally: $\mathcal{L}_{\text{mDPO}} = -\log \sigma\Big(\eta \Big[\sum_{\text{positive}} \log \frac{\pi_\theta}{\pi_{\text{ref}}} - \sum_{\text{negative}} \log \frac{\pi_\theta}{\pi_{\text{ref}}}\Big]\Big)$
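The mDPO objective above can be evaluated numerically for a single preference pair. The sketch below assumes the per-turn log-ratios log(pi_theta/pi_ref) have already been computed. The function name, the example numbers, and the value of `eta` are made up for illustration.

```python
import math

def mdpo_loss(pos_logratios, neg_logratios, eta):
    """Return -log(sigmoid(eta * (sum(pos) - sum(neg)))) for one pair.

    Each entry is log pi_theta - log pi_ref for one assistant turn.
    """
    margin = eta * (sum(pos_logratios) - sum(neg_logratios))
    # -log(sigmoid(m)) equals softplus(-m); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Made-up log-ratios for a two-turn positive and negative trajectory.
loss = mdpo_loss([0.5, 0.3], [-0.2, -0.4], eta=0.1)
```

When the policy equals the reference on both trajectories the margin is zero and the loss is log 2. The loss falls as the policy raises the likelihood of the positive trajectory relative to the negative one.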

Trainable Parameters: Full fine-tuning

Training Data:

34,000 SFT trajectories
4,556 preference pairs for mDPO
Data includes single-turn, multi-turn, and irrelevant function scenarios

Key Hyperparameters:

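lambda: Weight for the mDPO loss (symbolic in the equation; value not stated explicitly in the text)
eta: Scaling coefficient applied to the positive-minus-negative log-ratio difference inside the mDPO sigmoid (symbolic in the equation)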

Compute: Not reported in the paper

Comparison to Prior Work

vs. APIGen: Magnet uses graph-based sampling and node operations for multi-turn dependency modeling rather than independent single-turn verification.
vs. Toolbench: Magnet employs context distillation with positive/negative hints for DPO, rather than just SFT on successful paths.
vs. Toolformer: Magnet targets complex multi-turn/nested interactions explicitly via graph operations, not just single API insertion.

Limitations

Dependency on proprietary teacher model (Gemini-1.5-pro-002) for data synthesis quality.
Graph construction requires initial overhead of identifying dependencies between thousands of APIs.
Performance gains might be bounded by the teacher model's capability in the distillation phase (though student surpassed teacher here).

Reproducibility

Code is not provided. The API pool is derived from StableToolBench and BFCL-v3. Prompts for synthesis are provided in Appendix A.

📊 Experiments & Results

Evaluation Setup

Function calling capability in multi-turn and complex scenarios

Benchmarks:

Berkeley Function Calling Leaderboard (BFCL-v3): comprehensive function calling (single-turn, multi-turn, nested)
ToolQuery: complex intent queries requiring reasoning

Metrics:

Success Rate (Acc)
AST (Abstract Syntax Tree) Match
Statistical methodology: Not explicitly reported in the paper

Key Results

Magnet-14B-mDPO achieves state-of-the-art results among open models on BFCL-v3, significantly improving over its base model and surpassing its teacher.

Benchmark	Metric	Baseline	This Paper	Δ
BFCL-v3 (Overall)	Success Rate (%)	66.09 (teacher)	68.01	+1.92
BFCL-v3 (Multi-turn)	Success Rate (%)	13.64 (base model)	46.14	+32.50
ToolQuery	Success Rate (%)	71.70 (teacher)	73.30	+1.60

Ablation studies confirm the value of mDPO and the graph-based data synthesis components.

Benchmark	Metric	Baseline	This Paper	Δ
BFCL-v3 (Overall)	Success Rate (%)	66.30	68.01	+1.71

Experiment Figures

Illustrates the three main challenges in multi-turn function calling (Nested FCs, Long Dependency, Irrelevance) and how Magnet addresses them.

Main Takeaways

Graph-based synthesis effectively targets multi-turn weaknesses: The specific node operations (Insert, Merge, Split) correlate with large gains in multi-turn performance (+32.5 points over the base model).
Student surpasses Teacher: The 14B student model outperforms the Gemini-1.5-pro-002 teacher, suggesting that the curated synthetic data removes noise or better aligns the model with the task format.
mDPO adds value over SFT: Direct Preference Optimization further refines the model, particularly when negative trajectories are constructed from realistic hard negatives (mistakes made by the SFT model).

📚 Prerequisite Knowledge

Prerequisites

Function Calling / Tool Use paradigms
Direct Preference Optimization (DPO)
Synthetic Data Generation via LLMs
Graph Theory basics (Nodes, Edges, Random Walks)

Key Terms

FSP: Function Signature Path, a sequence of function names and their documentation representing the ground-truth plan for a multi-turn interaction

mDPO: Multi-turn Direct Preference Optimization, a preference-optimization technique that trains a model on preference pairs (positive vs. negative trajectories) spanning multiple conversational turns

SFT: Supervised Fine-Tuning—training a model on labeled examples (queries and correct actions) to establish baseline capability

local dependency graph: A graph structure where functions are nodes and directed edges exist if the source function's output can serve as the target function's input

BFCL-v3: Berkeley Function Calling Leaderboard version 3—a comprehensive benchmark for evaluating LLM tool-use capabilities

context distillation: A technique to transfer knowledge from a teacher to a student by conditioning the teacher on 'hints' (e.g., ground truth function calls) to generate high-quality responses

back-and-forth translation: An iterative process where function signatures are translated into user queries (back) and then queries are translated into executable function calls (forth)