Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text

📝 Paper Summary

Synthetic data generation for agents Tool-use post-training

GEM extracts multi-turn tool-use trajectories from raw text corpora by identifying implicit workflows, synthesizing corresponding tools, and generating grounded user-agent interactions without requiring predefined APIs.

Core Problem

Training autonomous agents requires diverse, realistic multi-turn tool-use data, but existing methods rely on expensive, limited sets of predefined APIs, restricting generalization.

Why it matters:

Real-world tool-use trajectories are scarce and hard to collect manually.
Simulation methods based on fixed API sets fail to cover the broad range of scenarios needed for agents to generalize to unseen environments.
Current LLMs struggle with realistic multi-turn interactions involving ambiguous instructions or long-context dependencies.

Concrete Example: A raw text about 'hospital reimbursement claims' contains an implicit procedure (step-by-step logic) and implicit tools (forms, submissions) but is not structured as an agent trajectory. Existing methods miss this data source, whereas GEM extracts the workflow to create a simulated reimbursement agent.

Key Novelty

Text-Based Extraction Paradigm (GEM Pipeline)

Treats raw text corpora as a source of implicit 'experience' containing procedural knowledge, rather than just knowledge facts.
Synthesizes *both* the tools (APIs) and the interaction trajectories simultaneously from text, bypassing the need for a pre-existing tool library.
Distills the multi-stage generation pipeline into a single 'Trajectory Synthesizer' model that converts text to agent data end-to-end.

Architecture

The GEM data synthesis pipeline, illustrating the flow from raw text to validated trajectories.

Evaluation Highlights

+16.5% improvement on the BFCL V3 Multi-turn benchmark using GEM-32B compared to baselines.
Achieves comparable performance on Tau-bench (Airline and Retail) using out-of-domain synthetic data as models trained on in-domain data, showing strong generalization.
The distilled Trajectory Synthesizer matches the quality of the full multi-stage pipeline while significantly reducing inference costs.

Breakthrough Assessment

8/10

Proposes a significant paradigm shift from tool-centered simulation (needing fixed APIs) to text-centered extraction (creating APIs from text). The generalization results on Tau-bench are particularly impressive.

⚙️ Technical Details

Problem Definition

Setting: Synthesizing a tool set P and a multi-turn trajectory T = {system_prompt, (user, assistant, observation)...} from an unstructured text segment c.

Inputs: Raw, unstructured text segment c from a large corpus (e.g., UltraFineWeb) containing multi-step workflows.

Outputs: A list of tool definitions P and a structured multi-turn conversation T reflecting the workflow in the text.

Pipeline Flow

Selection (Filtering) -> Extraction (Workflow/Tool) -> Generation (Trajectory) -> Refinement (Complexity) -> Validation
Distilled Synthesizer: Input Text -> [Trajectory Synthesizer Model] -> Tool Defs + Trajectory

System Modules

Selection (Relevance Filtering) (Data Processing)

Identify text segments containing multi-step operational procedures.

Model or implementation: Qwen2.5-72B-Instruct (implied, or similar strong model for annotation)

Extraction (Workflow & Tool) (Data Processing)

Extract structured abstract workflows and design API tools based on the text.

Model or implementation: GLM-4

Generation (Data Processing)

Generate the initial multi-turn user-agent conversation based on workflows and tools.

Model or implementation: GLM-4

Refinement (Data Processing)

Enhance trajectory complexity (ambiguity, variety).

Model or implementation: Not explicitly specified, likely GLM-4

Validation (Data Processing)

Verify structural correctness and check for hallucinations.

Model or implementation: Rule-based scripts + Qwen2.5-32B-Instruct (Judge)

Novel Architectural Elements

Inversion of the standard agent data pipeline: Instead of starting with tools and simulating tasks, GEM starts with text and synthesizes *both* tools and tasks.
Distilled Trajectory Synthesizer: A single model trained to replace the 4-stage pipeline for efficient inference.

Modeling

Base Model: Qwen2.5-32B-Instruct and Qwen2.5-7B-Instruct (referred to as Qwen3 in paper text, likely a typo for Qwen2.5 based on context or a very new model)

Training Method: Supervised Fine-Tuning (SFT)

Adaptation: Full parameter fine-tuning

Trainable Parameters: Full model

Training Data:

Source: UltraFineWeb
Size: 10K synthetic trajectories generated by GLM-4 via GEM pipeline

Key Hyperparameters:

learning_rate: 5e-6
epochs: 2
batch_size: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. APIGEN-MT: GEM synthesizes tools from text rather than using predefined APIs, allowing for greater domain diversity.
vs. Simia: GEM achieves comparable performance using out-of-domain text data, whereas Simia relies on in-domain data.
vs. ToolBench [not cited in paper]: GEM derives tools implicitly from narrative text rather than collecting real APIs.

Limitations

Reliance on the quality of the 'teacher' model (GLM-4) for initial synthesis; biases in the teacher may propagate.
The refinement stage increases complexity but may introduce noise if not perfectly validated.
Evaluation is primarily on English benchmarks; multilingual performance is not explored.
The 'Qwen3' model name usage is ambiguous (likely refers to Qwen2.5 or a specific internal version).

Reproducibility

Code: https://github.com/RUC-GSAI/GEM

Code is publicly available at https://github.com/RUC-GSAI/GEM. The paper uses UltraFineWeb for source text. 10K synthetic trajectories were generated. Specific prompt templates are in Appendix A.

📊 Experiments & Results

Evaluation Setup

Evaluated on multi-turn tool-use benchmarks measuring function calling accuracy and end-to-end task completion.

Benchmarks:

BFCL V3 (Multi-turn function calling (Python-based API environment))
Tau-bench (Complex user-agent interaction in Retail and Airline domains)

Metrics:

Avg@4 (Tau-bench)
Pass@4 (Tau-bench)
Accuracy (BFCL)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
GEM models significantly outperform baselines on the BFCL V3 Multi-Turn benchmark, demonstrating robust function-calling capabilities.
BFCL V3 Multi-Turn	Average Score	46.22	61.16	+14.94
BFCL V3 Multi-Turn	Average Score	82.38	61.16	-21.22
On Tau-bench, GEM demonstrates strong generalization, performing comparably to models trained on in-domain data despite being trained on general text.
Tau-bench (Retail)	Avg@4	48.1	51.9	+3.8
Tau-bench (Airline)	Avg@4	47.8	48.2	+0.4

Experiment Figures

Comparison between the traditional Tool-Centered Simulation paradigm and the proposed Text-Based Extraction paradigm.

An example of a raw text segment and how its components (User Query, Environmental Tools, Workflow) map to agent concepts.

Main Takeaways

Text-based synthesis enables agents to generalize well to unseen domains (Tau-bench) without seeing domain-specific training data.
The refinement stage in the pipeline is crucial; raw generated trajectories are often too simple to provide robust training signal.
A distilled 'Synthesizer' model can effectively clone the multi-stage pipeline's capability, offering a scalable path for data generation.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and prompting strategies
Tool-use/Function-calling in AI agents
Supervised Fine-Tuning (SFT) data construction

Key Terms

GEM: General-purpose Extraction of Multi-turn trajectories—the proposed pipeline for synthesizing agent data from text.

BFCL: Berkeley Function-Calling Leaderboard—a benchmark for evaluating LLM tool-use capabilities.

Tau-bench: A benchmark evaluating agents in realistic, complex domains like Airline and Retail with user simulators.

SFT: Supervised Fine-Tuning—training a model on labeled examples to adapt it to a specific task.

SFT warm-start: Initial training using supervised data before applying other optimization techniques (though primarily SFT is used here).

Trajectory Synthesizer: A specialized model trained to convert text directly into tool-use trajectories, bypassing the multi-step pipeline during inference.

UltraFineWeb: A large-scale, high-quality open web dataset used as the source corpus for text segments.

GLM-4: A strong Large Language Model used as the 'teacher' to generate initial synthetic data.

Hallucination: When a model generates content (like tool parameters) not supported by the context or facts.