DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use

📝 Paper Summary

Synthetic Data Generation Agentic Tool Use Generalization

Dive improves agent generalization by inverting the synthesis process: it executes diverse real-world tools first to generate grounded evidence, then reverse-derives verifiable tasks from the resulting traces.

Core Problem

Current tool-using agents struggle to generalize to new tasks and toolsets because synthetic training data is confined to narrow templates and fixed tool combinations (e.g., only web search).

Why it matters:

Agents trained on rigid routines (e.g., search-browse loops) fail when faced with open-ended diversity in real-world deployments
Existing synthesis methods cannot scale diversity without sacrificing validity: manual pipelines are costly, while simulated environments often yield unverifiable or unsolvable tasks

Concrete Example: An agent trained primarily on web search tasks may over-rely on a 'search-then-browse' routine. When asked to perform clinical diagnosis using a specific 'PatientLookup' tool, it fails to adapt its pattern, leading to negative transfer.

Key Novelty

Evidence-Driven Inverted Synthesis

Inverts the standard 'Query first, Check later' synthesis order. Instead, it executes random tool combinations first to create a valid 'evidence' trace, then writes a question that this trace answers.
Ensures 'grounding by construction': because the task is derived from an actual successful tool execution, the task is guaranteed to be solvable and verifiable.

Evaluation Highlights

+22 points average improvement across 9 Out-of-Distribution (OOD) benchmarks over baselines when Qwen3-8B is trained on Dive data
Outperforms the strongest 8B baseline by +68% on tool-use generalization tasks
Diversity scaling proves more effective than quantity scaling: Dive data achieves better OOD generalization even with 4x less data than quantity-focused baselines

Breakthrough Assessment

9/10

The 'inverted synthesis' approach elegantly resolves the tension between diversity and executability in synthetic data. The significant gains (+68%) suggest this is a major step forward for generalizable agents.

⚙️ Technical Details

Problem Definition

Setting: Sequential decision process where an agent policy generates thoughts and actions (tool calls) to solve a task query using a specific toolset

Inputs: Task query Q and a toolset T

Outputs: A trajectory of reasoning and actions culminating in a final answer A

Pipeline Flow

Input Query & Toolset
Reasoning & Action Loop (Interleaved)
Final Answer Generation
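
The flow above can be sketched as a minimal reasoning-and-action loop. This is an illustration only, not the paper's implementation: the `Step` record, the `policy` callable, and the dict-of-functions toolset are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One turn of the loop (hypothetical structure, not from the paper)."""
    thought: str
    tool: Optional[str] = None          # None means "stop and answer"
    args: Dict[str, Any] = field(default_factory=dict)
    answer: Optional[str] = None
    observation: Any = None


def run_agent(
    query: str,
    toolset: Dict[str, Callable[..., Any]],
    policy: Callable[[str, List[Step]], Step],
    max_steps: int = 8,
) -> Tuple[Optional[str], List[Step]]:
    """Interleave reasoning and tool calls until the policy emits an answer."""
    history: List[Step] = []
    for _ in range(max_steps):
        step = policy(query, history)
        if step.tool is None:           # final answer generation
            history.append(step)
            return step.answer, history
        try:
            step.observation = toolset[step.tool](**step.args)
        except Exception as exc:        # surface tool errors to the policy
            step.observation = f"error: {exc!r}"
        history.append(step)
    return None, history                # step budget exhausted


# Toy policy (assumption): call one tool, then answer with its observation.
def toy_policy(query: str, history: List[Step]) -> Step:
    if not history:
        return Step(thought="I need the price.", tool="price", args={"ticker": "ABC"})
    return Step(thought="I have the price.", answer=str(history[-1].observation))


toy_toolset = {"price": lambda ticker: {"ABC": 12.5}[ticker]}
answer, trajectory = run_agent("What is the price of ABC?", toy_toolset, toy_policy)
```

In a real system the policy would be the language model (Qwen3-8B in this summary) and the toolset would wrap live APIs; the loop structure is the only part the sketch is meant to convey.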

System Modules

Agent Policy

Generates reasoning thoughts and selects tool actions based on history

Model or implementation: Qwen3-8B

Tool Environment

Executes selected tools and returns observations

Model or implementation: Real-world APIs (Python/Web)

Novel Architectural Elements

The primary novelty lies not in the agent architecture, which is a standard ReAct loop, but in the 'Evidence-Driven' data synthesis pipeline used to train it. Nothing changes at inference time.

Modeling

Base Model: Qwen3-8B

Training Method: Two-stage training: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL)

Objective Functions:

Purpose: Maximize correctness of the final answer and validity of tool calls.

Formally: R = R_correct + R_format
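
One way to read the additive reward is as two independent checks summed together. The sketch below assumes exact-match answer scoring and JSON-encoded tool calls; the summary does not specify how either term is computed, so the helper names and both scoring rules are illustrative assumptions.

```python
import json
from typing import List, Set


def format_reward(tool_calls: List[str], known_tools: Set[str]) -> float:
    """1.0 if every call parses as a JSON object naming a known tool, else 0.0.

    Assumption: tool calls are serialized as JSON with a "name" field.
    """
    for raw in tool_calls:
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            return 0.0
        if not isinstance(call, dict) or call.get("name") not in known_tools:
            return 0.0
    return 1.0


def correctness_reward(predicted: str, reference: str) -> float:
    """Assumption: case-insensitive exact match after trimming whitespace."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def total_reward(
    predicted: str,
    reference: str,
    tool_calls: List[str],
    known_tools: Set[str],
) -> float:
    """R = R_correct + R_format, as stated in the summary."""
    return correctness_reward(predicted, reference) + format_reward(tool_calls, known_tools)
```

For example, a correct answer reached through a malformed tool call would score 1.0 under this sketch, while a correct answer with well-formed calls would score 2.0.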

Training Data:

Resource Preparation: 373 tools (Finance, Biology, Medicine, Academia, General), ~5000 seed concepts per domain, Query exemplars
Evidence Collection: Agent executes tools on random configurations to gather 'evidence' traces
Task Derivation: LLM reverse-derives (Q, A) pairs strictly entailed by the evidence traces
Dataset Size: 48k SFT samples + 3.2k RL samples
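
The evidence-first ordering described in the steps above can be sketched as follows. This is a toy illustration under stated assumptions: the tools are plain Python functions rather than real APIs, and `derive_task` is a hypothetical template-based stand-in for the LLM that reverse-derives the (Q, A) pair in the actual pipeline.

```python
import random
from typing import Any, Callable, Dict, List, Optional, Tuple

Tool = Callable[[str], Any]
Trace = List[Tuple[str, str, Any]]  # (tool name, input, observation)


def collect_evidence(tools: Dict[str, Tool], seeds: List[str], k: int,
                     rng: random.Random) -> Trace:
    """Run a random subset of k tools on a random seed concept."""
    seed = rng.choice(seeds)
    trace: Trace = []
    for name in rng.sample(sorted(tools), k):
        try:
            trace.append((name, seed, tools[name](seed)))
        except Exception:
            continue  # failed calls never become evidence
    return trace


def derive_task(trace: Trace) -> Optional[Tuple[str, str]]:
    """Stand-in for the LLM step: the answer is read directly off the trace."""
    if not trace:
        return None
    name, seed, obs = trace[-1]
    return f"Using {name}, what is the result for '{seed}'?", str(obs)


def synthesize(tools: Dict[str, Tool], seeds: List[str], n_tasks: int,
               k: int = 2, rng_seed: int = 0, max_attempts: int = 1000) -> List[dict]:
    """Trace first, task second: every emitted task carries its own evidence."""
    rng = random.Random(rng_seed)
    tasks: List[dict] = []
    for _ in range(max_attempts):
        if len(tasks) >= n_tasks:
            break
        trace = collect_evidence(tools, seeds, k, rng)
        task = derive_task(trace)
        if task is not None:
            tasks.append({"question": task[0], "answer": task[1], "evidence": trace})
    return tasks


# Toy tools (assumption): simple string functions in place of real-world APIs.
demo_tools: Dict[str, Tool] = {
    "length": lambda s: len(s),
    "upper": lambda s: s.upper(),
    "vowels": lambda s: sum(c in "aeiou" for c in s),
}
tasks = synthesize(demo_tools, ["insulin", "bond yield"], n_tasks=5)
```

Because each answer is copied from an observation that was actually produced, no separate solvability filter is needed in this sketch, which is the property the summary calls grounding by construction.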

Key Hyperparameters:

sft_data_size: 48,000
rl_data_size: 3,200
tools_count: 373

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolAlpaca: Dive uses real executable tools instead of simulated ones to ensure validity
vs. Standard Synthesis: Dive inverts the order (Trace -> Task) to guarantee solvability, whereas standard methods (Task -> Trace) suffer from low executability
vs. Search-based methods: Dive scales across 5 domains and 373 tools, enabling structural diversity beyond search-browse loops
vs. Trial-and-Error Synthesis [not cited in paper]: Dive avoids the computational waste of filtering failed trajectories by ensuring grounding by construction

Limitations

Reliance on the availability and stability of real-world APIs for the synthesis pipeline
Results are described for a single base model (Qwen3-8B); behavior with other model families or sizes is not covered in this summary
RL stage uses a relatively small dataset (3.2k) compared to SFT (48k)

Reproducibility

Code: https://sheep333c.github.io/DIVE/

Code and data are publicly available at https://sheep333c.github.io/DIVE/. The base model is Qwen3-8B. The 373 tools are curated and validated.

📊 Experiments & Results

Evaluation Setup

Tool-use generalization across unseen tasks and toolsets

Benchmarks:

9 OOD benchmarks covering diverse tool use (clinical, financial, etc.)

Metrics:

Pass Rate / Success Rate
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
9 OOD Benchmarks (Average)	Average Score	Not reported in the paper	Not reported in the paper	+22 points
OOD Generalization	Performance	Lower	Higher	Positive

Main Takeaways

Dive significantly improves OOD generalization (+22 points), validating that structural diversity in training data prevents overfitting to rigid tool-use patterns.
Inverted synthesis (Evidence -> Task) effectively solves the validity bottleneck, allowing for scalable data generation without manual pipeline engineering.
RL training on diverse data further amplifies robustness compared to SFT alone.
Diversity is more data-efficient than quantity: smaller, diverse datasets outperform larger, homogeneous ones.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM tool use (ReAct paradigm)
Familiarity with synthetic data generation for LLMs
Basic Reinforcement Learning concepts (SFT vs RL)

Key Terms

SFT: Supervised Fine-Tuning—training a model on labeled examples (input-output pairs) to learn a specific behavior

RL: Reinforcement Learning—training an agent to maximize a reward signal by interacting with an environment

OOD: Out-of-Distribution—tasks or data that differ significantly from what the model saw during training (e.g., new tools or domains)

Grounding: Ensuring that an AI's outputs are based on verifiable facts or real execution traces rather than hallucination

Inverted Synthesis: The process of generating the answer/execution trace first and then deriving the question, ensuring solvability

Topic Collapse: A failure mode in synthetic data generation where the model repeatedly produces the same few high-frequency concepts

Trajectory: The sequence of thoughts, actions (tool calls), and observations (tool outputs) an agent generates while solving a task
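
Topic collapse, as defined above, can be made measurable with a simple frequency statistic. The sketch below uses normalized Shannon entropy over the concepts that appear in generated tasks. This is an assumed diagnostic for illustration, not a metric taken from the paper, and `vocab_size` (the size of the seed concept pool) is a hypothetical parameter.

```python
import math
from collections import Counter
from typing import Iterable


def normalized_entropy(concepts: Iterable[str], vocab_size: int) -> float:
    """Shannon entropy of the concept distribution, scaled by log(vocab_size).

    Values near 1.0 mean concepts are spread evenly over the pool;
    values near 0.0 mean a few concepts dominate (topic collapse).
    """
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    counts = Counter(concepts)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(vocab_size)


# Toy data (assumption): a pool of 4 seed concepts.
collapsed = ["stocks"] * 9 + ["bonds"]
uniform = ["stocks", "bonds", "genes", "papers"] * 3

collapsed_score = normalized_entropy(collapsed, vocab_size=4)
uniform_score = normalized_entropy(uniform, vocab_size=4)
```

A synthesis run whose score drifts downward as more tasks are generated would be a sign that the generator is revisiting the same high-frequency concepts.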