
DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use

Aili Chen, Chi Zhang, Junteng Liu, Jiangjie Chen, Chengyu Du, Yunji Li, Ming Zhong, Qin Wang, Zhengmao Zhu, Jiayuan Song, Ke Ji, Junxian He, Pengyu Zhao, Yanghua Xiao
Fudan University, MiniMax
arXiv (2026)
Agent, RL, Reasoning, Benchmark

📝 Paper Summary

Synthetic Data Generation, Agentic Tool Use, Generalization
Dive improves agent generalization by inverting the synthesis process: it executes diverse real-world tools first to generate grounded evidence, then reverse-derives verifiable tasks from the resulting traces.
Core Problem
Current tool-using agents struggle to generalize to new tasks and toolsets because synthetic training data is confined to narrow templates and fixed tool combinations (e.g., only web search).
Why it matters:
  • Agents trained on rigid routines (e.g., search-browse loops) fail when faced with open-ended diversity in real-world deployments
  • Existing synthesis methods cannot scale diversity without sacrificing validity: manual pipelines are costly, while simulated environments often yield unverifiable or unsolvable tasks
Concrete Example: An agent trained primarily on web search tasks may over-rely on a 'search-then-browse' routine. When asked to perform clinical diagnosis using a specific 'PatientLookup' tool, it fails to adapt its pattern, leading to negative transfer.
Key Novelty
Evidence-Driven Inverted Synthesis
  • Inverts the standard 'Query first, Check later' synthesis order. Instead, it executes random tool combinations first to create a valid 'evidence' trace, then writes a question that this trace answers.
  • Ensures 'grounding by construction': because the task is derived from an actual successful tool execution, the task is guaranteed to be solvable and verifiable.
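The inverted order described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy tools, the `TOOL_POOL` registry, and the helpers `collect_evidence`, `derive_task`, and `verify` are all hypothetical names introduced here, and a fixed template stands in for whatever component writes the question in the real system.

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    tool: str        # name of the tool that was called
    args: dict       # arguments the tool was called with
    output: object   # what the tool actually returned


@dataclass
class Task:
    question: str
    answer: str
    trace: list      # the evidence trace the task was derived from


# Hypothetical toy tools: name -> (function, candidate argument sets).
TOOL_POOL = {
    "patient_lookup": (
        lambda patient_id: {"p1": "hypertension", "p2": "asthma"}[patient_id],
        [{"patient_id": "p1"}, {"patient_id": "p2"}],
    ),
    "word_count": (
        lambda text: len(text.split()),
        [{"text": "tools first questions later"}],
    ),
    "celsius_to_kelvin": (
        lambda celsius: celsius + 273.15,
        [{"celsius": 0.0}, {"celsius": 25.0}],
    ),
}


def collect_evidence(rng: random.Random, k: int) -> list:
    """Step 1: execute a random tool combination and record the trace."""
    names = rng.sample(sorted(TOOL_POOL), k)
    trace = []
    for name in names:
        fn, arg_choices = TOOL_POOL[name]
        args = rng.choice(arg_choices)
        trace.append(Step(name, args, fn(**args)))
    return trace


def derive_task(trace: list) -> Task:
    """Step 2: reverse-derive a question whose answer is the recorded evidence.

    Assumption: a plain template replaces the question writer here.
    """
    parts = [f"{s.tool} with {s.args}" for s in trace]
    question = "Report, in order, the results of: " + "; ".join(parts)
    answer = "; ".join(str(s.output) for s in trace)
    return Task(question, answer, trace)


def verify(task: Task) -> bool:
    """Re-run the recorded calls and check that they reproduce the answer."""
    replay = [TOOL_POOL[s.tool][0](**s.args) for s in task.trace]
    return "; ".join(str(o) for o in replay) == task.answer


rng = random.Random(0)
task = derive_task(collect_evidence(rng, k=2))
print(task.question)
print(task.answer, verify(task))
```

Because the answer is copied from outputs that the tools actually produced, every task built this way is solvable with the sampled tools, and replaying the trace gives a simple check. Widening the tool pool or raising `k` increases the variety of tool combinations without changing that property.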
Evaluation Highlights
  • Average improvement of +22 points across 9 out-of-distribution (OOD) benchmarks over baselines when training Qwen3-8B on Dive data
  • Outperforms the strongest 8B baseline by +68% on tool-use generalization tasks
  • Diversity scaling proves more effective than quantity scaling: Dive data achieves better OOD generalization even with 4x less data than quantity-focused baselines
Breakthrough Assessment
9/10
The 'inverted synthesis' approach elegantly resolves the tension between diversity and executability in synthetic data. The significant gains (+68%) suggest this is a major step forward for generalizable agents.