TaskCraft: Automated Generation of Agentic Tasks

📝 Paper Summary

Synthetic Data Generation, Agentic AI, Tool Use

TaskCraft is an automated workflow that generates scalable, multi-tool agentic tasks with execution trajectories by reversing tool logic (answer-to-question) and iteratively extending simple tasks into complex hierarchies.

Core Problem

Existing agentic benchmarks (like GAIA and HLE) rely on expensive, non-scalable human annotation, while current synthetic methods (like Self-Instruct) lack the dynamic tool interactions required for agentic tasks.

Why it matters:

Training advanced agents requires massive amounts of trajectory data that demonstrates tool use and adaptive reasoning
Static instruction-following data fails to model the dynamic environment interactions central to agentic workflows
Manual annotation of complex tasks (e.g., HLE required 1,000 experts for 2,500 questions) is too costly to scale

Concrete Example: A standard LLM might generate a static question like 'What is the capital of France?', but would fail to create a verifiable task that requires a PDF reader tool to extract a specific financial figure from a report and compare it to a live stock price.

Key Novelty

Reverse-Engineering Agentic Tasks (TaskCraft)

Constructs 'atomic' tasks by starting with a tool's output (the answer) and prompting an LLM to reverse-engineer the question and required input (the premise)
Expands difficulty recursively via 'depth-based' extension (finding prerequisites for the current input) and 'width-based' extension (merging independent sub-problems)
Verifies extended tasks using linguistic analysis rather than full agent execution, significantly reducing computational cost

Architecture

The complete TaskCraft workflow, illustrating the progression from unlabeled data to atomic tasks, and then to extended depth/width tasks with verification loops.

Evaluation Highlights

+14.0% average performance improvement for Qwen2.5-3B-Base on multi-hop QA benchmarks after SFT with TaskCraft data
Achieves +19.2% gain on Bamboogle and +6.2% on Musique compared to the Search-R1 baseline using Qwen2.5-3B-Base
Prompt optimization reduced atomic task generation time by 19.2% and improved pass rates from 54.9% to 68.1%

Breakthrough Assessment

8/10

Addresses the critical bottleneck of data scarcity for agentic AI. The reverse-generation and efficient verification pipeline enables scalable, high-quality synthetic data creation that demonstrably improves model performance.

⚙️ Technical Details

Problem Definition

Setting: Automated generation of tool-use tasks and execution trajectories

Inputs: Unlabeled corpus (Webpages, PDFs, Images)

Outputs: Agentic tasks (question q, answer a) and execution trajectories

Pipeline Flow

Corpus Processing: Extract potential tool inputs and content
Atomic Task Generation: Reverse-engineer questions from content
Task Extension: Apply depth/width strategies to increase complexity
Verification: Two-phase validation (Agentic for atomic, Linguistic for extended)
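
The four stages above can be read as a chain of functions. The following is a minimal sketch under stated assumptions, not the paper's implementation: the `Task` container, `run_pipeline`, and every stage callable are hypothetical names, and the stages passed in `demo` are trivial placeholders for what would be LLM and tool calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Task:
    """Hypothetical container for a generated task (question q, answer a)."""
    question: str
    answer: str
    hops: int = 1  # number of dependent steps (an assumption for illustration)


def run_pipeline(
    corpus: Iterable[str],
    extract: Callable[[str], List[str]],
    make_atomic: Callable[[str], Task],
    verify_atomic: Callable[[Task], bool],
    extend: Callable[[Task], Task],
    verify_extended: Callable[[Task], bool],
) -> List[Task]:
    """Chain corpus processing, atomic generation, extension, and verification."""
    kept: List[Task] = []
    for document in corpus:
        for content in extract(document):      # corpus processing
            atomic = make_atomic(content)        # atomic task generation
            if not verify_atomic(atomic):        # agentic check for atomic tasks
                continue
            kept.append(atomic)
            extended = extend(atomic)            # depth/width extension
            if verify_extended(extended):        # linguistic check for extended tasks
                kept.append(extended)
    return kept


def demo() -> List[Task]:
    """Run the pipeline with placeholder stages (no LLM or tool access needed)."""
    return run_pipeline(
        corpus=["Revenue was 12 million. Headcount was 40."],
        extract=lambda doc: [s.strip() for s in doc.split(".") if s.strip()],
        make_atomic=lambda c: Task(f"What does the report say about {c.split()[0]}?", c),
        verify_atomic=lambda t: bool(t.answer),
        extend=lambda t: Task(f"Find the report, then answer: {t.question}", t.answer, t.hops + 1),
        verify_extended=lambda t: t.hops > 1,
    )
```

Passing the stages as callables keeps the control flow visible while leaving the model and tool choices open, which is all this summary specifies.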

System Modules

Atomic Generator (Task Creation)

Create simple tasks by sampling answers from tool outputs and inferring the corresponding question

Model or implementation: LLM (e.g., GPT-4 or similar capable model)
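
A rough illustration of the answer-to-question direction is given below. It is a sketch only: the prompt wording, the `AtomicTask` fields, and the function names are assumptions made for this example, and `stub_llm` stands in for a real model call so the code runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AtomicTask:
    """Hypothetical record of one reverse-generated task."""
    question: str
    answer: str
    tool: str
    tool_input: str


def build_reverse_prompt(tool: str, tool_input: str, answer: str) -> str:
    """Build an illustrative prompt that starts from a known tool output."""
    return (
        f"A tool named '{tool}' was called on input '{tool_input}' "
        f"and its output contains this fact: '{answer}'.\n"
        "Write one question that requires calling the tool on that input "
        "and whose answer is exactly that fact. Return only the question."
    )


def generate_atomic_task(
    tool: str, tool_input: str, answer: str, llm: Callable[[str], str]
) -> AtomicTask:
    """Ask the model for the question; the answer is fixed in advance."""
    question = llm(build_reverse_prompt(tool, tool_input, answer)).strip()
    return AtomicTask(question=question, answer=answer, tool=tool, tool_input=tool_input)


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned question."""
    return " What net revenue does the 2023 annual report state? "


task = generate_atomic_task("pdf_reader", "annual_report_2023.pdf", "12 million USD", stub_llm)
```

Because the answer is sampled first, the gold label never depends on the model answering its own question correctly; only the question text is generated.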

Task Extender (Task Creation)

Iteratively complicate tasks by finding superset inputs (depth) or merging tasks (width)

Model or implementation: LLM
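
The two extension strategies can be sketched on plain strings. This is an illustrative simplification, assuming that depth extension replaces a known identifier in the question with a description that must be resolved first, and that width extension concatenates two independent questions. The `Task` type and both function names are hypothetical; in the workflow described here an LLM would propose the rewritten text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; answers are kept in question order."""
    question: str
    answers: Tuple[str, ...]
    hops: int = 1


def extend_depth(task: Task, identifier: str, prerequisite: str) -> Task:
    """Hide an identifier behind a description, adding one dependent step."""
    if identifier not in task.question:
        raise ValueError("identifier must appear in the question")
    question = task.question.replace(identifier, prerequisite, 1)
    return Task(question, task.answers, task.hops + 1)


def extend_width(first: Task, second: Task) -> Task:
    """Merge two independent tasks into one question that needs both answers."""
    head = first.question.rstrip("?")
    tail = second.question[:1].lower() + second.question[1:]
    return Task(f"{head}, and {tail}", first.answers + second.answers,
                max(first.hops, second.hops))


atomic = Task("Who directed Inception?", ("Christopher Nolan",))
deeper = extend_depth(atomic, "Inception", "the 2010 film in which thieves enter dreams")
other = Task("What year was Inception released?", ("2010",))
wider = extend_width(atomic, other)
```

Depth extension leaves the answer unchanged and lengthens the chain needed to reach it, while width extension leaves chain length unchanged and enlarges the answer set.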

Verifier

Filter invalid tasks using agent execution (atomic) or logical consistency checks (extended)

Model or implementation: Judge-LLM + Infer-LLM
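
One plausible reading of the atomic filter is the rejection rule "solvable with tools, not solvable by the LLM alone". The sketch below encodes only that rule. How the work is split between the judge and infer models is not spelled out in this summary, so `exact_match_judge` is a deliberately simple stand-in for a judge LLM, and all names are hypothetical.

```python
from typing import Callable

# (predicted answer, gold answer) -> whether the prediction counts as correct
Judge = Callable[[str, str], bool]


def exact_match_judge(predicted: str, gold: str) -> bool:
    """Stand-in for a judge LLM: case- and whitespace-insensitive equality."""
    return " ".join(predicted.lower().split()) == " ".join(gold.lower().split())


def keep_atomic_task(
    gold: str,
    agent_answer: str,
    llm_only_answer: str,
    judge: Judge = exact_match_judge,
) -> bool:
    """Keep a task only if the tool-using agent is right and the tool-free LLM is not."""
    solved_with_tools = judge(agent_answer, gold)
    solved_without_tools = judge(llm_only_answer, gold)
    return solved_with_tools and not solved_without_tools
```

Under this rule a question such as 'What is the capital of France?' is rejected even when the agent answers it correctly, because the tool-free model answers it too.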

Novel Architectural Elements

Reverse-generation workflow: Deriving questions from answers/tool-outputs rather than generating questions first
Hybrid verification strategy: Using expensive agent-based verification only for atomic components and cheaper linguistic verification for structural extensions

Modeling

Base Model: Qwen2.5-3B-Base and Qwen2.5-3B-Instruct

Training Method: Supervised Fine-Tuning (SFT)

Trainable Parameters: Full model fine-tuning implied

Training Data:

3,202 synthesized multi-hop tasks and trajectories used for SFT experiments
Total dataset created: ~36,000 tasks

Compute: Not reported in the paper

Comparison to Prior Work

vs. Self-Instruct: TaskCraft incorporates actual tool execution and environment interaction, generating trajectories rather than just text pairs
vs. GAIA: TaskCraft is fully automated and scalable, whereas GAIA relies on manual annotation
vs. ToolBench [not cited in paper]: ToolBench focuses on API breadth; TaskCraft focuses on structural complexity (depth/width) and verifiability via reverse generation

Limitations

Task failure rates increase significantly for complex modalities like PDFs and image understanding compared to web search
The approach relies on the assumption of an 'ideal search engine' to retrieve precise data during the generation phase
Verification of extended tasks relies on linguistic analysis, which might miss some execution-level edge cases compared to full agent verification

Reproducibility

Code: https://github.com/OPPO-PersonalAI/TaskCraft

The dataset of ~36k tasks and the code are publicly available on GitHub. The paper details the prompt optimization strategy and the recursive formulation for task extension. SFT hyperparameters are stated to follow the Search-R1 setup.

📊 Experiments & Results

Evaluation Setup

Agentic performance evaluation on multi-hop QA datasets using generated trajectories for SFT

Benchmarks:

HotpotQA (Multi-hop reasoning QA)
Musique (Multi-hop reasoning QA)
Bamboogle (Multi-hop reasoning QA)

Metrics:

Pass Rate (for generation)
Generation Time
Performance Gain (%) (for downstream agent SFT)
Statistical methodology: Not explicitly reported in the paper

Key Results

Prompt optimization experiments demonstrate the self-evolving capability of the workflow, improving efficiency and success rates.

Benchmark	Metric	Baseline	This Paper	Δ
Internal Generation	Pass Rate (%)	54.9	68.1	+13.2
Internal Generation	Generation Time (s)	29.1	23.5	-5.6

Downstream SFT experiments show that fine-tuning agent models on TaskCraft-generated trajectories significantly improves performance on established benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
Bamboogle	Performance Gain (%)	0.0	19.2	+19.2
Musique	Performance Gain (%)	0.0	6.2	+6.2
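
The generation figures mix absolute and relative changes, so a short arithmetic check helps when reading them. The snippet uses only the numbers reported above; the helper name `relative_change` is introduced here for illustration.

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# Generation time: 29.1 s -> 23.5 s is an absolute drop of 5.6 s.
time_change_pct = relative_change(29.1, 23.5)

# Pass rate: 54.9% -> 68.1% is a gain in percentage points,
# which differs from the relative gain over the baseline.
pass_rate_gain_points = 68.1 - 54.9
pass_rate_gain_relative_pct = relative_change(54.9, 68.1)

print(f"time change: {time_change_pct:.1f}%")
print(f"pass rate gain: {pass_rate_gain_points:.1f} points "
      f"({pass_rate_gain_relative_pct:.1f}% relative)")
```

The 5.6 s drop corresponds to the 19.2% time reduction quoted in the evaluation highlights, and the +13.2 for pass rate is a difference in percentage points, roughly a 24% relative improvement.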

Experiment Figures

Task failure rates across different modalities (Web, PDF, Image) for generated tasks.

Main Takeaways

Synthetic data from TaskCraft effectively enhances supervised fine-tuning (SFT) for agentic models, showing gains across multiple datasets.
The 'reverse' generation method combined with depth/width extension allows for the creation of difficult tasks that challenge current agents (as seen in high failure rates for PDF/Image tasks).
Self-evolving prompt optimization significantly reduces the computational cost and increases the yield of the generation pipeline.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM-based agents and tool use (e.g., the ReAct framework)
Knowledge of Supervised Fine-Tuning (SFT)
Familiarity with Reinforcement Learning (RL) for language models

Key Terms

Atomic Task: A simple agentic task solvable with a single specific tool invocation, generated by reversing the logic from answer to question

Search-R1: A reinforcement-learning-trained framework in which an LLM interleaves reasoning with search-engine calls; used here as the baseline

SFT: Supervised Fine-Tuning, i.e., training a model on labeled examples (in this case, generated task trajectories) to improve its instruction-following and tool-use capabilities

Rejection Sampling: A technique used here to filter out low-quality generated tasks by checking whether they meet specific criteria (e.g., solvable by tools but not by the LLM alone)

Depth-based extension: A method to increase task complexity by creating a chain of dependencies, where the output of one step becomes the input for the next

Width-based extension: A method to increase task complexity by combining multiple independent sub-problems into a single query

Agentic task: A problem requiring autonomous multi-step reasoning, tool use, and environmental interaction to solve