Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation

📝 Paper Summary

Multi-call tool use with flexible plan
Multi-task planning

DTA-Llama transforms sequential tool invocation into parallelizable 'Process/Threads' execution by restructuring training data into Directed Acyclic Graphs, significantly reducing inference latency and token costs.

Core Problem

Existing tool learning methods (CoT, ReAct, DFSDT) execute tools sequentially or use backtracking search trees, leading to limited perceptual scope, high token consumption, and slow inference speeds.

Why it matters:

Sequential invocation forces LLMs to wait for each tool result before planning the next step, preventing efficiency gains from parallelizable sub-tasks.
Tree-based methods like ToolLLM improve success rates via backtracking but incur massive computational overhead and latency, making them impractical for real-time applications.
Real-world complex tasks often contain independent components (e.g., booking two different flights) that current serial agents handle inefficiently.

Concrete Example: In a task requiring weather checks for both 'Beijing' and 'Shanghai', a ReAct agent queries Beijing, waits for the result, then queries Shanghai. DTA-Llama identifies these as independent sub-tasks and queries both API endpoints simultaneously.
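The contrast in this example can be sketched in a few lines of Python. This is only an illustration: `get_weather` is a hypothetical stand-in for a real weather API, and the sleep simulates network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def get_weather(city: str) -> str:
    """Hypothetical stand-in for a weather API call (not a real endpoint)."""
    time.sleep(0.05)  # simulated network latency
    return f"weather report for {city}"


cities = ["Beijing", "Shanghai"]

# Serial (ReAct-style): the second call starts only after the first returns.
serial = [get_weather(city) for city in cities]

# Parallel (DTA-style): both independent calls are in flight at the same time.
with ThreadPoolExecutor(max_workers=len(cities)) as pool:
    parallel = list(pool.map(get_weather, cities))
```

Both variants return the same reports in the same order. Only the waiting time differs, since the parallel version overlaps the two simulated delays.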

Key Novelty

Divide-Then-Aggregate (DTA) Parallel Tool Invocation

Replaces the 'Thought-Action-Observation' loop with a 'Process-Threads-Aggregate' mechanism where the LLM (Process) generates a batch of parallelizable tool calls.
Executes these calls concurrently in separate Threads and uses an intermediate state lock to wait for all results before aggregating them back to the LLM.
Constructs training data by converting serial successful paths from tree-search methods into Directed Acyclic Graphs (DAGs) using GPT-4, identifying parallelizable nodes.
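One way to picture how a DAG exposes parallelizable nodes is to group its nodes into batches whose prerequisites are all complete. The sketch below uses Python's standard `graphlib` module. The node names and the dependency map are illustrative assumptions, not data from the paper, and the paper's own transformation is performed by GPT-4 rather than by this procedure.

```python
from graphlib import TopologicalSorter


def parallel_batches(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group DAG nodes into batches; each batch depends only on earlier batches."""
    sorter = TopologicalSorter(deps)  # maps each node to its prerequisites
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every node whose prerequisites are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical tool calls: two independent lookups feeding one comparison step.
deps = {
    "weather_beijing": set(),
    "weather_shanghai": set(),
    "compare_weather": {"weather_beijing", "weather_shanghai"},
}
batches = parallel_batches(deps)
```

Here the two weather lookups land in the first batch and the comparison in the second, which mirrors the idea of issuing independent tool calls together.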

Architecture

The overall framework of DTA-Llama, contrasting the data construction phase (DAG transformation) and the inference phase (Process/Threads mechanism).

Evaluation Highlights

Achieves 83.35% Solvable Pass Rate (SoPR) on StableToolBench, surpassing ToolLLM (DFSDT) while using significantly fewer tokens.
Cuts average inference time by a factor of ~2.5 relative to ToolLLM (DFSDT) and ~1.5 relative to standard CoT methods.
Llama-2-7B fine-tuned with DTA matches the performance of GPT-3.5-Turbo's official parallel function calling capability.

Breakthrough Assessment

8/10

Offers a highly practical solution to the efficiency bottleneck in tool-using agents. The shift from serial/tree search to DAG-based parallel execution is a logical and effective step forward.

⚙️ Technical Details

Problem Definition

Setting: Task-oriented tool learning where an agent must use external APIs to solve a user instruction q.

Inputs: User instruction q and a set of available tools.

Outputs: A sequence of tool invocations and the final answer.

Pipeline Flow

Process: Task Planning & Decomposition
Threads: Parallel Tool Execution
Aggregation: Result Locking & Collection

System Modules

Process (Planner)

Analyzes history, decomposes the current task, and generates a batch of independent tool invocation plans

Model or implementation: DTA-Llama (Llama-2-7B or similar)

Threads (Executor)

Executes the tool plans generated by the Process concurrently

Model or implementation: External APIs / Tool Environment

Intermediate State Lock

Waits for all Threads to complete and aggregates their results into a unified Observation

Model or implementation: Deterministic Code Logic
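The interaction of the three modules can be sketched as a small loop. Everything named here is an assumption for illustration: `process` is a stub that stands in for the fine-tuned LLM, `TOOLS` is a hypothetical registry, and `concurrent.futures.wait` plays the role of the intermediate state lock.

```python
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical tool registry; real tools would be external API calls.
TOOLS = {
    "weather": lambda city: f"{city}: sunny",
}


def process(history: list) -> list[tuple[str, str]]:
    """Stub planner standing in for the LLM: returns a batch of (tool, argument) plans."""
    if not history:
        return [("weather", "Beijing"), ("weather", "Shanghai")]
    return []  # an empty batch means the task is finished


def run_threads(plans: list[tuple[str, str]]) -> list[str]:
    """Run one batch concurrently and block until every call has finished."""
    with ThreadPoolExecutor(max_workers=len(plans)) as pool:
        futures = [pool.submit(TOOLS[name], arg) for name, arg in plans]
        wait(futures)  # acts as the lock: no result is returned until all are ready
        return [future.result() for future in futures]  # aggregated in plan order


history = []
while True:
    plans = process(history)
    if not plans:
        break
    history.append(run_threads(plans))  # one unified observation per batch
```

In this sketch each loop iteration corresponds to one Process step, and the planner sees the aggregated observation only after the whole batch has completed.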

Novel Architectural Elements

Transformation of the sequential Thought-Action-Observation loop into a Process-Threads-Aggregation parallel workflow
Use of an 'Intermediate State Lock' to synchronize parallel tool outputs before feeding them back to the LLM

Modeling

Base Model: Llama-2-7B

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Minimize the negative log-likelihood of the generated thoughts and tool plans given the history.

Formally: Standard causal language modeling loss $\mathcal{L} = -\sum_{i} \log P(y_i \mid q, o_{<i}, y_{<i})$
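As a small worked instance of this loss, the sketch below sums the negative log-probabilities of the target tokens y_i. The probabilities are made-up values for illustration. In the formula, the instruction q and the earlier observations o only appear as conditioning context, so they contribute no terms of their own.

```python
import math


def sequence_nll(token_probs: list[float]) -> float:
    """Negative log-likelihood: sum of -log P(y_i | context) over target tokens."""
    return -sum(math.log(p) for p in token_probs)


# Hypothetical probabilities the model assigns to three target tokens.
probs = [0.5, 0.25, 0.5]
loss = sequence_nll(probs)  # equals log(16), since 1 / (0.5 * 0.25 * 0.5) = 16
```

The loss falls to zero only when every target token receives probability 1.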

Training Data:

Derived from ToolBench dataset (ToolLLM)
Selected successful paths from DFSDT trees
Used GPT-4-turbo to analyze dependencies and transform linear paths into DAGs
Filtered out cyclic graphs and non-aggregatable results
Result: ~20k entries in DTA-Tool dataset
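The cycle-filtering step can be illustrated with a short check. This is a sketch under assumptions: the paper does not describe its filtering code, and the candidate graphs below are invented. Each graph maps a node to the set of nodes it depends on.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(deps: dict[str, set[str]]) -> bool:
    """Return True if the dependency graph contains no cycle."""
    try:
        TopologicalSorter(deps).prepare()  # raises CycleError on a cycle
        return True
    except CycleError:
        return False


# Hypothetical candidate graphs, as a converter might emit them.
candidates = {
    "ok": {"a": set(), "b": set(), "c": {"a", "b"}},
    "cyclic": {"a": {"b"}, "b": {"a"}},
}
kept = [name for name, graph in candidates.items() if is_acyclic(graph)]
```

Only the graph named `ok` survives. The second graph is rejected because `a` and `b` depend on each other.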

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper
epochs: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolLLM: DTA avoids backtracking by planning parallel paths upfront, reducing latency significantly.
vs. ReAct: DTA allows multiple tools to be called in a single turn (parallel) rather than one by one (serial).
vs. GPT-3.5 Function Calling: DTA achieves comparable performance using a much smaller open-source model (7B) via fine-tuning on DAG-structured data.

Limitations

Depends on the quality of GPT-4's transformation of serial paths to DAGs during data construction.
The 'Intermediate State Lock' mechanism waits for the slowest tool in each parallel batch, so a batch is only as fast as its highest-latency API.
Limited exploration of failure recovery compared to tree-based search methods that can backtrack extensively.
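The latency limitation can be made concrete with a simple model: a serial agent pays the sum of all tool latencies, while a batch behind a lock pays the maximum latency within each batch. The numbers below are hypothetical, and the model ignores LLM generation time.

```python
def serial_latency(batches: list[list[float]]) -> float:
    """Total tool time when every call runs one after another."""
    return sum(sum(batch) for batch in batches)


def parallel_latency(batches: list[list[float]]) -> float:
    """Total tool time when each batch waits for its slowest call."""
    return sum(max(batch) for batch in batches)


# Hypothetical per-call latencies in seconds: a two-call batch, then a single call.
batches = [[1.0, 4.0], [2.0]]
serial_total = serial_latency(batches)      # 1.0 + 4.0 + 2.0
parallel_total = parallel_latency(batches)  # max(1.0, 4.0) + 2.0
```

In this example the 4-second call dominates its batch, so parallelism saves only the 1 second taken by the faster call.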

Reproducibility

Code: https://corn0205.github.io/

Code, dataset, and model weights are publicly available at https://corn0205.github.io/. GPT-4-turbo was used for data construction (closed source dependency). Hyperparameters for training are not explicitly detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Evaluated on StableToolBench, a benchmark for real-world tool use stability.

Benchmarks:

StableToolBench (Tool-use Query Answering)

Metrics:

Solvable Pass Rate (SoPR)
Solvable Win Rate (SoWR)
Token Consumption
Inference Time
Statistical methodology: Not explicitly reported in the paper
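As a rough illustration of the two rate metrics, the sketch below computes them as plain percentages from per-task judgments. The judgments are invented, and the benchmark's actual evaluation relies on an LLM judge with its own scoring rules, which this sketch does not reproduce.

```python
def solvable_pass_rate(passed: list[bool]) -> float:
    """Percentage of solvable tasks judged as successfully completed."""
    return 100.0 * sum(passed) / len(passed)


def solvable_win_rate(outcomes: list[str]) -> float:
    """Percentage of tasks judged better than or equal to the baseline ('win' or 'tie')."""
    favorable = sum(outcome in ("win", "tie") for outcome in outcomes)
    return 100.0 * favorable / len(outcomes)


# Hypothetical judgments for four solvable tasks.
sopr = solvable_pass_rate([True, True, False, True])
sowr = solvable_win_rate(["win", "tie", "loss", "win"])
```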

Key Results

Performance comparison on StableToolBench showing DTA-Llama's superiority over baselines.
Benchmark	Metric	Baseline	This Paper	Δ
StableToolBench	SoPR (Solvable Pass Rate)	56.80	83.35	+26.55
StableToolBench	SoPR (Solvable Pass Rate)	82.80	83.35	+0.55
StableToolBench	Average Inference Time (seconds)	110	45	-65
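The Δ column and the implied speedup can be rechecked with simple arithmetic. The values below are copied from the rows above. The baseline labels are left out because the table does not name them.

```python
# (metric, baseline value, this paper's value), copied from the table rows.
rows = [
    ("SoPR", 56.80, 83.35),
    ("SoPR", 82.80, 83.35),
    ("Average Inference Time (s)", 110, 45),
]

# Difference per row, rounded to two decimals to avoid floating-point noise.
deltas = [round(new - base, 2) for _, base, new in rows]

# Ratio of baseline time to DTA-Llama time for the inference-time row.
speedup = 110 / 45
```

The computed differences match the Δ column, and the inference-time ratio is about 2.44.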

Experiment Figures

Comparison of different tool learning paradigms (CoT, ReAct, DFSDT, DTA) regarding structure and efficiency.

Main Takeaways

DTA-Llama achieves a higher success rate (SoPR) than serial (CoT/ReAct) and tree-based (DFSDT) methods, with reported gains ranging from +0.55 to +26.55 points, while cutting average inference time from 110 s to 45 s.
The method demonstrates strong generalization when applied to different base models (Llama-2-7B, Llama-2-13B, etc.).
Parallel invocation reduces the number of interaction turns required, directly lowering token consumption compared to backtracking search methods.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and fine-tuning
Familiarity with Chain-of-Thought (CoT) and ReAct prompting
Basic knowledge of Tree-Search algorithms (DFS) and Directed Acyclic Graphs (DAGs)
Concepts of concurrency (Processes/Threads) in operating systems

Key Terms

DAG: Directed Acyclic Graph—a data structure with directed edges and no cycles, used here to represent dependencies between tool calls

DFSDT: Depth First Search-based Decision Tree—a search algorithm used by previous methods (ToolLLM) that explores tool paths sequentially with backtracking

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer

ReAct: Reason+Act—a framework where LLMs alternate between reasoning traces and tool actions

SoPR: Solvable Pass Rate—the percentage of solvable tasks that the model successfully completes

SoWR: Solvable Win Rate—the percentage of tasks where the model's solution is judged as better than or equal to a baseline (often by ChatGPT)

Process/Threads: A computing analogy where 'Process' is the main planning agent and 'Threads' are parallel execution units for tool calls