Learning to Use Tools via Cooperative and Interactive Agents

📝 Paper Summary

Multi-agent tool use, Agent coordination

ConAgents decomposes complex tool-use tasks into three specialized agents (Grounding, Execution, Review) that cooperate via iterative feedback loops, enabling self-correction and better performance than single-agent pipelines.

Core Problem

Existing single-agent tool-use methods follow rigid pipelines that lack flexibility to correct errors mid-execution and struggle to master diverse sub-skills (planning vs. coding) simultaneously.

Why it matters:

Rigid pipelines (Plan → Act) often propagate errors forward without correction, leading to cascading failures in multi-step tasks
Forcing one LLM to handle planning, coding, and reflection simultaneously overloads its context and capabilities, causing performance degradation on complex tasks
Open-source models struggle more with monolithic agent roles than stronger closed-source models such as GPT-4

Concrete Example: If a tool execution fails because of a wrong argument (e.g., searching for a movie with the wrong date format), a standard ReAct agent might simply retry blindly or crash. In ConAgents, the Review Agent spots the error in the execution code and explains the format mismatch to the Execution Agent, which then corrects the code dynamically.

Key Novelty

ConAgents: Cooperative and Interactive Agents Framework

Decomposes the tool-use process into three distinct roles: Grounding (planning), Execution (writing code/calling tools), and Review (critiquing and correcting)
Introduces two communication protocols: 'Automatic' (always review every step) and 'Adaptive' (review only when errors occur), allowing dynamic self-correction
Uses 'Specialized Action Distillation' (SPAN) to train smaller open-source models on these specific roles using clustered, high-quality trajectories from GPT-4

Architecture

The ConAgents framework showing the interaction between Grounding, Execution, and Review agents under two protocols.

Evaluation Highlights

Outperforms state-of-the-art baselines (e.g., DFSDT, ReAct) by up to +14% Success Rate on ToolBench and RestBench
Specialized Action Distillation (SPAN) enables Llama-2-7B to achieve strong performance with only 500 training examples per agent
Ablation studies show the Review Agent significantly boosts performance (+6% success rate) by catching errors before they propagate

Breakthrough Assessment

7/10

Strong empirical results and a logical decomposition of agent roles. The distillation strategy for open-source models is practical. While multi-agent reflection is becoming common, the specific protocols and distillation method offer solid contributions.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn tool-use task solving where an agent must select and execute tools to satisfy a user instruction

Inputs: Natural language task description x and a set of available tools S

Outputs: A sequence of actions ending in a final response or solution to the task

Pipeline Flow

Grounding Agent (Plan Generation)
Review Agent (Plan Critique - optional/conditional)
Execution Agent (Tool Invocation/Code Generation)
Review Agent (Execution Critique - optional/conditional)
Environment (Tool Execution & Result Feedback)
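
The flow above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the three agents are stubbed with plain Python functions in place of LLM calls, the plan-critique step is omitted for brevity, and every name here (`grounding_agent`, `execution_agent`, `review_agent`, `run_tool`, the `search_movie` tool and its date rule) is hypothetical.

```python
import re
from typing import Optional

def grounding_agent(task: str) -> str:
    # Stub for the LLM planner: emits the next step plan t_i.
    return f"search movies released on the date mentioned in: {task}"

def execution_agent(plan: str, feedback: Optional[str]) -> dict:
    # Stub for the LLM coder: emits a tool call c_i for the plan.
    # The first attempt uses a wrong date format; reviewer feedback fixes it.
    date = "2024-03-01" if feedback else "03/01/2024"
    return {"tool": "search_movie", "args": {"date": date}}

def run_tool(call: dict) -> dict:
    # Stub environment: this hypothetical tool only accepts YYYY-MM-DD dates.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", call["args"]["date"]):
        return {"ok": False, "error": "invalid date format"}
    return {"ok": True, "result": ["Some Movie"]}

def review_agent(call: dict, observation: dict) -> Optional[str]:
    # Stub critic: returns verbal feedback, or None when nothing is wrong.
    if not observation["ok"]:
        return f"Tool reported '{observation['error']}'; use YYYY-MM-DD."
    return None

def solve_step(task: str, max_retries: int = 3) -> dict:
    plan = grounding_agent(task)
    feedback = None
    observation = {"ok": False, "error": "not executed"}
    for attempt in range(1, max_retries + 1):
        call = execution_agent(plan, feedback)
        observation = run_tool(call)
        feedback = review_agent(call, observation)
        if feedback is None:
            # The reviewer raised no objection, so the step is accepted.
            return {"attempts": attempt, "observation": observation}
    return {"attempts": max_retries, "observation": observation}

outcome = solve_step("movies from March 1st, 2024")
print(outcome)
```

In this toy run the first tool call fails on the date format, the reviewer's feedback reaches the Execution stub, and the second attempt succeeds.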

System Modules

Grounding Agent

Decompose task and generate next step plan t_i

Model or implementation: Various (GPT-4, ChatGPT, Llama-2-7B/13B via SPAN)

Execution Agent

Generate executable code c_i based on plan t_i

Model or implementation: Various (GPT-4, ChatGPT, Llama-2-7B/13B via SPAN)

Review Agent

Check correctness of plan or execution code; provide verbal feedback

Model or implementation: Various (GPT-4, ChatGPT, Llama-2-7B/13B via SPAN)

Novel Architectural Elements

Decomposition of the single 'Agent' abstraction into three explicit roles (Grounding, Execution, Review) with formalized inter-agent feedback loops
Dual communication protocols (Automatic vs. Adaptive) governing when the Review Agent intervenes
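
The difference between the two protocols, as this summary describes them, comes down to when the Review Agent is invoked. The sketch below is an assumption-laden illustration rather than the paper's code: `Protocol`, `should_review`, and `count_reviews` are hypothetical names, and an "error" is reduced to a boolean flag per step.

```python
from enum import Enum
from typing import List

class Protocol(Enum):
    AUTOMATIC = "automatic"  # review every step
    ADAPTIVE = "adaptive"    # review only when an error occurs

def should_review(protocol: Protocol, step_failed: bool) -> bool:
    # Automatic always calls the reviewer; Adaptive waits for a failure.
    if protocol is Protocol.AUTOMATIC:
        return True
    return step_failed

def count_reviews(protocol: Protocol, step_failed_flags: List[bool]) -> int:
    # step_failed_flags[i] is True when step i raised an error.
    return sum(should_review(protocol, failed) for failed in step_failed_flags)

# Four steps, of which only the second one fails.
flags = [False, True, False, False]
automatic_reviews = count_reviews(Protocol.AUTOMATIC, flags)
adaptive_reviews = count_reviews(Protocol.ADAPTIVE, flags)
print(automatic_reviews, adaptive_reviews)
```

Under these assumptions the Adaptive protocol makes fewer reviewer calls whenever most steps succeed, which is the efficiency argument made for it.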

Modeling

Base Model: Llama-2-7B, Llama-2-13B, ChatGPT (gpt-3.5-turbo-0613), GPT-4

Training Method: Supervised Fine-Tuning (SFT) via LoRA (Low-Rank Adaptation)

Objective Functions:

Purpose: Minimize negative log-likelihood of the target action tokens given the context.

Formally: Standard causal language modeling loss.
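
Written out, the standard causal language modeling loss takes the form below. The notation is an assumption chosen for illustration and may differ from the paper's own symbols: $c$ is the context, $y_1, \dots, y_T$ are the target action tokens, and $\theta$ denotes the trainable parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid c,\, y_{<t}\right)
```

Only the target action tokens contribute to the sum; the context tokens are conditioned on but not scored.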

Adaptation: LoRA (r=8, alpha=16, dropout=0.05)

Trainable Parameters: LoRA adapters on Query and Value projection layers
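
To give a sense of scale, the adapter size implied by these settings can be estimated with simple arithmetic. This is a rough sketch, not a figure from the paper: the Llama-2-7B shape used here (hidden size 4096, 32 layers, square Query and Value projections) is an assumption taken from the publicly known model configuration, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # A LoRA adapter adds two matrices: A (r x d_in) and B (d_out x r).
    return r * d_in + d_out * r

# Assumed Llama-2-7B shape (not stated in the summary).
HIDDEN_SIZE = 4096
NUM_LAYERS = 32

# LoRA settings from the summary.
LORA_R = 8
LORA_ALPHA = 16

per_projection = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, LORA_R)
# Adapters sit on the Query and Value projections of every layer.
total_trainable = per_projection * 2 * NUM_LAYERS
# Standard LoRA scales the low-rank update by alpha / r.
scaling = LORA_ALPHA / LORA_R

print(f"{total_trainable:,} trainable LoRA parameters, scaling {scaling}")
```

Under these assumptions the adapters amount to a few million trainable parameters, a small fraction of the 7B base weights, which stay frozen.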

Training Data:

Sampled 2,919 high-quality tasks from ToolBench training set
Clustered tasks to remove duplicates
Generated solution trajectories using GPT-4 with ConAgents framework
Reorganized trajectories into role-specific (Grounding, Execution, Review) instruction-tuning data
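
The last step, reorganizing one trajectory into three role-specific datasets, might look roughly like this. The trajectory schema (`task`, `steps`, `plan`, `code`, `feedback`) and the `split_by_role` helper are hypothetical; the paper's actual data format and prompt context are likely richer (for example, including interaction history).

```python
from typing import Dict, List

def split_by_role(trajectory: Dict) -> Dict[str, List[Dict[str, str]]]:
    # Turn one multi-agent trajectory into per-role (input, output) pairs.
    task = trajectory["task"]
    data: Dict[str, List[Dict[str, str]]] = {
        "grounding": [], "execution": [], "review": []
    }
    for step in trajectory["steps"]:
        # Grounding learns: task -> plan.
        data["grounding"].append({"input": task, "output": step["plan"]})
        # Execution learns: plan -> code.
        data["execution"].append({"input": step["plan"], "output": step["code"]})
        # Review learns: code -> feedback, only where feedback was given.
        if step.get("feedback"):
            data["review"].append(
                {"input": step["code"], "output": step["feedback"]}
            )
    return data

example = {
    "task": "Find a movie released on 2024-03-01",
    "steps": [
        {"plan": "call the search tool", "code": "search(date='03/01/2024')",
         "feedback": "Date must be YYYY-MM-DD."},
        {"plan": "retry with fixed date", "code": "search(date='2024-03-01')",
         "feedback": None},
    ],
}
role_data = split_by_role(example)
print({role: len(rows) for role, rows in role_data.items()})
```

Each of the three lists can then be used to fine-tune a separate role-specific model.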

Key Hyperparameters:

learning_rate: 5e-5
batch_size: 128 (micro-batch 4 with accumulation)
epochs: 2
max_length: 4096
lora_r: 8
lora_alpha: 16
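
The batch-size entry can be unpacked with a short calculation. This is only a sketch under two assumptions that the summary does not confirm: the micro-batch of 4 is per GPU, and all 8 GPUs listed under Compute run data-parallel. `accumulation_steps` is a hypothetical helper.

```python
GLOBAL_BATCH = 128  # from the summary
MICRO_BATCH = 4     # from the summary; assumed to be per GPU
NUM_GPUS = 8        # from the summary; assumed to run data-parallel

def accumulation_steps(global_batch: int, micro_batch: int, num_gpus: int) -> int:
    # Samples processed in one forward/backward pass across all GPUs.
    per_pass = micro_batch * num_gpus
    if global_batch % per_pass != 0:
        raise ValueError("global batch must be a multiple of micro_batch * num_gpus")
    return global_batch // per_pass

steps = accumulation_steps(GLOBAL_BATCH, MICRO_BATCH, NUM_GPUS)
print(f"gradient accumulation steps: {steps}")
```

If the micro-batch were instead a global figure, the same formula with `num_gpus=1` would apply.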

Compute: 8 NVIDIA A800 GPUs for training

Comparison to Prior Work

vs. ReAct: ConAgents adds explicit Review Agent and separates Planning/Execution roles, allowing self-correction
vs. RestGPT: ConAgents introduces 'Adaptive' and 'Automatic' interaction protocols for flexible error handling, whereas RestGPT is rigid
vs. ToolLLM: ConAgents distills into specialized role-based models (SPAN) rather than a single monolithic tool-use model
vs. AutoAct [not cited in paper]: AutoAct also distinguishes planning/execution roles but focuses on self-instruction without the explicit 'Review' agent feedback loop ConAgents emphasizes

Limitations

Dependency on powerful teacher models (GPT-4) for generating high-quality training data for distillation
Increased inference cost and latency due to multiple agent interactions and review steps compared to single-pass methods
Review Agent can sometimes hallucinate errors or provide incorrect feedback, potentially derailing valid plans
Effectiveness of SPAN distillation is limited by the diversity and quality of the sampled ToolBench tasks

Reproducibility

Code: https://github.com/shizhl/ConAgents

The paper details the data construction process (SPAN) clearly, including heuristics for filtering and clustering ToolBench tasks. Hyperparameters for LoRA training are explicitly listed.

📊 Experiments & Results

Evaluation Setup

Tool use on complex queries requiring API calls

Benchmarks:

ToolBench (diverse tool use; I1, I2, I3 difficulty levels)
RestBench (RESTful API calls (TMDB, Spotify))

Metrics:

Pass Rate (PR)
Win Rate (WR) against baseline solutions (evaluated by ChatGPT)
Statistical methodology: Not explicitly reported in the paper
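
Both metrics reduce to simple proportions once per-task judgments are available. The sketch below assumes a simplified setup: `passed` holds one boolean per task, and `preferences` holds a hypothetical judge label (`"candidate"` or `"baseline"`) per pairwise comparison. How the paper's ChatGPT-based evaluator produces those labels is not covered here.

```python
from typing import List

def pass_rate(passed: List[bool]) -> float:
    # Percentage of tasks judged as successfully completed.
    return 100.0 * sum(passed) / len(passed)

def win_rate(preferences: List[str]) -> float:
    # Percentage of pairwise comparisons in which the judge preferred
    # the candidate solution over the baseline solution.
    wins = sum(label == "candidate" for label in preferences)
    return 100.0 * wins / len(preferences)

pr = pass_rate([True, True, False, True])
wr = win_rate(["candidate", "baseline", "candidate", "candidate", "baseline"])
print(pr, wr)
```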

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison on ToolBench showing ConAgents' superiority over baselines across different difficulty levels (I1, I2, I3). Pass Rate (PR) is the primary metric.
ToolBench (I1 - Easy)	Pass Rate	56.4	60.0	+3.6
ToolBench (I2 - Medium)	Pass Rate	53.6	62.4	+8.8
ToolBench (I3 - Hard)	Pass Rate	50.0	60.7	+10.7
Results for open-source models (Llama-2-13B) enhanced with SPAN distillation compared to monolithic baselines.
ToolBench (Avg)	Pass Rate	47.9	52.3	+4.4
Ablation studies on the Review Agent's impact.
ToolBench	Pass Rate	54.6	61.0	+6.4

Main Takeaways

ConAgents consistently outperforms single-agent baselines (ReAct, ToolLLM) and rigid pipelines (RestGPT), especially on harder tasks (I3 subset)
The Adaptive interaction protocol generally performs comparably to or better than Automatic while being more efficient, as it triggers reviews only when errors occur
Specialized Action Distillation (SPAN) effectively transfers multi-agent capabilities to smaller open-source models with limited data (500 samples), surpassing monolithic training
The Review Agent is critical; removing it drops performance significantly, validating the importance of iterative self-correction

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of LLM agents (ReAct, Chain-of-Thought)
Familiarity with tool-use benchmarks (ToolBench, RestBench)
Knowledge of instruction tuning and distillation

Key Terms

Grounding Agent: The agent responsible for reasoning about the task and generating a high-level plan or selecting which tool to use next

Execution Agent: The agent responsible for translating the plan into executable code (e.g., Python requests) or specific API calls

Review Agent: The agent acting as a critic that inspects the plan or execution results for errors and provides verbal feedback for correction

SPAN: Specialized Action Distillation—a method proposed in this paper to distill GPT-4's capabilities into smaller models by training them on role-specific sub-tasks

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices

ReAct: Reason+Act—a prompting paradigm where LLMs generate reasoning traces before executing actions

DFSDT: Depth First Search Decision Tree—a baseline method that explores tool-use paths using a tree search strategy