Small LLMs Are Weak Tool Learners: A Multi-LLM Agent

📝 Paper Summary

Multi-call tool use with flexible plan, Multi-agent

α-UMi decomposes a tool-learning agent into three specialized small LLMs (planner, caller, summarizer) fine-tuned via a global-to-local progressive strategy, outperforming single-LLM approaches.

Core Problem

Single small LLMs (e.g., 7B) struggle to simultaneously master diverse tool-use capabilities like reasoning, precise API formatting, and summarization due to limited model capacity.

Why it matters:

Small open-source models (like LLaMA-7B) significantly lag behind closed-source giants (GPT-4) in complex agent tasks
Training a single model for all agent sub-tasks creates interference, where improving reasoning might degrade API syntax compliance
Real-world tool updates require expensive retraining of the entire monolithic model rather than just the relevant component

Concrete Example: In a complex task requiring multiple API calls (e.g., searching for a video and then getting its details), a single 13B model might get stuck in a loop calling a broken API or fail to format the request correctly, whereas the specialized planner in α-UMi detects the failure and redirects the caller to an alternative tool.

Key Novelty

α-UMi (Multi-LLM Agent Framework)

Decompose the monolithic agent role into three specialized roles: a 'Planner' for reasoning/direction, a 'Caller' for syntax-perfect tool invocation, and a 'Summarizer' for user response generation
Global-to-Local Progressive Fine-Tuning (GLPFT): First train a single backbone on all tasks to establish shared understanding, then spawn three copies and fine-tune each exclusively on its specific sub-task to maximize specialization

Architecture

Conceptual comparison between Single-LLM agent and α-UMi Multi-LLM agent

Evaluation Highlights

+7.00 Plan ACC (Planning Accuracy) and +5.68 Act. EM (Action Exact Match) on ToolBench (In-domain) using LLaMA-2-7B compared to a single-LLM baseline
+10.2 pass rate improvement over ToolLLaMA on ToolBench real-time evaluation
Surpasses 13B Single-LLM agents on most tool-use metrics while using only 7B models for each multi-agent component, indicating that small specialized models can beat larger monolithic ones

Breakthrough Assessment

7/10

Strong empirical evidence that decomposing agent functions enables smaller models to compete with larger ones. The Global-to-Local fine-tuning strategy is a practical innovation for maximizing small model capacity.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn tool-use agent task where an agent interacts with external APIs to solve a user instruction q

Inputs: User instruction q, system prompt P, and execution trajectory τ_{t-1}

Outputs: Rationale r_t, Action a_t (tool call), or Final Answer a_n

Pipeline Flow

Planner (Determines next step: Call, Summarize, or Give Up)
Caller (If Planner says 'Call': Generates specific tool invocation syntax)
Tool Execution (External environment returns observation)
Summarizer (If Planner says 'Summarize': Generates final response based on history)
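
The control flow above can be sketched as a loop in which the Planner's decision selects which component runs next. This is a minimal sketch under stated assumptions, not the paper's implementation: the callables `planner`, `caller`, `summarizer`, and `execute_tool`, the decision strings, and the `max_steps` cap are all hypothetical names introduced for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: the planner returns (rationale, decision), where
# decision is one of "call", "summarize", or "give_up".
Planner = Callable[[str, List[str]], Tuple[str, str]]


def run_agent(
    instruction: str,
    planner: Planner,
    caller: Callable[[str, List[str], str], str],
    summarizer: Callable[[str, List[str], str], str],
    execute_tool: Callable[[str], str],
    max_steps: int = 8,  # assumed safety cap, not taken from the paper
) -> str:
    trajectory: List[str] = []
    for _ in range(max_steps):
        # The planner reasons over the history and picks the next component.
        rationale, decision = planner(instruction, trajectory)
        trajectory.append(f"Rationale: {rationale}")
        if decision == "call":
            # The caller turns the rationale into a concrete tool invocation.
            action = caller(instruction, trajectory, rationale)
            observation = execute_tool(action)
            trajectory.append(f"Action: {action}")
            trajectory.append(f"Observation: {observation}")
        elif decision == "summarize":
            # The summarizer writes the final answer from the full history.
            return summarizer(instruction, trajectory, rationale)
        else:  # "give_up"
            return "GIVE_UP"
    return "GIVE_UP"  # step budget exhausted
```

Because each role is a separate callable, any one of them can be swapped for a different checkpoint without touching the loop.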

System Modules

Planner

Generate rationale and decide the next high-level action (Call, Summarize, or Give Up)

Model or implementation: LLaMA-2-7B or 13B (fine-tuned)

Caller

Generate the precise API call (action) based on the Planner's rationale

Model or implementation: LLaMA-2-7B or 13B (fine-tuned)

Summarizer

Synthesize the execution history into a final user-facing answer

Model or implementation: LLaMA-2-7B or 13B (fine-tuned)

Novel Architectural Elements

Explicit decomposition of the ReACT loop into three distinct LLM checkpoints (Planner, Caller, Summarizer) rather than a single model
Conditional control flow where Planner dictates which specialized LLM activates next

Modeling

Base Model: LLaMA-2-chat (7B and 13B variants)

Training Method: Global-to-Local Progressive Fine-Tuning (GLPFT)

Objective Functions:

Purpose: Standard language modeling loss for next-token prediction.

Formally: Autoregressive cross-entropy loss.
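
Written out, the autoregressive cross-entropy takes the standard form below. The notation is an assumption for illustration and is not copied from the paper: x is the input context (system prompt P, instruction q, and trajectory τ_{t-1}), y_1, ..., y_T are the target tokens, and m_i ∈ {0, 1} is a mask selecting which target tokens contribute to the loss. Normalizing by the number of unmasked tokens is also an assumption.

```latex
\mathcal{L}(\theta) = -\frac{1}{\sum_{i=1}^{T} m_i} \sum_{i=1}^{T} m_i \, \log p_\theta\left(y_i \mid x, y_{<i}\right)
```

With m_i = 1 for every target token this reduces to the ordinary next-token loss; restricting m_i to one sub-task's tokens gives the masked variant used for specialization.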

Adaptation: Full fine-tuning

Trainable Parameters: All parameters (during respective stages)

Training Data:

Stage 1 (Global): Train one backbone on the entire ToolBench/ToolAlpaca dataset, without separating the sub-tasks
Stage 2 (Local): Reorganize dataset into sub-tasks (Plan, Call, Summarize). Mask loss for non-target tokens (e.g., Planner only trained on rationales).
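
The Stage 2 loss masking can be illustrated with a small label-building routine. This is a sketch under assumptions: the segment roles, the token ids, and the `build_labels` helper are hypothetical, and the paper's actual data pipeline may organize examples differently. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so positions carrying it contribute nothing to the loss.

```python
from typing import List, Set, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

# Each segment is (role, token_ids); the role names are illustrative only.
Segment = Tuple[str, List[int]]


def build_labels(
    segments: List[Segment], target_roles: Set[str]
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) with non-target tokens masked out."""
    input_ids: List[int] = []
    labels: List[int] = []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role in target_roles:
            labels.extend(token_ids)  # these tokens are trained on
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # context only
    return input_ids, labels


# Toy example with made-up token ids.
example = [
    ("prompt", [1, 2, 3]),
    ("rationale", [10, 11]),
    ("action", [20, 21, 22]),
    ("observation", [30]),
]
planner_inputs, planner_labels = build_labels(example, {"rationale"})
caller_inputs, caller_labels = build_labels(example, {"action"})
```

Both models see the same input tokens; only the label mask differs, which is what restricts each copy to its own sub-task.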

Key Hyperparameters:

learning_rate_global: 5e-5
learning_rate_local: 1e-5
epochs_global: 2
epochs_local_planner: 1
epochs_local_caller: 1
epochs_local_summarizer: 2
global_batch_size: 48
max_sequence_length: 4096
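
To make the two-stage schedule concrete, the hyperparameters listed above can be collected into a small configuration. This is only an illustrative sketch: the `StageConfig` class, the `schedule` helper, and the checkpoint labels "base" and "global" are hypothetical and do not come from the released code. The numeric values are the ones listed above.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Tuple


@dataclass(frozen=True)
class StageConfig:
    learning_rate: float
    epochs: int


GLOBAL_BATCH_SIZE = 48
MAX_SEQUENCE_LENGTH = 4096

# Stage 1: a single backbone trained on all sub-tasks.
global_stage = StageConfig(learning_rate=5e-5, epochs=2)

# Stage 2: three copies, each fine-tuned on its own sub-task.
local_stage: Dict[str, StageConfig] = {
    "planner": StageConfig(learning_rate=1e-5, epochs=1),
    "caller": StageConfig(learning_rate=1e-5, epochs=1),
    "summarizer": StageConfig(learning_rate=1e-5, epochs=2),
}


def schedule() -> Iterator[Tuple[str, str, StageConfig]]:
    """Yield (stage_name, initialize_from, config) in training order."""
    yield "global", "base", global_stage
    for role, cfg in local_stage.items():
        # Every local model starts from the Stage 1 checkpoint.
        yield f"local/{role}", "global", cfg
```

Iterating over `schedule()` yields four training runs: one global run from the base model, then three local runs that each start from the global checkpoint.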

Compute: Training 7B α-UMi: 63.34h on 8 A100 GPUs (vs 41.54h for Single-LLM). Inference time per instance: 6.27s (comparable to Single-LLM 6.41s).

Comparison to Prior Work

vs. ToolLLaMA: Decomposes the single model into three specialized models; introduces two-stage GLPFT training
vs. Multi-LLM one-stage: Introduces the 'Global' fine-tuning stage to establish shared context before specialization, which proves critical for performance
vs. MetaGPT/ChatDev: Focuses on fine-tuning open-source small models for general tool use rather than prompting closed-source models for software engineering

Limitations

Increased storage cost (3x model parameters stored) compared to single-LLM
Higher training cost (about 1.5x training time: 63.34h vs. 41.54h for the 7B model) due to the multi-stage process
Current reliance on simple greedy decoding; no advanced search (like Tree of Thoughts) integrated yet

Reproducibility

Code: https://github.com/X-PLUG/Multi-LLM-Agent

Code and data processing scripts available at https://github.com/X-PLUG/Multi-LLM-Agent. Uses standard datasets (ToolBench, ToolAlpaca). Backbone models are LLaMA-2-chat.

📊 Experiments & Results

Evaluation Setup

Tool-use evaluation on held-out test sets involving API calls

Benchmarks:

ToolBench (Diverse real-world API calling tasks)
ToolAlpaca (Simulated tool-use environments)

Metrics:

Plan ACC (Planning Accuracy)
Act. EM (Action Exact Match)
Hallu. (Hallucination Rate)
Arg. F1 (Argument F1)
R-L (Rouge-L for summary)
Pass Rate (Real-time execution success)
Win Rate (vs ChatGPT-ReACT)
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparison against Single-LLM baselines on ToolBench (In-domain) shows α-UMi's superiority across planning and execution metrics.
Benchmark	Metric	Baseline	This Paper	Δ
ToolBench (In-domain)	Plan ACC	81.92	88.92	+7.00
ToolBench (In-domain)	Act. EM	53.26	58.94	+5.68
ToolBench (In-domain)	Hallu.	2.32	0.57	-1.75

Real-time evaluation results on ToolBench demonstrating execution success rates.
Benchmark	Metric	Baseline	This Paper	Δ
ToolBench	Pass Rate	60.7 (ToolLLaMA)	70.9	+10.2
ToolBench	Pass Rate	40.2	70.9	+30.7

Ablation study showing the necessity of the Global-to-Local (reuse) strategy.
Benchmark	Metric	Baseline	This Paper	Δ
ToolBench (In-domain)	Act. EM	45.11	58.94	+13.83

Experiment Figures

Data scaling law curves for Plan ACC, Act. EM, Hallu., etc., as training data increases from 12.1k to 62.7k

Training loss curves for Rationale, Action, and Answer components over epochs

Main Takeaways

Specialization beats scaling: A 7B multi-LLM agent (α-UMi) outperforms a 13B single-LLM agent on most tool-use metrics.
Global-to-Local training is critical: Skipping the global fine-tuning stage (Multi-LLM one-stage) results in performance worse than the baseline Single-LLM, likely due to a lack of shared context.
Hallucination reduction: Separating out the 'Caller' role reduces API name hallucinations compared to monolithic models (Hallu. drops from 2.32 to 0.57 on ToolBench in-domain).
Data Reuse helps: Reusing the same instructions for local fine-tuning (Stage 2) works better than introducing new/diverse instructions, likely avoiding distribution shift between the global and local phases.

📚 Prerequisite Knowledge

Prerequisites

Understanding of the ReACT framework (Reasoning and Acting)
Familiarity with Instruction Tuning / Supervised Fine-Tuning (SFT)
Basic knowledge of LLM agent architectures

Key Terms

ReACT: Reasoning and Acting—a prompting framework where LLMs generate reasoning traces (thoughts) before taking actions (tool calls)

GLPFT: Global-to-Local Progressive Fine-Tuning—the paper's proposed training strategy where a model is first trained on all tasks, then cloned and specialized

Action EM: Action Exact Match—a metric checking if the predicted API call matches the ground truth exactly

Hallu.: Hallucination rate—specifically measuring how often the model invokes non-existent tools

Plan ACC: Planning Accuracy—accuracy of the agent's high-level decision (e.g., Call Tool vs. Finish vs. Give Up) at each step

Arg. F1: Argument F1—a metric measuring the overlap of arguments in the generated API call vs. the reference
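
The step-level metrics defined above can be sketched in a few lines. These are illustrative readings of the definitions, not the paper's evaluation scripts: the function names are hypothetical, arguments are compared as stringified (key, value) pairs, and an empty prediction matched against an empty reference is scored 1.0, all of which are assumptions.

```python
from typing import Any, Dict, List, Set


def plan_accuracy(pred: List[str], ref: List[str]) -> float:
    """Fraction of steps where the high-level decision matches."""
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same number of steps")
    if not ref:
        return 0.0
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)


def argument_f1(pred_args: Dict[str, Any], ref_args: Dict[str, Any]) -> float:
    """F1 over (key, value) pairs shared by prediction and reference."""
    pred_items = {(k, str(v)) for k, v in pred_args.items()}
    ref_items = {(k, str(v)) for k, v in ref_args.items()}
    if not pred_items and not ref_items:
        return 1.0  # assumed convention: nothing expected, nothing produced
    overlap = len(pred_items & ref_items)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_items)
    recall = overlap / len(ref_items)
    return 2 * precision * recall / (precision + recall)


def hallucination_rate(pred_names: List[str], available: Set[str]) -> float:
    """Fraction of predicted tool names that do not exist in the toolset."""
    if not pred_names:
        return 0.0
    return sum(name not in available for name in pred_names) / len(pred_names)
```

For example, a predicted call with arguments `{"q": "cats", "n": 5}` against a reference `{"q": "cats", "n": 10}` shares one of two pairs on each side, giving precision and recall of 0.5 each.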