Confucius: Iterative Tool Learning from Introspection Feedback by Easy-to-Difficult Curriculum

📝 Paper Summary

Tool-use post-training, Curriculum learning for agents, Self-evolving, Agentic reasoning

CTL (Curriculum Tool Learning) trains LLMs to use tools through a three-stage, easy-to-difficult curriculum and an iterative self-instruction process that generates new training data from the model's introspection on its past errors.

Core Problem

Existing tool-learning methods often train on limited, simple toolsets using static self-instruction, failing to generalize to complex real-world scenarios requiring selection from massive tool libraries.

Why it matters:

Real-world applications involve thousands of tools, requiring models to distinguish between relevant and irrelevant candidates
Tool complexity varies significantly; simple static datasets fail to capture the nuance needed for complicated tools (e.g., navigation vs. simple search)
Standard self-instruction lacks feedback mechanisms, leading models to overfit simple tools while failing to master intricate ones

Concrete Example: A Google Map tool might only need coordinates for 'exploring', but requires start/end points and preferences for 'planning a commute'. A model trained only on simple cases fails to provide the necessary parameters for the complex case.

Key Novelty

Curriculum Tool Learning (CTL) with Iterative Self-instruct from Introspective Feedback (ISIF)

Decomposes training into three stages (Warm-up, In-category, Cross-category) to gradually increase difficulty from simple execution to complex selection from large libraries
Uses an iterative feedback loop where the model 'introspects' on its own failures to generate new, targeted training examples for tools it currently struggles with, rather than random sampling
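
The feedback loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the additive smoothing, and the toy failure counts are all hypothetical, and in the paper the introspection and data generation steps rely on ChatGPT prompts rather than simple counters.

```python
import random

def isif_sampling_weights(attempts, failures, smoothing=1.0):
    """Per-tool sampling weights proportional to a smoothed failure rate.

    attempts / failures: dicts mapping tool name -> counts from the latest
    evaluation round (hypothetical bookkeeping, not from the paper).
    """
    weights = {}
    for tool, n in attempts.items():
        f = failures.get(tool, 0)
        # Smoothing keeps rarely failing tools from dropping to zero weight.
        weights[tool] = (f + smoothing) / (n + 2 * smoothing)
    total = sum(weights.values())
    return {tool: w / total for tool, w in weights.items()}

def pick_tools_for_next_round(weights, k, seed=0):
    """Sample k tools (with replacement) to generate new training examples for."""
    rng = random.Random(seed)
    tools = list(weights)
    return rng.choices(tools, weights=[weights[t] for t in tools], k=k)

# Toy failure log: the routing tool fails most often, so it is sampled most.
attempts = {"search": 20, "maps_route": 20, "calendar": 20}
failures = {"search": 1, "maps_route": 12, "calendar": 4}
weights = isif_sampling_weights(attempts, failures)
next_round = pick_tools_for_next_round(weights, k=10)
```

The point of the design is that the next round of generated data is biased toward tools the model currently gets wrong, instead of being sampled uniformly.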

Evaluation Highlights

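Outperforms ChatGPT (tuning-free) by 9.2 points in success rate on unseen tool categories (ToolBench I2-Cat., 51.1 → 60.3)
Surpasses GPT4Tools (tuning-based) by 13.5 points in success rate on the same unseen-category split (I2-Cat., 46.8 → 60.3); gains on unseen instructions (I1-Inst.) are smaller, at 2.0 to 2.8 points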
Achieves comparable performance to ChatGPT on unseen datasets while using a much smaller open-source backbone (e.g., LLaMA-7B)

Breakthrough Assessment

7/10

Strong methodological contribution in curriculum design and dynamic data generation for tool use. Demonstrates solid gains over both tuning-free and tuning-based baselines on standard benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Tool learning, where an LLM must select and execute appropriate tools from a large candidate set to answer a user query

Inputs: User query $q$, set of candidate tools $T$

Outputs: Sequence of tool calls and final response

Pipeline Flow

Tool Retriever → Tool Execution/Reasoning (LLM) → Response Generation
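
As a rough sketch of how these three steps fit together, the snippet below wires a toy retriever to a placeholder LLM. Everything here is an assumption for illustration: the bag-of-words `embed` function stands in for the fine-tuned Sentence-BERT retriever, `fake_llm` stands in for the tool-use LLM, and the tool names and descriptions are invented.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a dense sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, tools, top_k=2):
    """Rank tool descriptions against the query and keep the top_k tool names."""
    q = embed(query)
    ranked = sorted(tools, key=lambda name: cosine(q, embed(tools[name])), reverse=True)
    return ranked[:top_k]

def run_pipeline(query, tools, llm, top_k=2):
    """Retriever -> LLM tool selection; response generation is left to the LLM."""
    candidates = retrieve(query, tools, top_k)
    call = llm(query, candidates)  # hypothetical interface: returns a tool-call dict
    return {"candidates": candidates, "call": call}

# Hypothetical tool library (name -> description).
TOOLS = {
    "weather": "get the weather forecast for a city",
    "maps_route": "plan a route between a start and an end location",
    "translate": "translate text between languages",
}

def fake_llm(query, candidates):
    """Placeholder for the tool-use LLM: always picks the top-ranked candidate."""
    return {"tool": candidates[0], "args": {"query": query}}

result = run_pipeline("plan a route from home to the office", TOOLS, fake_llm)
```

In a real system the LLM, not the retriever, makes the final choice among the retrieved candidates, which is why the retriever only narrows the pool rather than returning a single tool.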

System Modules

Tool Retriever

Select relevant tools from a massive pool based on query semantics

Model or implementation: Sentence-BERT (fine-tuned)

Tool-Use LLM

Reason about the query, select specific tools from the retrieved candidates, generate arguments, and execute

Model or implementation: LLaMA-2-7B / Vicuna-7B

Novel Architectural Elements

Integration of an introspective feedback loop into the data generation pipeline (ISIF), dynamically altering the training distribution based on model performance
Three-stage curriculum architecture explicitly separating execution learning, selection learning, and retrieval-based learning

Modeling

Base Model: LLaMA-2-7B and Vicuna-7B

Training Method: Supervised Fine-Tuning (SFT) on dynamically generated datasets

Objective Functions:

Purpose: Standard language modeling loss for instruction tuning.

Formally: Maximize likelihood of target tokens given context.
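
Written out, a standard instruction-tuning objective of this kind takes the form below. The notation is an assumption introduced here for illustration and may differ from the paper's: $q$ and $T$ are the query and candidate tools from the problem definition, $y = (y_1, \dots, y_n)$ is the target sequence (tool calls plus final response), and $\theta$ denotes the model parameters.

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{n} \log p_\theta\left(y_t \mid y_{<t},\, q,\, T\right)
$$

Minimizing $\mathcal{L}(\theta)$ is equivalent to maximizing the likelihood of the target tokens given the context.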

Adaptation: Full fine-tuning

Trainable Parameters: All parameters of the base model

Training Data:

Stage 1 (Warm-up): 80k instances (ToolBench)
Stage 2 (In-category): Constructed by mixing relevant tools within the same category
Stage 3 (Cross-category): Constructed using dense retrieval to simulate real-world noise
ISIF: Iteratively generated data based on failure analysis
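
One way to picture how the three stages differ is by the candidate tool list shown to the model for each training query. The sketch below is illustrative only: `build_candidates`, the toy library, and the stub retriever are hypothetical, and always including the gold tool among the cross-category candidates is an assumption, not something stated in the summary.

```python
import random

def build_candidates(gold, stage, library, retriever=None, query=None, k=3, seed=0):
    """Return the candidate tools for one training example at a given stage.

    library: dict mapping tool name -> category (hypothetical format).
    retriever: callable (query, k) -> list of tool names, used in stage 3.
    """
    rng = random.Random(seed)
    if stage == "warm_up":
        # Stage 1: only the gold tool, so the model practices execution.
        return [gold]
    if stage == "in_category":
        # Stage 2: distractors come from the gold tool's own category.
        pool = [t for t, c in library.items() if c == library[gold] and t != gold]
        distractors = rng.sample(pool, min(k - 1, len(pool)))
    elif stage == "cross_category":
        # Stage 3: distractors come from retrieval over the whole library.
        retrieved = retriever(query, k)
        distractors = [t for t in retrieved if t != gold][: k - 1]
    else:
        raise ValueError(f"unknown stage: {stage}")
    candidates = [gold] + distractors
    rng.shuffle(candidates)  # avoid leaking the gold tool through its position
    return candidates

LIBRARY = {
    "maps_route": "travel",
    "hotel_search": "travel",
    "flight_status": "travel",
    "stock_quote": "finance",
    "translate": "language",
}

def stub_retriever(query, k):
    """Stand-in for a dense retriever; returns a fixed noisy ranking."""
    return ["stock_quote", "maps_route", "translate"][:k]

warm = build_candidates("maps_route", "warm_up", LIBRARY)
in_cat = build_candidates("maps_route", "in_category", LIBRARY)
cross = build_candidates("maps_route", "cross_category", LIBRARY,
                         retriever=stub_retriever, query="route to the airport")
```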

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 128
epochs: 2 (for Stage 1), varies for others
max_length: 4096

Compute: Training performed on 8 NVIDIA A800 GPUs

Comparison to Prior Work

vs. ToolLLM: ToolLLM trains on a static dataset whose solution paths are annotated with its DFSDT search procedure. CTL uses a dynamic, iterative curriculum that generates new data based on model weaknesses.
vs. GPT4Tools: GPT4Tools typically provides a fixed toolset. CTL explicitly trains for retrieval-based scenarios where the model must filter irrelevant retrieved tools.
vs. Standard Self-Instruct: Standard methods don't use feedback to update the data distribution. CTL's ISIF specifically targets 'intricate' tools that the model fails on.

Limitations

Dependency on ChatGPT for data generation and introspection (cost and availability constraints)
Retriever quality limits the upper bound of performance in the cross-category stage
Iterative process increases total training time compared to static one-pass training

Reproducibility

Code: https://github.com/shizhl/CTL

Both code and data are publicly available in this repository. The paper details the prompts used for ISIF and the curriculum stages.

📊 Experiments & Results

Evaluation Setup

Tool-use evaluation on the ToolBench dataset, involving instruction following and tool execution.

Benchmarks:

ToolBench (Tool Learning / API Execution)

Metrics:

Success Rate (SR)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ (points)
Performance comparison on Unseen Instructions (Test Set I1) showing generalization to new queries for known tools.
ToolBench (I1-Inst.)	Success Rate	57.2	60.0	+2.8
ToolBench (I1-Inst.)	Success Rate	58.0	60.0	+2.0
Performance comparison on Unseen Tools (Test Set I2 Category) showing generalization to completely new tool categories.
ToolBench (I2-Cat.)	Success Rate	51.1	60.3	+9.2
ToolBench (I2-Cat.)	Success Rate	46.8	60.3	+13.5
Performance on Unseen Tools (Test Set I3 Tool) showing generalization to new tools within known categories.
ToolBench (I3-Tool)	Success Rate	55.6	65.3	+9.7
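
The Δ column is a difference in success rate, so it is measured in percentage points rather than relative percent. The short check below recomputes it from the table's own numbers; the `success_rate` helper is a hypothetical illustration of the metric, and the row labels only repeat the split names because the table does not name each baseline.

```python
# (split, baseline success rate, this paper's success rate), copied from the table.
rows = [
    ("I1-Inst.", 57.2, 60.0),
    ("I1-Inst.", 58.0, 60.0),
    ("I2-Cat.", 51.1, 60.3),
    ("I2-Cat.", 46.8, 60.3),
    ("I3-Tool", 55.6, 65.3),
]

def success_rate(outcomes):
    """Percentage of queries solved; outcomes is a list of booleans."""
    return 100.0 * sum(outcomes) / len(outcomes)

# Round to one decimal to absorb floating-point noise in the subtraction.
deltas = [(split, round(ours - base, 1)) for split, base, ours in rows]
```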

Main Takeaways

Multi-stage curriculum effectively bridges the gap between simple execution and complex selection, improving performance on unseen tools.
Iterative feedback (ISIF) prevents the model from overfitting to easy tools by forcing it to practice intricate ones.
The approach generalizes to unseen instructions and unseen tool categories, outperforming proprietary models such as ChatGPT on the ToolBench splits evaluated.

📚 Prerequisite Knowledge

Prerequisites

Instruction tuning / Fine-tuning LLMs
Tool learning / API integration with LLMs
Curriculum learning concepts
Retrieval-augmented generation (for tool selection)

Key Terms

ISIF: Iterative Self-instruct from Introspective Feedback, a data generation method where the model identifies tools it struggles with and generates new training examples specifically for those tools

Introspection: The process where the model (or a teacher model) evaluates the correctness of a tool execution to identify failure modes

Curriculum Learning: A training strategy where the model is presented with easier tasks first (e.g., executing a single given tool) before moving to harder tasks (e.g., selecting the right tool from a large set of retrieved candidates)

Tool Retrieval: The process of selecting a small subset of relevant tools from a large library based on the user query, typically using a dense retriever

Self-Instruct: A framework for generating instruction-following data by prompting a strong teacher model (like ChatGPT) to generate inputs and outputs