Making Language Models Better Tool Learners with Execution Feedback

📝 Paper Summary

Tool-use post-training · RL-based tool learning

Trice is a two-stage training framework that teaches language models to use tools only when necessary, using reinforcement learning on execution feedback.

Core Problem

Existing tool learning methods often force models to use tools indiscriminately, even for simple tasks the model could solve itself.

Why it matters:

Using tools for simple problems can propagate errors (e.g., wrong tool selection or inputs) rather than helping
Current approaches lack the ability to discern *when* a tool is actually necessary versus when the model's internal knowledge suffices
Excessive reliance on tools increases computational cost and latency without guaranteeing better performance

Concrete Example: For a simple question like 'What is 2+2?', a model forced to use a calculator might generate a malformed API call or misinterpret the output, failing a task it could have answered directly. Trice teaches the model to answer '4' directly but invoke a calculator for '345 * 921'.

Key Novelty

Tool Learning with Execution Feedback (Trice)

Stage I clones behavior from a dataset in which tool use is labeled only for hard instances (those the base model fails on)
Stage II applies Reinforcement Learning with Execution Feedback (RLEF) to align the model with responses whose decision to use or skip a tool actually led to a correct result

Architecture

The two-stage training framework of Trice.

Evaluation Highlights

Outperforms baselines (including ChatGPT and Vicuna) on 4 tasks across 8 datasets, showing better selective tool usage
Reduces tool usage rate on easy instances while maintaining high performance on hard instances
Achieves higher accuracy than 100% tool-use baselines, indicating that selective usage helps limit error propagation

Breakthrough Assessment

7/10

Addresses a critical and often overlooked problem in tool learning (over-reliance). The methodology is sound and the results demonstrate clear benefits of selective execution.

⚙️ Technical Details

Problem Definition

Setting: Instruction following where the model must decide between generating a direct answer 'a' or a tool API call 't' for a given question 'q'

Inputs: Instruction 's' and Question 'q'

Outputs: Either a direct answer or a structured tool call (Tool_name(Tool_input))

Pipeline Flow

Input Processing (Question + Instruction)
Decision Making (Internal Knowledge vs. Tool)
Tool Execution (if selected)
Output Generation
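
The pipeline above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `TOOLS` registry, the `calculator` stand-in, and the `resolve` helper are hypothetical names, and the only detail taken from the summary is the `Tool_name(Tool_input)` output format.

```python
import re
from typing import Callable, Dict


def calculator(expression: str) -> str:
    """Toy stand-in for the Calculator tool (assumption: plain arithmetic only)."""
    # Reject anything that is not digits, whitespace, or basic arithmetic symbols.
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expression):
        raise ValueError(f"unsupported expression: {expression!r}")
    return str(eval(expression, {"__builtins__": {}}, {}))


# Hypothetical registry mapping tool names to callables.
TOOLS: Dict[str, Callable[[str], str]] = {"Calculator": calculator}

# Matches the structured form Tool_name(Tool_input).
TOOL_CALL = re.compile(r"^\s*(\w+)\((.*)\)\s*$", re.DOTALL)


def resolve(model_output: str) -> str:
    """Execute the tool if the output is a known tool call, else return it as the answer."""
    match = TOOL_CALL.match(model_output)
    if match and match.group(1) in TOOLS:
        name, tool_input = match.groups()
        return TOOLS[name](tool_input)
    return model_output.strip()


print(resolve("4"))                      # direct answer passes through unchanged
print(resolve("Calculator(345 * 921)"))  # tool call is parsed and executed
```

Keeping the decision implicit in the generated text means no separate classifier is needed: whether a tool runs depends only on whether the output parses as a tool call.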

System Modules

LLM Backbone

Decides whether to answer directly or generate a tool call

Model or implementation: ChatGLM-6B, Alpaca-7B, Vicuna-7B (trained via LoRA)

External Tool

Executes the specific function if called

Model or implementation: Calculator, WikiSearch, Calendar, Translation Model

Novel Architectural Elements

End-to-end training pipeline incorporating a binary decision (Tool vs. No-Tool) implicitly through generation, reinforced by execution feedback rather than just text similarity

Modeling

Base Model: ChatGLM-6B, Alpaca-7B, Vicuna-7B

Training Method: Two-stage: Behavior Cloning (SFT) + Reinforcement Learning with Execution Feedback (RLEF)

Objective Functions:

Purpose: Supervised learning for behavior cloning.
Formally: Standard cross-entropy loss on the target tokens (L_SFT).

Purpose: Rank candidate responses based on quality.
Formally: L_rank = -log(sigmoid(score(y_better) - score(y_worse)))

Purpose: Maintain language capabilities during RL.
Formally: L_RLEF = L_rank + alpha * L_SFT
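
As a rough numerical illustration of the two formulas above, the sketch below computes the pairwise ranking loss and the combined objective for scalar inputs. The function names are hypothetical, and how `score` is obtained from the model is not specified in this summary, so the scores are passed in as plain numbers.

```python
import math


def rank_loss(score_better: float, score_worse: float) -> float:
    """L_rank = -log(sigmoid(score_better - score_worse)) for one candidate pair."""
    margin = score_better - score_worse
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of m
    # so that exp() is never called on a large positive number.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def rlef_loss(score_better: float, score_worse: float,
              sft_loss: float, alpha: float = 0.1) -> float:
    """L_RLEF = L_rank + alpha * L_SFT, with sft_loss given as a precomputed scalar."""
    return rank_loss(score_better, score_worse) + alpha * sft_loss


# The loss shrinks as the better response is scored further above the worse one.
print(rank_loss(2.0, 0.0) < rank_loss(0.0, 0.0) < rank_loss(0.0, 2.0))
```

With equal scores the ranking term is log 2, and it approaches zero as the margin grows, so the gradient pushes the model to score the execution-verified response higher.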

Adaptation: LoRA (Low-Rank Adaptation)

Training Data:

Derived from benchmarks (Math, QA, etc.).
Pseudo-labels are generated with ChatGPT: if the base model fails on an instance, ChatGPT generates the tool call; if the base model succeeds, the label is 'None' (no tool).
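
The labeling rule can be sketched as follows. This is an illustrative assumption about the data format, not the released pipeline: `build_pseudo_labels`, the dictionary keys, exact-match checking, and the two toy callables standing in for the base model and the ChatGPT teacher are all hypothetical.

```python
from typing import Callable, Dict, List


def build_pseudo_labels(
    examples: List[Dict[str, str]],
    base_model: Callable[[str], str],
    teacher_tool_call: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Label easy instances 'None' and hard instances with a teacher-written tool call."""
    labeled = []
    for example in examples:
        prediction = base_model(example["question"])
        if prediction.strip() == example["answer"].strip():
            label = "None"  # base model succeeds: no tool needed
        else:
            label = teacher_tool_call(example["question"])  # base model fails
        labeled.append({**example, "label": label})
    return labeled


# Toy stand-ins for the base model and the teacher (assumptions for the demo).
def toy_base_model(question: str) -> str:
    return {"What is 2+2?": "4"}.get(question, "unknown")


def toy_teacher(question: str) -> str:
    return "Calculator(345 * 921)"


examples = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "What is 345 * 921?", "answer": "317745"},
]
labeled = build_pseudo_labels(examples, toy_base_model, toy_teacher)
for row in labeled:
    print(row["question"], "->", row["label"])
```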

Key Hyperparameters:

learning_rate: 2e-5 (Alpaca/Vicuna), 2e-5 to 3e-4 (ChatGLM)
epochs: 5 (Stage I), 2 (Stage II)
alpha: {0.01, 0.1, 1}
lora_r: 8
lora_alpha: 16
batch_size: 128 (micro batch 4/8)

Compute: Trained on 4 NVIDIA 3090 GPUs

Comparison to Prior Work

vs. Toolformer: Trice uses explicit execution feedback ranking (RLEF) rather than just perplexity-based filtering [not cited in paper]
vs. GPT-3.5/ChatGPT: Trice is a fine-tuning framework for smaller open-source models, enabling them to outperform larger frozen models on specific tool tasks
vs. Standard SFT (100% Tool): Trice teaches *selective* usage, avoiding tools for easy queries where SFT fails due to over-reliance

Limitations

Relies on ChatGPT for generating pseudo-labels, inheriting potential biases or errors from the teacher model
Evaluated on a limited set of specific tools (Calculator, Search, Calendar, Translator), effectively one tool per task type
Requires execution feedback which might be sparse or binary (correct/incorrect) in some real-world scenarios
Two-stage training adds complexity compared to simple SFT

Reproducibility

Code: https://github.com/zjunlp/TRICE

The code is publicly available at the repository above. Data preparation (using ChatGPT for pseudo-labeling) is described, and the LoRA and training hyperparameters are provided in the appendix.

📊 Experiments & Results

Evaluation Setup

Tasks: Arithmetic (Math), Knowledge QA (LAMA), Temporal QA (TimeQA), Multilingual QA. Each task is paired with a specific tool.

Benchmarks:

GSM8K (Arithmetic Reasoning)
SVAMP (Arithmetic Reasoning)
LAMA (Knowledge Retrieval)
TimeQA (Temporal Reasoning)

Metrics:

Accuracy (Match with gold answer)
Tool Usage Rate (Percentage of times tool was called)
Statistical methodology: Not explicitly reported in the paper
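
Both metrics reduce to simple counts over the evaluation records. The sketch below assumes a hypothetical record format with `prediction`, `gold`, and `used_tool` fields and exact-match scoring; the paper's actual matching rule may differ.

```python
from typing import Dict, List, Tuple


def evaluate(records: List[Dict]) -> Tuple[float, float]:
    """Return (accuracy %, tool usage rate %) over a list of evaluation records."""
    total = len(records)
    correct = sum(r["prediction"] == r["gold"] for r in records)
    tool_calls = sum(bool(r["used_tool"]) for r in records)
    return 100.0 * correct / total, 100.0 * tool_calls / total


# Toy records: three of four are correct, two of four invoked a tool.
records = [
    {"prediction": "4", "gold": "4", "used_tool": False},
    {"prediction": "317745", "gold": "317745", "used_tool": True},
    {"prediction": "Paris", "gold": "Paris", "used_tool": False},
    {"prediction": "1999", "gold": "2001", "used_tool": True},
]
accuracy, tool_rate = evaluate(records)
print(f"accuracy={accuracy:.1f}%  tool_usage={tool_rate:.1f}%")
```

Reporting the two numbers together is what makes the selective-usage claim checkable: accuracy alone cannot show whether a gain came from calling tools less often.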

Key Results

Trice outperforms baselines on arithmetic reasoning tasks using a Calculator tool.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy	39.27	42.08	+2.81
SVAMP	Accuracy	56.40	60.43	+4.03

Trice demonstrates superior performance on Knowledge QA using a Search tool (Atlas).

Benchmark	Metric	Baseline	This Paper	Δ
LAMA	Accuracy	45.05	54.00	+8.95

Ablation studies confirm the contribution of the RLEF stage (Stage II).

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy	39.27	42.08	+2.81

Experiment Figures

Ablation study bar charts comparing Stage I vs. Stage II vs. Full Trice across benchmarks.

Analysis of 'Insufficient Tool Learning' vs 'Excessive Reliance' across different training steps.

Main Takeaways

Selective tool use is superior to mandatory tool use: forcing models to use tools for everything (100% Tool baseline) often hurts performance compared to Trice.
Execution feedback is effective: The RLEF stage consistently improves over simple behavior cloning, refining the decision boundary of when to use tools.
Generalization: Trice works across different backbones (Alpaca, Vicuna, ChatGLM) and different task types (Math, QA, Translation).
Analysis shows Trice reduces 'Over-Reliance' (using tools when not needed) and 'Insufficient Learning' (not using tools when needed) compared to baselines.
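
The two failure modes named in the last takeaway can be counted with a small tally. This sketch assumes each record carries a hypothetical `needs_tool` flag (for example, set when the base model fails without a tool); how the paper operationalizes "needed" is not stated in this summary.

```python
from collections import Counter
from typing import Dict, List


def tool_decision_errors(records: List[Dict[str, bool]]) -> Counter:
    """Tally tool-use decisions into over-reliance, insufficient learning, or appropriate."""
    counts: Counter = Counter()
    for record in records:
        if record["used_tool"] and not record["needs_tool"]:
            counts["over_reliance"] += 1          # tool called when not needed
        elif not record["used_tool"] and record["needs_tool"]:
            counts["insufficient_learning"] += 1  # tool skipped when needed
        else:
            counts["appropriate"] += 1
    return counts


decisions = [
    {"used_tool": True, "needs_tool": False},
    {"used_tool": False, "needs_tool": True},
    {"used_tool": True, "needs_tool": True},
    {"used_tool": False, "needs_tool": False},
]
print(tool_decision_errors(decisions))
```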

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF) concepts
Instruction Tuning / Fine-tuning
Basic understanding of Tool Learning/API calling

Key Terms

RLEF: Reinforcement Learning with Execution Feedback—reinforcing model behavior based on the success/failure of the code/tool execution rather than just human preference

Behavior Cloning: Supervised fine-tuning where the model learns to mimic a dataset of demonstrations (in this case, correct tool usage patterns)

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices

PPO: Proximal Policy Optimization—an RL algorithm used to update the policy

Tool API: A specific interface (like a calculator or search engine) the model can invoke

Pass rate: The percentage of test cases where the model's output (either direct or via tool) is correct

Ranking Loss: A loss function that trains the model to assign higher scores to better candidate responses compared to worse ones