TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use

📝 Paper Summary

Tool-use post-training, RL-based tool use

TL-Training improves LLM tool use by filtering error-prone training data, applying adaptive weights to key tool-name tokens, and using PPO with rewards tailored to specific tool-invocation error types.

Core Problem

Standard supervised fine-tuning for tool use suffers from noisy training data (where models mimic errors), ignores that certain tokens are more critical than others, and lacks mechanisms to correct specific error categories.

Why it matters:

17% of high-quality training data (e.g., RoTLLaMA) contains tool-calling errors, causing models to learn incorrect behaviors
Existing SFT treats all tokens equally, even though correcting just the first token of a tool name often fixes the entire prediction
Current models like ToolLLaMA-2-7B-v2 achieve only ~80% of GPT-4's performance, indicating significant bottlenecks in standard training paradigms

Concrete Example: In a trajectory where a model should call 'calculate_loan', the training data might contain a path where the model hallucinates 'get_loan_info'. Standard SFT forces the model to learn this hallucination. TL-Training identifies this error via feedback analysis and masks the loss so the model doesn't learn from the mistake.

Key Novelty

TL-Training (Task-Feature-Based Framework)

Mitigating Adverse Effects (MAE): Automatically detects erroneous tool calls in the training data by analyzing feedback and masks their loss so that they contribute no gradient during back-propagation
Prioritizing Key Tokens (PKT): Adaptively increases loss weights for the first token of a tool name and tokens sharing prefixes with other tools, forcing the model to focus on critical decision points
Error-Specific Reward Mechanism: Defines distinct penalties for tool hallucinations, parameter errors, and missing arguments, optimizing the model via PPO (Proximal Policy Optimization)

Architecture

The TL-Training framework pipeline, illustrating the three main components: Mitigating Adverse Effects (MAE), Prioritizing Key Tokens (PKT), and the Reward Mechanism for RL.

Evaluation Highlights

+15.78-point CF improvement on ToolAlpaca (single-turn), from 45.00 to 60.78, for TL-CodeLLaMA-2 compared to GPT-4-turbo
Achieves 5.64% total error rate on multi-turn benchmarks, second only to GPT-4o (4.76%) and outperforming Qwen-2-Instruct (7.49%)
Matches or exceeds closed-source performance using only 1,217 training samples, significantly less than typical large-scale datasets

Breakthrough Assessment

7/10

Strong empirical results with very little data (1.2k samples). The idea of masking erroneous SFT trajectories and weighting key tokens is intuitive and effective, though the architectural novelty is moderate.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn tool use where a model iteratively selects tools, processes feedback, and generates final answers

Inputs: User query q, collection of tools T, and history of tools/feedback

Outputs: Next tool call t_{s+1} or final answer

Pipeline Flow

Data Filtering (MAE): Analyze training trajectories → Identify errors via feedback → Mask loss for bad paths
SFT with Weighting (PKT): Fine-tune model → Apply high weights to tool-name start tokens
RL Optimization (PPO): Generate tool calls → Calculate reward based on error type → Update policy

System Modules

Adverse Effect Mitigator (Training - SFT)

Identifies erroneous tool calls in the dataset by checking feedback fields (e.g., 'Status: Fail') and masks them from the loss function

Model or implementation: Rule-based filter
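
A minimal sketch of how such a rule-based filter might build a loss mask, assuming each trajectory is a list of steps that pair a tool call with the feedback it produced. The `Step` structure, the `FAILURE_MARKERS` tuple, and the function names are hypothetical; only the 'Status: Fail' marker comes from the description above.

```python
from dataclasses import dataclass

# Hypothetical failure markers; only "Status: Fail" is mentioned in the summary.
FAILURE_MARKERS = ("Status: Fail",)


@dataclass
class Step:
    """One tool call and the feedback returned after executing it."""
    call_tokens: list[str]  # tokens of the model's tool call
    feedback: str           # raw feedback text for that call


def is_erroneous(step: Step) -> bool:
    """Flag a step whose feedback contains any known failure marker."""
    return any(marker in step.feedback for marker in FAILURE_MARKERS)


def build_loss_mask(trajectory: list[Step]) -> list[int]:
    """Return one 0/1 entry per call token; 0 excludes the token from the loss."""
    mask: list[int] = []
    for step in trajectory:
        keep = 0 if is_erroneous(step) else 1
        mask.extend([keep] * len(step.call_tokens))
    return mask
```

Masking at the token level, rather than dropping the whole trajectory, lets the correct steps of a partly faulty trajectory still contribute to training.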

Key Token Prioritizer (Training - SFT)

Dynamically adjusts loss weights for critical tokens (first token of tool name, disambiguating tokens)

Model or implementation: Weighting function
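
The following sketch illustrates one possible reading of this weighting, assuming tool names are already tokenized. A position counts as "key" if it is the first token, or if another tool's name begins with the same tokens that precede it. The function name, the fixed `key_weight`, and this interpretation of "disambiguating tokens" are assumptions; the paper's adaptive rule (bounded by w_max) is not reproduced here.

```python
def key_token_weights(
    target: list[str],
    all_tools: list[list[str]],
    key_weight: float = 9.0,   # assumption: reuse the reported w_max as a fixed weight
    base_weight: float = 1.0,
) -> list[float]:
    """Assign a loss weight to each token of a tokenized tool name.

    A position is treated as key if it is the first token, or if some other
    tool's name shares the tokens before it, so this token is where the two
    names can diverge.
    """
    others = [tool for tool in all_tools if tool != target]
    weights: list[float] = []
    for i in range(len(target)):
        shares_prefix = i > 0 and any(o[:i] == target[:i] for o in others)
        is_key = i == 0 or shares_prefix
        weights.append(key_weight if is_key else base_weight)
    return weights
```

With the hypothetical tools `get_loan_info`, `get_loan_rate`, and `calculate_loan`, every token of `get_loan_info` is a decision point because `get_loan_rate` shares its prefix, while only the first token of `calculate_loan` is.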

Reward Mechanism

Evaluates generated tool calls and assigns scores based on error hierarchy (e.g., hallucination vs. wrong parameter)

Model or implementation: Reward Function
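
A minimal sketch of an error-specific reward function, assuming each tool is described by a list of allowed parameters and a list of required ones. The reward values, the schema layout, and the function names are all hypothetical and chosen only to illustrate the idea of checking error types in order of severity; the paper's actual scores are not reproduced.

```python
# Hypothetical reward values, ordered from most to least severe error.
REWARDS = {
    "tool_hallucination": -2.0,
    "parameter_error": -1.5,
    "missing_argument": -1.0,
    "valid": 1.0,
}


def classify_call(name: str, args: dict, tools: dict[str, dict]) -> str:
    """Return the first error type found, checking the most severe first."""
    if name not in tools:
        return "tool_hallucination"      # the tool does not exist
    schema = tools[name]
    if any(arg not in schema["parameters"] for arg in args):
        return "parameter_error"         # an argument the tool does not define
    if any(req not in args for req in schema["required"]):
        return "missing_argument"        # a required argument was omitted
    return "valid"


def reward(name: str, args: dict, tools: dict[str, dict]) -> float:
    """Map a generated tool call to a scalar reward for RL."""
    return REWARDS[classify_call(name, args, tools)]
```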

Novel Architectural Elements

Loss masking mechanism for specific sub-sequences in SFT based on execution feedback analysis
Adaptive token-level loss weighting specifically targeting tool-name prefixes
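
To make the interaction of the two elements concrete, here is a small numeric sketch of a negative log-likelihood that applies both a validity mask and per-token weights. It works on plain Python lists of per-token probabilities; the function name and the normalization by the number of unmasked tokens are assumptions, not details taken from the paper.

```python
import math


def weighted_masked_nll(
    token_probs: list[float],
    weights: list[float],
    mask: list[int],
) -> float:
    """Average of -w_i * log(p_i) over tokens whose mask entry is 1."""
    if not (len(token_probs) == len(weights) == len(mask)):
        raise ValueError("token_probs, weights, and mask must have equal length")
    # Masked tokens are skipped entirely, so they add nothing to the loss.
    total = sum(
        -w * math.log(p)
        for p, w, m in zip(token_probs, weights, mask)
        if m
    )
    kept = sum(1 for m in mask if m)
    return total / kept if kept else 0.0
```

For example, with probabilities `[0.5, 0.5, 0.1]`, weights `[9, 1, 1]`, and mask `[1, 1, 0]`, the third token is ignored and the first token dominates the loss because of its higher weight.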

Modeling

Base Model: CodeLLaMA-2-7B

Training Method: SFT followed by PPO (Proximal Policy Optimization)

Objective Functions:

Purpose: SFT Loss with masking and weighting.

Formally: Minimize negative log-likelihood weighted by w_i (importance) and M_i (validity mask)

Purpose: PPO Optimization.

Formally: Maximize expected reward E[R(t_i) - β * KL(π_θ || π_sft)]
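
Written out, the two objectives described above could take the following form. This is a reconstruction from the verbal descriptions, not the paper's exact notation: the token symbol x_i, the summation over token positions, and the absence of a normalization constant are assumptions.

```latex
% SFT loss: negative log-likelihood with importance weight w_i and validity mask M_i
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\sum_{i} M_i \, w_i \, \log \pi_\theta\!\left(x_i \mid x_{<i}\right)

% PPO objective: expected reward with a KL penalty toward the SFT policy
\max_{\theta} \;
  \mathbb{E}\!\left[ R(t_i) - \beta \, \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{sft}}\right) \right]
```

Setting M_i = 0 removes a token from the SFT loss entirely, while w_i > 1 increases the gradient contribution of a key token; β controls how far the RL policy may drift from the SFT model.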

Training Data:

1,217 curated multi-turn tool-call trajectories generated with GPT-4o
Derived from RoTLLaMA training set (originally 12,247, filtered down)

Key Hyperparameters:

learning_rate: 1e-6 (SFT and Critic), 2e-6 (Actor)
batch_size: 4 (SFT), 8 (RL)
epochs: 1 (SFT), 3 (RL)
warmup_rate: 0.01
w_max: 9 (max weight for PKT strategy)
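
For reference, the reported values can be collected into a single configuration object. This is only an organizational sketch: the class name and field names are hypothetical, and no settings beyond those listed above are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TLTrainingConfig:
    """Hyperparameters as listed in the summary; field names are illustrative."""
    sft_lr: float = 1e-6        # learning rate for SFT
    critic_lr: float = 1e-6     # learning rate for the PPO critic
    actor_lr: float = 2e-6      # learning rate for the PPO actor
    sft_batch_size: int = 4
    rl_batch_size: int = 8
    sft_epochs: int = 1
    rl_epochs: int = 3
    warmup_rate: float = 0.01
    w_max: float = 9.0          # maximum token weight for the PKT strategy
```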

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolLLaMA-2: TL-Training uses roughly 100x less data (1.2k vs 120k+) but achieves higher accuracy by filtering errors and weighting tokens
vs. NexusRaven: TL-Training incorporates reinforcement learning with fine-grained error penalties, whereas NexusRaven relies primarily on SFT
vs. ToolBench [not cited in paper]: ToolBench focuses on massive data scaling; TL-Training focuses on data quality and targeted loss manipulation

Limitations

Relies on the availability of structured feedback in training data to identify errors for masking
Analysis and experiments primarily conducted on LLaMA-2/CodeLLaMA family models
Effectiveness of PKT depends on the tokenization of tool names (identifying prefixes)

Reproducibility

Code: https://github.com/Junjie-Ye/TL-Training

Code and data are available at https://github.com/Junjie-Ye/TL-Training. The paper specifies hyperparameter values for both the SFT and RL stages, and the custom dataset of 1,217 trajectories is provided.

📊 Experiments & Results

Evaluation Setup

Tool use evaluation across single-turn and multi-turn scenarios using provided test sets

Benchmarks:

ToolAlpaca (Single-turn tool use)
RoTBench (Single-turn tool use, robustness)
BFCL-v3 (Single-turn tool use)
ToolEyes (Multi-turn tool use)

Metrics:

CF (Content Filling) - Overall success rate for single-turn
CE (Tool Call Error) - Rate of incorrect invocations
DE (Documentation Error) - Rate of hallucination/missing params
VA (Valid Answers) - Success rate in multi-turn
Statistical methodology: Not explicitly reported in the paper

Key Results

Single-turn performance comparisons showing TL-CodeLLaMA-2 outperforms open-source baselines and rivals GPT-4.
Benchmark	Metric	Baseline	This Paper	Δ
ToolAlpaca	CF (Content Filling)	45.00	60.78	+15.78
BFCL-v3	CF (Content Filling)	73.53	85.61	+12.08
RoTBench	CF (Content Filling)	45.19	64.90	+19.71

Multi-turn performance on the ToolEyes dataset, measuring error rates and valid response rates.
Benchmark	Metric	Baseline	This Paper	Δ
ToolEyes (Multi-turn)	Total Error (DE + CE)	7.49	5.64	-1.85
ToolEyes (Multi-turn)	Total Error (DE + CE)	11.12	8.38	-2.74
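
The Δ column is the "This Paper" value minus the baseline, so positive values are better for CF and negative values are better for error rates. A short sketch that recomputes the column from the table values; the function name and tuple layout are illustrative only.

```python
import math

# (benchmark, baseline, this_paper, reported_delta), copied from the table above
ROWS = [
    ("ToolAlpaca", 45.00, 60.78, 15.78),
    ("BFCL-v3", 73.53, 85.61, 12.08),
    ("RoTBench", 45.19, 64.90, 19.71),
    ("ToolEyes (row 1)", 7.49, 5.64, -1.85),
    ("ToolEyes (row 2)", 11.12, 8.38, -2.74),
]


def mismatched_deltas(rows):
    """Return the benchmarks whose reported delta differs from this_paper - baseline."""
    return [
        name
        for name, baseline, ours, delta in rows
        if not math.isclose(ours - baseline, delta, abs_tol=1e-9)
    ]
```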

Experiment Figures

A breakdown of error types in the RoTLLaMA training set (generated by GPT-4), showing 17% erroneous data.

A hierarchical taxonomy of tool-use error categories (e.g., Documentation Error -> Tool Hallucination).

Main Takeaways

TL-CodeLLaMA-2 scores above the average of the evaluated models on every metric across the tested datasets, the only open-source model evaluated to do so.
The model matches or beats GPT-4-turbo on specific single-turn benchmarks (ToolAlpaca, BFCL-v3) despite being a 7B model.
Ablation studies confirm that masking erroneous paths (MAE) reduces error rates by nearly one-third compared to standard SFT.
The method is highly data-efficient, achieving state-of-the-art results with only ~1.2k training examples.

📚 Prerequisite Knowledge

Prerequisites

Supervised Fine-Tuning (SFT) for LLMs
Reinforcement Learning from Human Feedback (RLHF)
Proximal Policy Optimization (PPO)
Tool-use/Function-calling mechanisms in LLMs

Key Terms

PPO: Proximal Policy Optimization—a reinforcement learning algorithm that updates a policy in small, stable steps using a clipped objective

SFT: Supervised Fine-Tuning—training a model on labeled examples to adapt it to a specific task

MAE: Mitigating Adverse Effects—a proposed strategy to mask the loss of erroneous tool interaction paths during training

PKT: Prioritizing Key Tokens—a proposed strategy to assign higher loss weights to critical tokens (like the start of a tool name) during training

CF: Content Filling—a metric assessing the model's ability to select the correct tool, identify parameters, and fill values

RoTLLaMA: A dataset and model baseline focused on robustness in tool learning, used here as a data source

ToolLLaMA: A family of LLaMA-based models fine-tuned specifically for tool-use tasks