Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use

📝 Paper Summary

Tool profiling, Tool-use post-training

Trace-Free+ uses curriculum learning to train a tool description generator that transfers insights from trace-rich training scenarios to trace-free inference settings, enabling better tool use without requiring trial-and-error interactions.

Core Problem

Tool interfaces (descriptions and schemas) are designed for humans, not LLMs, leading to failures in tool selection and parameter generation. Existing optimization methods rely on execution traces (trial loops), which are unavailable in cold-start or privacy-constrained deployments.

Why it matters:

LLM agents often fail when selecting from large tool sets because documentation is ambiguous or missing critical usage constraints
Interacting with tools to collect traces (e.g., 'try and fail') is often unsafe, costly (API fees), or impossible for new private tools
Existing prompt-based optimization is slow and does not generalize to unseen tools

Concrete Example: A tool parameter 'ip_address' might require strictly IPv4 format, but the original description doesn't state this. An agent fails repeatedly trying IPv6. Trace-based methods fix this by observing the failure, but Trace-Free+ learns to predict such constraints proactively without needing the initial failure.
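To make the example concrete, the sketch below contrasts an original and a rewritten description for a hypothetical `ip_address` parameter and checks candidate arguments with Python's standard `ipaddress` module. The description texts and the `is_valid_argument` helper are illustrative assumptions, not taken from the paper.

```python
import ipaddress

# Hypothetical parameter descriptions; the wording is illustrative only.
ORIGINAL_DESCRIPTION = "ip_address: The IP address to look up."
REWRITTEN_DESCRIPTION = (
    "ip_address: The IP address to look up. Must be IPv4 in dotted-decimal "
    "form (e.g. 192.0.2.1); IPv6 addresses are rejected."
)


def is_valid_argument(value: str) -> bool:
    """Return True if the value satisfies the assumed IPv4-only constraint."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ipaddress.AddressValueError:
        return False


if __name__ == "__main__":
    # An agent reading only ORIGINAL_DESCRIPTION has no way to know that the
    # second candidate will be rejected by the tool.
    for candidate in ["192.0.2.1", "2001:db8::1"]:
        print(candidate, is_valid_argument(candidate))
```

The rewritten description states the constraint up front, so the agent does not need a failed call to discover it.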

Key Novelty

Curriculum Learning for Trace-Free Tool Interface Improvement

Trains a model to rewrite tool descriptions by progressively removing access to execution traces during training
Uses an agentic workflow to synthesize a large-scale dataset of high-quality tool descriptions grounded in real execution failures, then distills this knowledge into a generator
Enables 'zero-shot' optimization of documentation for new tools without requiring any API calls

Architecture

Conceptual workflow of Trace-Free+: From data synthesis (using traces to fix docs) to curriculum training (gradually hiding traces) to trace-free inference.

Evaluation Highlights

+7.1 points in Solvable SR on StableToolBench (unseen tools), from 73.2 to 80.3, compared to the original tool descriptions using Llama-3-8B
Achieves performance comparable to trace-based upper bounds while using zero traces at inference time
Robust to scaling: maintains performance even as the number of candidate tools increases to 100, whereas baselines degrade significantly

Breakthrough Assessment

7/10

Offers a practical solution to the 'cold start' problem in tool use. While the gain is moderate (+7.1 points on StableToolBench), the ability to optimize documentation without execution access is a significant deployment enabler.

⚙️ Technical Details

Problem Definition

Setting: Generating improved tool descriptions d' from original descriptions d and schemas s, to maximize agent success rate R(A'; Q)

Inputs: Original tool interface (description d, schema s)

Outputs: Improved tool description d'
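One way to write this setting as an optimization problem is sketched below. The source only names R(A'; Q); reading A' as the agent that uses the rewritten descriptions, Q as the set of evaluation queries, and introducing G_θ for the description generator with parameters θ are assumptions made here for illustration.

```latex
\max_{\theta} \; R(A'; Q)
\quad \text{subject to} \quad
d' = G_{\theta}(d, s),
\qquad \text{where } A' \text{ is the agent that reads } d' \text{ in place of } d .
```

Note that this states the end goal only; the generator itself is trained with supervised fine-tuning rather than by optimizing R directly.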

Pipeline Flow

Tool Description Generator (Trace-Free+) → Generates d'
Tool Agent (Downstream User) → Uses d' to solve tasks

System Modules

Description Generator

Rewrites original tool documentation to be more agent-friendly

Model or implementation: Llama-3-8B-Instruct (Fine-tuned)

Novel Architectural Elements

Curriculum learning schedule that shifts supervision from trace-rich inputs (interface + execution history) to trace-free inputs (interface only) during training
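The difference between the two input types can be sketched as a single formatting function that includes the execution history only when it is available. The section headers and the function name `build_generator_input` are hypothetical; the paper's actual prompt template is not given in this summary.

```python
from typing import Optional


def build_generator_input(
    description: str, schema: str, trace: Optional[str] = None
) -> str:
    """Format the generator input; the trace section is omitted when trace is None."""
    parts = [
        "### Original description\n" + description,
        "### Schema\n" + schema,
    ]
    if trace is not None:
        # Trace-rich input: interface plus execution history.
        parts.append("### Execution trace\n" + trace)
    return "\n\n".join(parts)


if __name__ == "__main__":
    rich = build_generator_input("Look up an IP.", '{"ip_address": "string"}',
                                 trace="call failed: invalid address")
    free = build_generator_input("Look up an IP.", '{"ip_address": "string"}')
    print(rich)
    print("---")
    print(free)
```

Because both variants share the same target description, the model can be asked to produce the same improved text with progressively less evidence in its input.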

Modeling

Base Model: Llama-3-8B-Instruct

Training Method: Supervised Fine-Tuning (SFT) with Curriculum Learning

Objective Functions:

Purpose: Minimize difference between generated description and ground-truth improved description.

Formally: Standard auto-regressive language modeling loss on the target description d'.
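Written out, a standard auto-regressive loss on the target description takes the following form. The conditioning variable x and the trace symbol τ are notation introduced here as assumptions: x = (d, s, τ) for trace-rich examples and x = (d, s) for trace-free examples.

```latex
\mathcal{L}(\theta)
  = - \sum_{t=1}^{|d'|} \log p_{\theta}\!\left( d'_{t} \,\middle|\, d'_{<t},\, x \right),
\qquad
x =
\begin{cases}
(d, s, \tau) & \text{trace-rich example} \\
(d, s)       & \text{trace-free example}
\end{cases}
```

Here d'_t is the t-th token of the ground-truth improved description and d'_{<t} are the tokens before it.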

Training Data:

Source: 107 filtered 'seed' API providers from ToolBench
Workflow: (1) Annotate API health, (2) Synthesize queries, (3) Collect traces, (4) Improve descriptions using RIMRULE rules extracted from failure traces
Total: ~4.5k training examples (mixed trace-based and trace-free)

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 128
epochs: 3
curriculum_schedule: Gradually increase ratio of trace-free examples from 0% to 100%
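A minimal sketch of such a schedule is shown below. The source states only that the trace-free ratio rises gradually from 0% to 100%; the linear ramp, the per-step sampling, and the names `trace_free_ratio` and `hide_trace` are assumptions for illustration.

```python
import random


def trace_free_ratio(step: int, total_steps: int) -> float:
    """Linearly ramp the fraction of trace-free examples from 0.0 to 1.0."""
    if total_steps <= 1:
        return 1.0
    return min(1.0, max(0.0, step / (total_steps - 1)))


def hide_trace(step: int, total_steps: int, rng: random.Random) -> bool:
    """Decide whether the example at this step is presented without its trace."""
    return rng.random() < trace_free_ratio(step, total_steps)


if __name__ == "__main__":
    rng = random.Random(0)
    total = 11
    for step in range(total):
        ratio = trace_free_ratio(step, total)
        print(step, f"{ratio:.1f}", hide_trace(step, total, rng))
```

At the first step every example keeps its trace, and at the last step every example is trace-free, which matches the inference-time condition.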

Compute: 8x H100 GPUs for data synthesis; training also uses 8x H100 GPUs, even though the model has only 8B parameters

Comparison to Prior Work

vs. ToolBench: Optimizes the documentation rather than using it raw
vs. DRAFT/Play2Prompt: Trace-Free+ does NOT require execution at test time; it transfers learned patterns to new tools without interaction
vs. FIRE: Focuses on improving the static interface (description) rather than the agent's dynamic reasoning steps

Limitations

Relies on the existence of 'seed tools' that are functional to train the generator
Cannot fix parameter schema errors in the trace-free setting (only improves descriptions)
Performance gain depends on the quality of the underlying LLM's ability to generalize patterns

Reproducibility

Code: https://github.com/Ruocheng-Guo/Trace-Free-Plus

The dataset synthesis workflow is detailed in Section 3.1 of the paper. Seed tools are derived from ToolBench but heavily filtered for availability.

📊 Experiments & Results

Evaluation Setup

Tool-use evaluation on unseen tools

Benchmarks:

StableToolBench (multi-step tool use, simulated)
RestBench (RESTful API calls)

Metrics:

Solvable Pass Rate (Solvable SR)
Pass Rate (SR)
Statistical methodology: Not explicitly reported in the paper
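The two metrics can be computed as sketched below. Treating Solvable SR as the pass rate restricted to queries marked solvable is an assumption based on the metric's name; the record layout and function names are hypothetical.

```python
from typing import Iterable, Tuple


def pass_rate(outcomes: Iterable[bool]) -> float:
    """Percentage of queries on which the agent completed the task."""
    results = list(outcomes)
    if not results:
        raise ValueError("pass_rate needs at least one outcome")
    return 100.0 * sum(results) / len(results)


def solvable_pass_rate(records: Iterable[Tuple[bool, bool]]) -> float:
    """Pass rate over (is_solvable, passed) records, ignoring unsolvable queries."""
    return pass_rate(passed for is_solvable, passed in records if is_solvable)


if __name__ == "__main__":
    records = [(True, True), (True, False), (False, False), (True, True)]
    print(pass_rate(passed for _, passed in records))
    print(solvable_pass_rate(records))
```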

Key Results

Trace-Free+ consistently improves tool use performance on unseen tools compared to using original documentation.
Benchmark	Metric	Baseline	This Paper	Δ
StableToolBench	Solvable SR	73.2	80.3	+7.1
RestBench (TMDB)	Pass Rate	66.7	77.1	+10.4

Ablation shows curriculum learning is essential; training only on trace-free data is suboptimal.
Benchmark	Metric	Baseline	This Paper	Δ
StableToolBench	Solvable SR	78.2	80.3	+2.1
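The Δ values are absolute differences in percentage points, not relative percentages. The short check below recomputes them from the reported scores and also derives the relative gain over each baseline; the row labels and the `deltas` helper are introduced here for illustration.

```python
from typing import Tuple

# (label, baseline score, Trace-Free+ score), copied from the results above.
RESULTS = [
    ("StableToolBench Solvable SR", 73.2, 80.3),
    ("RestBench (TMDB) Pass Rate", 66.7, 77.1),
    ("StableToolBench Solvable SR (ablation row)", 78.2, 80.3),
]


def deltas(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = round(ours - baseline, 1)
    relative = round(100.0 * (ours - baseline) / baseline, 1)
    return absolute, relative


if __name__ == "__main__":
    for label, baseline, ours in RESULTS:
        absolute, relative = deltas(baseline, ours)
        print(f"{label}: +{absolute} points, +{relative}% relative")
```

Read this way, the +7.1-point gain on StableToolBench corresponds to a relative improvement of about 9.7% over the original descriptions.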

Experiment Figures

Pass rate vs. Number of Tools. X-axis: Number of candidate tools (10 to 100). Y-axis: Pass Rate.

Main Takeaways

Optimizing tool descriptions is a viable alternative to fine-tuning agents, offering portability across different agent models.
The curriculum learning strategy effectively bridges the gap between trace-rich training and trace-free inference.
The method scales well: as the number of candidate tools increases (up to 100), the performance gap between Trace-Free+ and baselines widens, indicating robustness to a growing pool of distractor tools.

📚 Prerequisite Knowledge

Prerequisites

LLM-based tool use (agents calling APIs)
Supervised Fine-Tuning (SFT)
Curriculum Learning

Key Terms

Trace-free: A setting where tools cannot be executed to observe outputs/errors during the description optimization phase, so only the interface (description and schema) is available

Cold-start: Deploying an agent with new tools that have no prior usage history or logs

RIMRULE: A method used in the paper's data pipeline to extract natural language rules from failed execution traces

ToolBench: A large-scale dataset of real-world RESTful APIs used for training and evaluation

Pass Rate: The percentage of test queries where the agent successfully completes the task

StableToolBench: An evaluation framework for tool-using agents that provides reliable metrics by caching real API responses

Seed tools: A filtered subset of tools known to be functional, used to generate synthetic training data