
Learning Evolving Tools for Large Language Models

Guoxin Chen, Zhong Zhang, Xin Cong, Fangda Guo, Yesai Wu, Yankai Lin, Wenzheng Feng, Yasheng Wang
Institute of Computing Technology, Chinese Academy of Sciences, Tsinghua University, Renmin University of China, Huawei Noah’s Ark Lab
arXiv (2024)
Agent, Benchmark, RL

📝 Paper Summary

Self-evolving Agentic reasoning RL-based
TOOLEVO enables LLMs to adapt to changing tools (API updates) by actively exploring dynamic environments via Monte Carlo Tree Search (MCTS), reflecting on errors, and autonomously updating their own tool definitions.
Core Problem
Real-world APIs change frequently (names, parameters, response formats), causing LLMs trained on static documentation to fail when the deployed tools diverge from their training data.
Why it matters:
  • Static tool learning approaches (supervised fine-tuning, SFT, on fixed datasets) lock models into memorized usage patterns ("stereotypes") and fail catastrophically when APIs are updated or deprecated
  • Manually updating tool documentation and retraining models in real time is resource-intensive and often impractical
Concrete Example: An LLM trained to call 'RetrieveAgenda' with a 'keyword' parameter fails when the API is renamed to 'Fetch_Agenda_Data' and requires a 'Query' parameter instead. Static models keep retrying the old format, while TOOLEVO detects the error, explores alternatives, and updates its own tool definition.
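The contrast in this example can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: `call_api`, `reflect_and_update`, `invoke`, and the wording of the error messages are hypothetical, and a regular expression stands in for the LLM's self-reflection. Only the tool and parameter names come from the example above.

```python
import re

def call_api(name, **params):
    """Hypothetical mock of the deployed environment after the API update."""
    if name == "RetrieveAgenda":
        # Assumed error format: the deprecation notice names the replacement.
        return {"ok": False,
                "error": "RetrieveAgenda is deprecated; use Fetch_Agenda_Data(Query=...)"}
    if name == "Fetch_Agenda_Data":
        if "Query" not in params:
            return {"ok": False, "error": "Fetch_Agenda_Data requires parameter 'Query'"}
        return {"ok": True, "result": f"agenda entries matching {params['Query']!r}"}
    return {"ok": False, "error": f"unknown tool {name}"}

def reflect_and_update(tool_def, error):
    """Toy stand-in for self-reflection: read the hint out of the error message."""
    match = re.search(r"use (\w+)\((\w+)=", error)
    if match:
        return {"name": match.group(1), "param": match.group(2)}
    return tool_def  # no usable hint, keep the current definition

def invoke(tool_def, value, adapt, max_attempts=3):
    """Call the tool; if adapt is True, rewrite the definition after each failure."""
    for _ in range(max_attempts):
        response = call_api(tool_def["name"], **{tool_def["param"]: value})
        if response["ok"]:
            return response["result"], tool_def
        if adapt:
            tool_def = reflect_and_update(tool_def, response["error"])
    return None, tool_def

# The outdated definition the model learned from static documentation.
outdated = {"name": "RetrieveAgenda", "param": "keyword"}

static_result, _ = invoke(dict(outdated), "budget review", adapt=False)
adaptive_result, updated = invoke(dict(outdated), "budget review", adapt=True)
print(static_result)    # the static caller exhausts its retries
print(adaptive_result)  # the adaptive caller succeeds after one update
print(updated)
```

In this sketch the static caller repeats the same failing call until it runs out of attempts, while the adaptive caller rewrites its definition after the first error and succeeds on the second attempt.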
Key Novelty
Self-Evolving Tool Learning via MCTS
  • Treats tool use as a search problem where the LLM explores dynamic environments using MCTS to find working API calls despite outdated instructions
  • Implements a 'Tool-Update' mechanism where the agent reflects on error messages (e.g., deprecation warnings) to rewrite its own prompt-based tool definitions
Architecture
Architecture Figure: Figure 1 (Right)
Overview of the TOOLEVO framework using MCTS, showing the cycle of Selection, Expansion (with API invocation), and Backpropagation, integrated with Self-Reflection and Tool Updates.
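To make the Selection, Expansion, and Backpropagation cycle concrete, here is a minimal MCTS sketch on a toy problem. Everything in it is an assumption made for illustration: the two-step action space (pick a tool name, then a parameter name), the mock `call_api` reward, the UCT constant, and the random completion used to score partial calls are not taken from the paper. Self-Reflection and Tool Updates are omitted.

```python
import math
import random

# Hypothetical candidate space: choose a tool name, then a parameter name.
TOOL_NAMES = ["RetrieveAgenda", "Fetch_Agenda_Data"]
PARAM_NAMES = ["keyword", "Query"]

def call_api(name, param):
    """Mock deployed API: only the updated name and parameter succeed."""
    return 1.0 if (name, param) == ("Fetch_Agenda_Data", "Query") else 0.0

class Node:
    def __init__(self, path=(), parent=None):
        self.path = path      # choices made so far, e.g. ("Fetch_Agenda_Data",)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0      # sum of rewards backed up through this node

    def is_terminal(self):
        return len(self.path) == 2

    def untried(self):
        options = TOOL_NAMES if len(self.path) == 0 else PARAM_NAMES
        tried = {child.path[-1] for child in self.children}
        return [o for o in options if o not in tried]

def uct(child, parent_visits, c=1.4):
    """Upper confidence bound: mean reward plus an exploration bonus."""
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def search(iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node()
    for _ in range(iterations):
        node = root
        # Selection: descend through fully expanded nodes by UCT score.
        while not node.is_terminal() and not node.untried():
            node = max(node.children, key=lambda ch: uct(ch, node.visits))
        # Expansion: add one untried child.
        if not node.is_terminal():
            choice = rng.choice(node.untried())
            child = Node(node.path + (choice,), parent=node)
            node.children.append(child)
            node = child
        # API invocation: a complete call is executed directly; a partial call
        # is scored with a random parameter (a simplification assumed here).
        if node.is_terminal():
            reward = call_api(*node.path)
        else:
            reward = call_api(node.path[0], rng.choice(PARAM_NAMES))
        # Backpropagation: update statistics along the path back to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read out the most-visited path as the preferred call.
    best = root
    while best.children:
        best = max(best.children, key=lambda ch: ch.visits)
    return best.path

print(search())
```

On this toy problem the search should concentrate its visits on the one call that the mock environment accepts, because every failed invocation backs up a reward of zero. In the real setting the reward would come from executing the call against the live API and judging the response.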
Evaluation Highlights
  • +28.8% accuracy improvement over Static-SFT on the ToolQA-D-Hard benchmark in out-of-distribution (OOD) dynamic environments
  • Achieves superior stability, maintaining high performance across static, dynamic, and OOD settings, whereas Static-SFT performance degrades significantly in dynamic settings
  • Outperforms GPT-4 by 21% on average in the OOD dynamic environment setting
Breakthrough Assessment
8/10
Addresses a critical, overlooked problem (tool evolution/drift) with a robust MCTS-based solution. The ability to autonomously update tool definitions during inference is a significant step toward self-sustaining agents.