
Beyond Static Tools: Test-Time Tool Evolution for Scientific Reasoning

Jiaxuan Lu, Ziyu Kong, Yemin Wang, Rong Fu, Haiyuan Wan, Cheng Yang, Wenjie Lou, Haoran Sun, Lilong Wang, Yankai Jiang, Xiaosong Wang, Xiao Sun, Dongzhan Zhou
Shanghai Artificial Intelligence Laboratory, Fudan University, Xiamen University, University of Macau, Tsinghua University, Hangzhou Dianzi University
arXiv (2026)
Agent, Reasoning, Benchmark

📝 Paper Summary

Self-evolving, Agentic reasoning, Dynamic tool synthesis
Test-Time Tool Evolution (TTE) enables scientific agents to dynamically synthesize, verify, and refine executable tools during inference, overcoming the limitations of static pre-defined tool libraries in sparse scientific domains.
Core Problem
Static tool libraries fail in scientific domains because required tools are extremely sparse, heterogeneous, and often bespoke to novel inquiries, making manual pre-definition impossible.
Why it matters:
  • Scientific research demands exact, executable computation that LLMs cannot reliably provide without tools, yet static libraries cannot cover the long tail of specialized scientific functions.
  • Current agents act as passive selectors over existing APIs, so they cannot solve novel problems that require inventing new computational primitives.
Concrete Example: A chemist may need to compute a novel molecular property for which no tool exists in a static toolkit such as ChemCrow. A static agent fails because the tool is missing; TTE instead writes and verifies a Python function for that specific calculation during the reasoning process.
Key Novelty
Test-Time Tool Evolution (TTE)
  • Treats tool creation as an online optimization problem during inference, where the agent generates, executes, and refines code into reusable tools on the fly
  • Implements a 'Generate-Verify-Refine' loop that decomposes generated code into atomic 'cell tools' to maximize reusability across future problems
  • Maintains a dynamic tool library that evolves from scratch (Tabula Rasa) or adapts from one domain to another by pruning low-utility tools
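The loop described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the names `Tool`, `evolve_tool`, `verify`, and `prune` are hypothetical, the list `DRAFTS` stands in for successive LLM generations (the second draft plays the role of a refinement), and verification is reduced to numeric test cases. Decomposition into atomic cell tools and sandboxed execution are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Tool:
    name: str
    source: str
    func: Callable
    uses: int = 0  # number of times the tool was reused after creation


def compile_tool(name: str, source: str) -> Callable:
    """Execute generated source and return the function called `name`."""
    namespace: dict = {}
    exec(source, namespace)  # assumption: a real system would sandbox this
    return namespace[name]


def verify(func: Callable, cases: list) -> bool:
    """A candidate passes only if it reproduces every (args, expected) pair."""
    try:
        return all(abs(func(*args) - expected) < 1e-6 for args, expected in cases)
    except Exception:
        return False


def evolve_tool(name: str, drafts: list, cases: list, library: dict) -> Optional[Tool]:
    """Reuse a library tool if one exists; otherwise try drafts until one verifies."""
    if name in library:
        library[name].uses += 1
        return library[name]
    for source in drafts:  # each later draft stands in for a refinement step
        try:
            func = compile_tool(name, source)
        except Exception:
            continue  # draft did not even compile; move to the next one
        if verify(func, cases):
            library[name] = Tool(name, source, func)
            return library[name]
    return None


def prune(library: dict, min_uses: int) -> None:
    """Drop tools reused fewer than `min_uses` times (a stand-in for utility)."""
    for name in [n for n, t in library.items() if t.uses < min_uses]:
        del library[name]


# Hypothetical drafts: the first omits the gas constant, the second fixes it.
DRAFTS = [
    "def ideal_gas_pressure(n, t, v):\n    return n * t / v\n",
    "def ideal_gas_pressure(n, t, v):\n    return n * 8.314 * t / v\n",
]
CASES = [((1.0, 300.0, 1.0), 2494.2)]  # 1 mol at 300 K in 1 m^3, pressure in Pa

library: dict = {}
tool = evolve_tool("ideal_gas_pressure", DRAFTS, CASES, library)
```

Calling `evolve_tool` again with the same name returns the stored tool and increments its `uses` counter, which is the signal `prune` relies on here. A real system would need a richer utility measure than a raw count.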
Architecture
Figure 2: The closed-loop evolutionary workflow of TTE, detailing the five modules and their interactions.
Evaluation Highlights
  • TTE-Zero (starting from an empty library) achieves higher accuracy on the SciEvo benchmark than static-tool baselines such as ChemCrow and SciAgent
  • Achieves higher Tool Reuse Rate (TRR) than generative baselines, indicating synthesized tools are high-quality reusable assets rather than disposable scripts
  • Successfully adapts tools from Materials Science to Chemistry (TTE-Adapt), demonstrating cross-domain transfer of computational primitives
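This summary does not state how Tool Reuse Rate is computed. As a rough illustration only, the sketch below assumes one plausible reading: the fraction of tool invocations that hit a tool already present in the library, with the first call to each name counted as its creation. The function name `tool_reuse_rate` and the call-log format are hypothetical, and the paper's actual formula may differ.

```python
def tool_reuse_rate(call_log: list) -> float:
    """Fraction of calls that reuse an already-created tool.

    Assumption: `call_log` lists tool names in invocation order, and the
    first appearance of a name is the call that created the tool.
    """
    seen = set()
    reused = 0
    for name in call_log:
        if name in seen:
            reused += 1  # tool already existed, so this call is a reuse
        else:
            seen.add(name)  # first appearance counts as creation
    return reused / len(call_log) if call_log else 0.0
```

Under this reading, an agent that writes a fresh script for every problem scores 0, while an agent that keeps calling a small set of verified tools approaches 1.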
Breakthrough Assessment
8/10
Strong conceptual shift from static tool retrieval to dynamic tool evolution. The introduction of atomic refinement and a dedicated benchmark (SciEvo) for this paradigm makes it a significant contribution to AI for Science.