TOOLMAKER: The proposed agentic framework that transforms code repositories into LLM-compatible tools
TM-BENCH: A new benchmark introduced in this paper comprising 15 complex scientific tasks to evaluate tool creation agents
OpenHands: A state-of-the-art software engineering agent used as a baseline comparison
Docker: A platform for developing, shipping, and running applications in containers; used here to create reproducible execution environments
unit tests: Automated tests that verify whether a specific piece of code (here, the generated tool) meets its design requirements and behaves as expected
environment state: The condition of the execution environment (file system, installed packages) at a given point in time, managed via Docker checkpoints
foundation models: Large-scale pre-trained models (such as CLIP or pathology encoders) that the agents must download and use
closed-loop self-correction: A feedback mechanism where the agent runs its code, observes errors, and autonomously refines the code to fix issues
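
To make the last entry concrete, the run, observe, refine cycle can be sketched as a small loop. This is a minimal illustration only, not the paper's implementation: `self_correct`, `run_unit_test`, and `toy_refine` are hypothetical names, the "tool" is a one-line `add` function, and a string replacement stands in for the LLM call that would propose a fix from the observed error.

```python
from typing import Callable, List, Optional, Tuple


def self_correct(
    source: str,
    run: Callable[[str], Optional[str]],
    refine: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Run the code, observe the error, refine, and repeat.

    Returns (working_source, observed_errors), or (None, observed_errors)
    if no attempt passed within max_attempts.
    """
    errors: List[str] = []
    for _ in range(max_attempts):
        error = run(source)              # execute and observe
        if error is None:                # success closes the loop
            return source, errors
        errors.append(error)
        source = refine(source, error)   # feed the error back into a fix
    return None, errors


def run_unit_test(source: str) -> Optional[str]:
    """Hypothetical check: the candidate must define add(a, b) with add(2, 3) == 5."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        result = namespace["add"](2, 3)
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    if result != 5:
        return f"expected 5, got {result}"
    return None


def toy_refine(source: str, error: str) -> str:
    """Stand-in for an LLM call (assumption): patches one known bug."""
    return source.replace("a - b", "a + b")


BUGGY = "def add(a, b):\n    return a - b\n"
fixed, errors = self_correct(BUGGY, run_unit_test, toy_refine)
```

In a real agent, `run` would execute the candidate inside the containerized environment and `refine` would prompt the model with the captured error output; the attempt cap keeps the loop from running indefinitely when no fix is found.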