LLM Agents Making Agent Tools

📝 Paper Summary

Self-evolving · Agentic reasoning · Multi-step tool use with flexible plan

TOOLMAKER is an agentic framework that autonomously installs, configures, and wraps complex scientific code repositories into executable tools for LLMs using a closed-loop self-correction mechanism.

Core Problem

LLM agents struggle with complex scientific tasks because they rely on pre-existing, manually implemented tools, and existing tool-creation methods cannot handle the complexity of installing external dependencies or interacting with the operating system.

Why it matters:

Scientific discovery requires specialized, complex software (e.g., in genomics or pathology) that general-purpose LLM agents cannot use without manual integration
Privacy restrictions in healthcare often prevent agents from building tools from scratch using sensitive data
Existing agents like AIDE train simple models from scratch instead of using state-of-the-art foundation models available in public repositories

Concrete Example: When tasked to 'predict a biomarker from a whole slide image', a standard agent might try to train a simple CNN from scratch (yielding poor results). A researcher would instead use a specialized pipeline like STAMP. TOOLMAKER autonomously installs the STAMP repository, downloads dependencies, and wraps it into a tool the agent can use.

Key Novelty

Autonomous conversion of paper repositories into executable tools

Treats tool creation as a two-stage process: first autonomously setting up the execution environment (installing dependencies via Docker), then implementing the Python interface
Uses a closed-loop self-improvement cycle where the agent executes its candidate tool, diagnoses errors (reading logs/files), and iteratively fixes the implementation until it works

Architecture

Figure: the TOOLMAKER workflow, showing the separation of environment setup from the closed-loop implementation cycle.

Evaluation Highlights

Correctly implements 80% (12/15) of complex scientific tasks in the new TM-BENCH benchmark, compared to 20% (3/15) for the SOTA software engineering agent OpenHands
Passes 116/124 unit tests (93.5%) across diverse domains (pathology, radiology, omics), compared with 31/124 (25%) for OpenHands
Demonstrates cost-effectiveness, averaging $0.94 per tool creation while handling complex multi-step installations involving GPU dependencies
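
The percentages behind these highlights follow directly from the reported counts. The short sketch below only redoes that arithmetic; the helper name `rate` and the dictionary layout are illustrative choices, not something from the paper.

```python
def rate(passed: int, total: int) -> float:
    """Return a pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)


# Counts as reported in the highlights above: (passed, total).
reported = {
    "tasks": {"TOOLMAKER": (12, 15), "OpenHands": (3, 15)},
    "unit_tests": {"TOOLMAKER": (116, 124), "OpenHands": (31, 124)},
}

# Convert every (passed, total) pair into a percentage.
rates = {
    metric: {agent: rate(*counts) for agent, counts in agents.items()}
    for metric, agents in reported.items()
}

for metric, agents in rates.items():
    for agent, value in agents.items():
        print(f"{metric:10s} {agent:10s} {value}%")
```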

Breakthrough Assessment

8/10

Significantly advances agentic capabilities by enabling agents to 'build their own tools' from existing complex software rather than just writing simple Python functions. Strong empirical results on a hard, realistic benchmark.

⚙️ Technical Details

Problem Definition

Setting: Given a task description, a paper, and a GitHub URL, generate an executable tool definition (environment + code)

Inputs: Task description, GitHub URL, list of input arguments with example invocation

Outputs: A Docker image (execution environment) and a Python function implementing the task
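
One way to picture these inputs and outputs is as a pair of small records. The sketch below is illustrative only: the class names (`ToolArgument`, `ToolDefinition`, `ToolArtifact`), their fields, the placeholder URL, and the example values are all hypothetical and are not taken from the paper or the ToolMaker repository.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ToolArgument:
    """One input argument of the tool, with an example value (hypothetical schema)."""
    name: str
    description: str
    example: str


@dataclass(frozen=True)
class ToolDefinition:
    """What the framework is given: a task, a repository URL, and the arguments."""
    task_description: str
    repo_url: str
    arguments: Tuple[ToolArgument, ...] = ()

    def example_invocation(self) -> Dict[str, str]:
        """Build the example call as a mapping from argument name to example value."""
        return {arg.name: arg.example for arg in self.arguments}


@dataclass(frozen=True)
class ToolArtifact:
    """What the framework produces: an environment image and the tool's source."""
    docker_image: str
    python_source: str


# Hypothetical definition; the URL and path are placeholders, not real resources.
definition = ToolDefinition(
    task_description="Predict a biomarker from a whole slide image",
    repo_url="https://example.org/placeholder/repo",
    arguments=(
        ToolArgument("slide_path", "Path to the input slide", "/data/slide.svs"),
    ),
)
print(definition.example_invocation())
```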

Pipeline Flow

Install Repository (creates environment snapshot)
Explore & Plan (analyzes code/docs)
Implement (writes initial Python function)
Self-Correction Loop (Execute → Assess → Diagnose → Re-implement)
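
The last step of the pipeline can be sketched as a bounded retry loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the function `self_correct`, its callback signatures, the iteration cap, and the stub callbacks are hypothetical, and the LLM calls are replaced by plain Python functions.

```python
from typing import Callable, Optional, Tuple


def self_correct(
    implement: Callable[[Optional[str]], str],
    execute: Callable[[str], Tuple[bool, str]],
    max_iterations: int = 5,
) -> Tuple[Optional[str], int]:
    """Run Execute -> Assess -> Diagnose -> Re-implement until the tool works.

    Returns (working_code, iterations_used), or (None, max_iterations) if
    no attempt succeeded within the budget.
    """
    diagnosis: Optional[str] = None
    for iteration in range(1, max_iterations + 1):
        code = implement(diagnosis)   # implement, or re-implement using the diagnosis
        ok, log = execute(code)       # execute the candidate and assess the outcome
        if ok:
            return code, iteration
        diagnosis = f"previous attempt failed: {log}"  # diagnose from the log
    return None, max_iterations


# Stub callbacks standing in for the LLM and the sandbox (hypothetical).
def stub_implement(diagnosis: Optional[str]) -> str:
    return "tool_v1" if diagnosis is None else "tool_v2"


def stub_execute(code: str) -> Tuple[bool, str]:
    if code == "tool_v2":
        return True, "ok"
    return False, "ModuleNotFoundError: missing dependency"


final_code, iterations = self_correct(stub_implement, stub_execute)
print(final_code, iterations)
```

Capping the number of iterations keeps the cost of a failed tool bounded; the cap of 5 here is arbitrary.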

System Modules

Install Repository Agent

Clones repository, reads documentation, and performs bash commands to install dependencies and download models

Model or implementation: gpt-4o-2024-08-06

Explore & Plan (Tool Implementation)

Analyzes the repository structure and creates a step-by-step implementation plan

Model or implementation: gpt-4o-2024-08-06

Implement & Self-Correct (Tool Implementation)

Generates Python code, executes it, diagnoses errors from logs, and iteratively fixes the code

Model or implementation: gpt-4o-2024-08-06

Novel Architectural Elements

Stateful execution environment with rollback: Uses Docker checkpointing to reset the environment to a 'fresh install' state before every test run, preventing accumulated side effects
Two-stage separation: Explicitly separates 'Environment Setup' (installing heavy dependencies/models) from 'Tool Implementation' (writing the interface logic)
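
The rollback idea can be illustrated without Docker by treating the environment as a dictionary of files. The sketch below is an in-memory stand-in, not the framework's mechanism: `SnapshotEnvironment`, `run_candidate`, and the file names are hypothetical, and a real system would checkpoint a container rather than copy a dictionary.

```python
import copy
from typing import Callable, Dict, Optional


class SnapshotEnvironment:
    """In-memory stand-in for a container whose state can be checkpointed."""

    def __init__(self) -> None:
        self.files: Dict[str, str] = {}
        self._checkpoint: Optional[Dict[str, str]] = None

    def checkpoint(self) -> None:
        """Record the current state, e.g. right after installation finishes."""
        self._checkpoint = copy.deepcopy(self.files)

    def restore(self) -> None:
        """Discard all changes made since the checkpoint."""
        if self._checkpoint is None:
            raise RuntimeError("no checkpoint has been taken")
        self.files = copy.deepcopy(self._checkpoint)


def run_candidate(
    env: SnapshotEnvironment,
    candidate: Callable[[SnapshotEnvironment], None],
) -> Dict[str, str]:
    """Reset to the fresh-install state, run one candidate, return the resulting files."""
    env.restore()
    candidate(env)
    return dict(env.files)


env = SnapshotEnvironment()
env.files["repo/README.md"] = "installation notes"  # state after the install stage
env.checkpoint()

# Two candidate tools that each leave a side effect behind.
first = run_candidate(env, lambda e: e.files.update({"out.csv": "partial"}))
second = run_candidate(env, lambda e: e.files.update({"result.json": "{}"}))
print(sorted(second))
```

Because every run starts from the checkpoint, the file written by the first candidate is absent when the second one runs, which is the side-effect isolation described above.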

Modeling

Base Model: gpt-4o-2024-08-06

Compute: Execution environment requires Docker. Inference costs averaged $0.94 per tool using gpt-4o.

Comparison to Prior Work

vs. OpenHands: TOOLMAKER uses a specialized two-stage workflow (install vs. implement) and strict environment rollback, whereas OpenHands often produces invalid environment definitions
vs. CREATOR/LATM: TOOLMAKER handles complex dependencies (libraries, models) and OS interaction (files, bash), whereas prior tool makers only generate standalone Python code
vs. AIDE [not cited in paper]: TOOLMAKER utilizes existing repositories rather than training models from scratch

Limitations

Depends on the quality and documentation of the source repositories; cannot fix fundamentally broken or empty repositories
Potential security risks in autonomously executing unverified code from the internet
Benchmark correctness relies on passing unit tests, which may not capture all edge cases or guarantee full scientific validity
Does not address physical experimentation, only computational/in-silico tools

Reproducibility

Code: https://github.com/KatherLab/ToolMaker

📊 Experiments & Results

Evaluation Setup

Autonomous generation of tools from 15 GitHub repositories across medical and general domains

Benchmarks:

TM-BENCH (tool creation: environment setup + implementation) [New]

Metrics:

Success Rate (Task Implementation)
Pass Rate (Unit Tests)
Cost per Tool ($)
Number of Self-Correction Iterations

Statistical methodology: Not explicitly reported in the paper
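
The success-rate and pass-rate metrics listed above can be related with a small aggregation. The sketch below assumes that a task counts as correctly implemented only when all of its unit tests pass; this summary does not state that rule, so treat it as an assumption. The function name `summarize` and the toy results are hypothetical and are not data from the paper.

```python
from typing import Dict, List


def summarize(test_results: Dict[str, List[bool]]) -> Dict[str, int]:
    """Aggregate per-task unit test outcomes into task-level and test-level counts.

    Assumption: a task is 'correct' only if it has at least one test and all pass.
    """
    outcomes = list(test_results.values())
    return {
        "tests_passed": sum(sum(o) for o in outcomes),
        "tests_total": sum(len(o) for o in outcomes),
        "tasks_correct": sum(1 for o in outcomes if o and all(o)),
        "tasks_total": len(outcomes),
    }


# Toy outcomes for three made-up tasks (not results from the paper).
toy_results = {
    "task_a": [True, True, True],
    "task_b": [True, False],
    "task_c": [True],
}
summary = summarize(toy_results)
print(summary)
```

Under this rule a single failing test marks the whole task as incorrect, which is why the task-level rate can sit well below the test-level rate.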

Key Results

TOOLMAKER outperforms OpenHands on TM-BENCH in both tasks correctly implemented and unit tests passed.

Benchmark	Metric	Baseline	This Paper	Δ
TM-BENCH	Tasks Correctly Implemented (of 15)	3	12	+9
TM-BENCH	Unit Tests Passed (of 124)	31	116	+85
TM-BENCH	Average Cost per Tool ($)	0.15	0.94	+0.79
TM-BENCH	Tasks Correctly Implemented (of 15)	12	9	-3
TM-BENCH	Tasks Correctly Implemented (of 15)	12	11	-1

Main Takeaways

Separating environment installation from tool implementation is crucial for complex scientific tasks
OpenHands fails primarily at the environment setup stage (invalid Dockerfiles, missing dependencies)
Self-correction allows TOOLMAKER to handle multi-step tasks (e.g., feature extraction followed by training) that require intermediate file handling
Using paper summaries in prompts reduces token usage and cost but does not improve (and slightly hurts) success rate

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM agents and tool use
Familiarity with Docker and containerization
Knowledge of software engineering workflows (installation, dependency management, unit testing)

Key Terms

TOOLMAKER: The proposed agentic framework that transforms code repositories into LLM-compatible tools

TM-BENCH: A new benchmark introduced in this paper comprising 15 complex scientific tasks to evaluate tool creation agents

OpenHands: A state-of-the-art software engineering agent used as a baseline comparison

Docker: A platform for developing, shipping, and running applications in containers; used here to create reproducible execution environments

unit tests: Automated tests that verify if a specific section of code (the generated tool) meets design requirements and behaves as expected

environment state: The condition of the execution environment (file system, installed packages) at a given point in time, managed via Docker checkpoints

foundation models: Large-scale pre-trained models (like CLIP or pathology encoders) that the agents must download and utilize

closed-loop self-correction: A feedback mechanism where the agent runs its code, observes errors, and autonomously refines the code to fix issues