Large Language Models as Tool Makers

📝 Paper Summary

Tool creation, Self-evolving, Agentic reasoning

LATM separates reasoning into a one-time 'tool-making' phase by a powerful model and a repetitive 'tool-using' phase by a lightweight model, enabling cost-effective problem solving.

Core Problem

Existing methods rely on pre-defined tools or use expensive models (like GPT-4) for every inference step, which is cost-prohibitive and inefficient for recurring complex tasks.

Why it matters:

Lightweight models (e.g., GPT-3.5) often fail at complex arithmetic or algorithmic reasoning tasks where powerful models succeed but cost too much.
Current caching systems only store text responses, missing the opportunity to cache reusable logic (tools) that can solve functionally similar but textually different requests.

Concrete Example: In the BigBench 'Dyck Language' task (bracket matching), GPT-3.5 Turbo with Chain-of-Thought (CoT) reaches only 20.4% accuracy because it loses track of the bracket state. GPT-4 can solve the task but is expensive to call on every instance. LATM has GPT-4 write a Python bracket-checking function once, which GPT-3.5 Turbo then calls on new instances, reaching 92.2% accuracy.
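A minimal sketch of the kind of tool this example describes. It is not the function GPT-4 produced in the paper; the names `complete_brackets` and `is_balanced` are hypothetical. It only illustrates why an explicit stack avoids the state-tracking errors of text-based reasoning.

```python
from typing import Optional

# Opening bracket -> matching closing bracket.
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}
CLOSERS = set(PAIRS.values())


def complete_brackets(sequence: str) -> Optional[str]:
    """Return the closing brackets needed to balance `sequence`.

    Returns "" if the sequence is already balanced and None if it
    contains a closing bracket that does not match the last open one.
    Characters that are not brackets (e.g. spaces) are ignored.
    """
    stack = []
    for ch in sequence:
        if ch in PAIRS:
            stack.append(ch)
        elif ch in CLOSERS:
            if not stack or PAIRS[stack.pop()] != ch:
                return None
    # Close the still-open brackets, innermost first.
    return "".join(PAIRS[ch] for ch in reversed(stack))


def is_balanced(sequence: str) -> bool:
    """True if every bracket in `sequence` is correctly closed."""
    return complete_brackets(sequence) == ""
```

Once such a function exists, the tool user only has to extract the bracket string from the question and pass it in, for example `complete_brackets("( [ {")`.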

Key Novelty

LLMs As Tool Makers (LATM)

Mimics human evolution by enabling LLMs to fabricate their own reusable tools (Python functions) rather than just using existing ones.
Divides labor: A heavy, 'smart' model (Tool Maker) creates the tool once, and a light, 'cheap' model (Tool User) reuses it for all subsequent requests.
Introduces 'Functional Caching': Storing the generated tool (logic) to handle a class of future requests, rather than just caching static text answers.

Evaluation Highlights

+71.8 percentage points accuracy (20.4% → 92.2%) on the BigBench Dyck Language task using GPT-3.5 Turbo as the tool user, compared to standard Chain-of-Thought prompting.
+38.0 percentage points accuracy (61.6% → 99.6%) on the BigBench Tracking Shuffled Objects task using GPT-3.5 Turbo with generated tools, compared to CoT.
Achieves performance equivalent to GPT-4 on reasoning tasks while significantly reducing inference costs by delegating execution to GPT-3.5.

Breakthrough Assessment

8/10

Significantly shifts the paradigm from 'tool use' to 'tool creation,' offering a practical solution to the cost/performance trade-off in deploying LLMs for complex logic.

⚙️ Technical Details

Problem Definition

Setting: Given a task with demonstrations, generate a reusable Python function (tool) and use it to solve new instances of that task.

Inputs: Task demonstrations (input-output pairs) and a new query

Outputs: Python function (tool) and the execution result for the new query

Pipeline Flow

Dispatcher: Incoming Task → Check Cache
Branch A (Hit): Tool User → Execute Existing Tool
Branch B (Miss): Tool Maker → Create New Tool → Tool User
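The flow above can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's implementation: the `Dispatcher` class, the `maker_calls` counter, and `toy_tool_maker` are hypothetical, the task type is passed in explicitly instead of being identified by a lightweight model, and the tool maker is a plain Python stub standing in for a call to a powerful model.

```python
from typing import Callable, Dict

Tool = Callable[[str], str]


class Dispatcher:
    """Routes each request to a cached tool, or asks the maker for a new one."""

    def __init__(self, tool_maker: Callable[[str], Tool]) -> None:
        self.tool_maker = tool_maker
        self.cache: Dict[str, Tool] = {}  # task type -> reusable tool
        self.maker_calls = 0              # counts the expensive calls

    def solve(self, task_type: str, query: str) -> str:
        tool = self.cache.get(task_type)
        if tool is None:
            # Branch B (miss): build the tool once and store it.
            tool = self.tool_maker(task_type)
            self.maker_calls += 1
            self.cache[task_type] = tool
        # Branch A (hit), and the tail of Branch B: apply the tool.
        return tool(query)


def toy_tool_maker(task_type: str) -> Tool:
    """Stub for the Tool Maker; only knows one hypothetical task type."""
    if task_type == "word_sorting":
        return lambda query: " ".join(sorted(query.split()))
    raise ValueError(f"no tool can be made for task type {task_type!r}")
```

Two requests of the same type trigger a single maker call; the second is served entirely from the cache, which is where the cost saving comes from.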

System Modules

Dispatcher

Determines if a cached tool exists for the incoming task type or if a new tool is needed

Model or implementation: Lightweight Model (e.g., GPT-3.5 Turbo)

Tool Maker

Generates a generic Python function to solve the task based on demonstrations

Model or implementation: Powerful Model (e.g., GPT-4)

Tool User

Uses the provided tool to solve a specific instance of the task

Model or implementation: Lightweight Model (e.g., GPT-3.5 Turbo)

Novel Architectural Elements

Separation of 'Maker' and 'User' roles utilizing different model classes (heavy vs. light) for cost optimization.
Introduction of the 'Tool Wrapping' stage where the maker generates usage demonstrations for the user.
Functional Cache architecture allowing the storage of algorithmic capabilities (tools) rather than just static text.
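As a rough picture of the 'Tool Wrapping' idea, the sketch below packages a tool's source code together with usage demonstrations into a prompt for the tool user. The `WrappedTool` class and the prompt layout are assumptions made for illustration; the prompts actually used are given in the paper's Appendix C.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WrappedTool:
    source: str                   # Python source of the generated tool
    demos: List[Tuple[str, str]]  # (question, function call) pairs

    def to_prompt(self, question: str) -> str:
        """Build a few-shot prompt asking the tool user for a function call."""
        parts = ["Use the function below to answer the question.", self.source]
        for demo_question, demo_call in self.demos:
            parts.append(f"Question: {demo_question}\nCall: {demo_call}")
        # The tool user is expected to complete the final "Call:" line.
        parts.append(f"Question: {question}\nCall:")
        return "\n\n".join(parts)


# Hypothetical wrapped tool for a word-sorting task.
sort_tool = WrappedTool(
    source="def sort_words(words):\n    return sorted(words)",
    demos=[("Sort: pear apple", "sort_words(['pear', 'apple'])")],
)
prompt = sort_tool.to_prompt("Sort: kiwi fig")
```

The lightweight model then only needs to translate a new question into one function call, which is a much easier job than reasoning through the task in text.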

Modeling

Base Model: GPT-4 (Tool Maker) and GPT-3.5 Turbo (Tool User)

Compute: Not reported in the paper (Inference-only approach using APIs)

Comparison to Prior Work

vs. CoT: LATM generates code to solve the logic, avoiding the hallucination and state-tracking errors of text-based reasoning.
vs. Chameleon: LATM emphasizes 'Making' over just 'Using', and creates tools that handle the full task logic for reuse.
vs. CREATOR: CREATOR also generates tools, but LATM explicitly optimizes for the cost-saving split between a heavy Maker and a light User [not cited in paper].

Limitations

Reliant on the capability of the Tool Maker (GPT-4) to write correct Python code; if the Maker fails, the User cannot solve the task.
Applicability is limited to tasks that can be algorithmically solved via Python functions (e.g., logic, arithmetic) rather than open-ended creative writing.
The Tool Verification stage relies on the model's ability to self-debug, which is not always perfect.
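To make the verification limitation concrete, here is a minimal sketch of checking a candidate tool against input-output demonstrations. The `verify_tool` helper and the two example tools are hypothetical. In a self-debugging loop, the failure messages would be sent back to the Tool Maker, and the final tool is only as good as the maker's ability to act on them.

```python
from typing import Any, Callable, List, Sequence, Tuple

TestCase = Tuple[Sequence[Any], Any]  # (positional arguments, expected output)


def verify_tool(tool: Callable[..., Any], tests: List[TestCase]) -> List[str]:
    """Run `tool` on each test case and return a list of failure messages."""
    failures = []
    for args, expected in tests:
        try:
            actual = tool(*args)
        except Exception as exc:  # a crashing tool is also a failed tool
            failures.append(f"{args!r}: raised {type(exc).__name__}: {exc}")
            continue
        if actual != expected:
            failures.append(f"{args!r}: expected {expected!r}, got {actual!r}")
    return failures


def buggy_sort(words):
    """Deliberately wrong candidate tool: returns the input unchanged."""
    return list(words)


tests = [((["b", "a"],), ["a", "b"]), ((["fig"],), ["fig"])]
```

An empty list from `verify_tool(sorted, tests)` means the tool passed. For `buggy_sort` the list is non-empty, and its messages are what a maker model would have to interpret correctly to repair the tool.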

Reproducibility

Code: publicly available at https://github.com/ctlllll/LLM-ToolMaker

The paper provides detailed prompt structures in Appendix C. Experiments used OpenAI API models (GPT-4, GPT-3.5 Turbo).

📊 Experiments & Results

Evaluation Setup

Evaluated on complex reasoning tasks requiring algorithmic logic.

Benchmarks:

BigBench: diverse reasoning tasks (Logical Deduction, Tracking Shuffled Objects, Dyck Language, etc.)
Scheduling Meeting: real-world scenario simulation [New]

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

With GPT-3.5 Turbo as the tool user, LATM substantially outperforms standard Chain-of-Thought (CoT) prompting on every task listed below, with gains ranging from +13.3 to +71.8 percentage points.
Benchmark	Metric	Baseline (CoT, %)	This Paper (LATM, %)	Δ (points)
Dyck Language	Accuracy	20.4	92.2	+71.8
Tracking Shuffled Objects (5)	Accuracy	61.6	99.6	+38.0
Logical Deduction (5)	Accuracy	66.4	79.7	+13.3
Word Sorting	Accuracy	59.2	98.3	+39.1

Main Takeaways

Lightweight models (GPT-3.5) equipped with high-quality tools generated by powerful models (GPT-4) can match or outperform powerful models acting alone.
The 'Functional Cache' mechanism amortizes the high cost of the Tool Maker across many instances, drastically reducing average serving costs.
Algorithmic tasks (sorting, shuffling, logic puzzles) show the highest gains because Python tools naturally handle the strict logic that text-based CoT struggles with.
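The amortization argument can be written as simple arithmetic: if making a tool costs a one-time amount and each later request costs only a cheap tool-user call, the average cost per request falls toward the tool-user cost as requests accumulate. The sketch below uses made-up cost units that do not come from the paper, and both function names are hypothetical.

```python
import math
from typing import Optional


def average_cost_latm(maker_cost: float, user_cost: float, n_requests: int) -> float:
    """Average cost per request when one tool serves `n_requests` requests."""
    if n_requests <= 0:
        raise ValueError("n_requests must be positive")
    return maker_cost / n_requests + user_cost


def break_even_requests(
    maker_cost: float, user_cost: float, strong_cost: float
) -> Optional[int]:
    """Smallest number of requests at which LATM is no more expensive than
    calling the powerful model on every request, or None if that never happens.
    """
    saving_per_request = strong_cost - user_cost
    if saving_per_request <= 0:
        return None
    return math.ceil(maker_cost / saving_per_request)


# Hypothetical units: tool making costs 50, a tool-user call costs 1,
# and a direct call to the powerful model costs 10.
n_star = break_even_requests(maker_cost=50, user_cost=1, strong_cost=10)
```

With these illustrative numbers the break-even point is 6 requests, and at 100 requests the average cost is 1.5 units per request against 10 for the powerful model alone.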

📚 Prerequisite Knowledge

Prerequisites

In-context learning / Few-shot prompting
Python programming concepts (functions, unit tests)
Chain-of-Thought (CoT) prompting

Key Terms

LATM: LLMs As Tool Makers—the proposed framework where models generate their own tools to solve tasks.

PbE: Programming by Example—a paradigm where the model generates a program based on input-output examples.

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer.

Functional Cache: A caching mechanism that stores reusable tools (logic) capable of processing new inputs, unlike traditional caches that store static text responses.

Dispatcher: A lightweight model component that decides whether an incoming request can be solved by an existing tool or requires a new tool to be made.

BigBench: A collaborative benchmark for measuring the capabilities of large language models across diverse tasks.

Dyck Language: A task involving checking the correct nesting of brackets/parentheses, often used to test recursive reasoning.

Tool Maker: The powerful LLM (e.g., GPT-4) responsible for generating, verifying, and wrapping the Python tool.

Tool User: The lightweight LLM (e.g., GPT-3.5) responsible for calling the pre-made tool to solve specific instances.