
Large Language Models as Tool Makers

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, Denny Zhou
Google DeepMind, Princeton University, Stanford University
arXiv (2023)
Agent Reasoning Benchmark

📝 Paper Summary

Tool creation, Self-evolving, Agentic reasoning
LATM separates reasoning into a one-time 'tool-making' phase by a powerful model and a repetitive 'tool-using' phase by a lightweight model, enabling cost-effective problem solving.
Core Problem
Existing methods rely on pre-defined tools or use expensive models (like GPT-4) for every inference step, which is cost-prohibitive and inefficient for recurring complex tasks.
Why it matters:
  • Lightweight models (e.g., GPT-3.5) often fail at complex arithmetic or algorithmic reasoning tasks where powerful models succeed but cost too much.
  • Current caching systems only store text responses, missing the opportunity to cache reusable logic (tools) that can solve functionally similar but textually different requests.
Concrete Example: In the 'Dyck Language' task (matching brackets), GPT-3.5 Turbo with Chain-of-Thought (CoT) fails dramatically (20.4% accuracy) because it loses track of state. GPT-4 can solve it but is expensive. LATM has GPT-4 write a Python bracket-checking function once, which GPT-3.5 then calls to solve new instances with 92.2% accuracy.
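A minimal Python sketch of the kind of tool this example describes is given here. It is not the function GPT-4 generated in the paper: the name `close_brackets`, the set of bracket types, and the assumption that the task asks for the closing brackets still missing from a sequence are all illustrative choices.

```python
def close_brackets(sequence: str) -> str:
    """Return the closing brackets needed to balance `sequence`.

    Hypothetical tool: tracks open brackets on an explicit stack, which is
    the state a chain-of-thought answer tends to lose track of.
    """
    pairs = {"(": ")", "[": "]", "{": "}", "<": ">"}
    closers = set(pairs.values())
    stack = []
    for ch in sequence:
        if ch in pairs:
            stack.append(ch)  # remember every opener still awaiting a closer
        elif ch in closers:
            if not stack or pairs[stack[-1]] != ch:
                raise ValueError(f"unexpected closer {ch!r}")
            stack.pop()  # this closer matches the most recent opener
        # any other character (e.g. whitespace) is ignored
    # close the remaining openers, innermost first
    return " ".join(pairs[opener] for opener in reversed(stack))


print(close_brackets("[ < ( ) >"))  # prints: ]
```

Once such a function exists, the lightweight model only has to pass the input string to it; it does not have to simulate the stack itself.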
Key Novelty
LLMs As Tool Makers (LATM)
  • Mimics human evolution by enabling LLMs to fabricate their own reusable tools (Python functions) rather than just using existing ones.
  • Divides labor: A heavy, 'smart' model (Tool Maker) creates the tool once, and a light, 'cheap' model (Tool User) reuses it for all subsequent requests.
  • Introduces 'Functional Caching': Storing the generated tool (logic) to handle a class of future requests, rather than just caching static text answers.
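The division of labor and the functional cache can be sketched as follows. This is a toy under stated assumptions, not the paper's implementation: `fake_tool_maker` stands in for a call to the powerful model, `ToolCache` and the `solve` entry-point name are hypothetical, and the task type is assumed to be known for each request.

```python
from typing import Callable, Dict


def fake_tool_maker(task_type: str) -> str:
    """Stand-in for the expensive model: returns Python source for a tool."""
    if task_type == "sum_integers":
        return (
            "def solve(text):\n"
            "    return sum(int(tok) for tok in text.split())\n"
        )
    raise KeyError(task_type)


class ToolCache:
    """Caches one generated tool (logic) per task type, not text answers."""

    def __init__(self, make_tool: Callable[[str], str]) -> None:
        self._make_tool = make_tool
        self._tools: Dict[str, Callable] = {}
        self.maker_calls = 0  # how often the expensive model was invoked

    def get(self, task_type: str) -> Callable:
        if task_type not in self._tools:
            source = self._make_tool(task_type)  # one-time, expensive step
            self.maker_calls += 1
            namespace: Dict[str, object] = {}
            # Assumption: the source is trusted. Model-written code should
            # be sandboxed before being executed in a real system.
            exec(source, namespace)
            self._tools[task_type] = namespace["solve"]
        return self._tools[task_type]


cache = ToolCache(fake_tool_maker)
requests = ["1 2 3", "10 20", "7"]  # textually different, functionally alike
answers = [cache.get("sum_integers")(req) for req in requests]
print(answers, cache.maker_calls)  # prints: [6, 30, 7] 1
```

The three requests share no text, so a response cache would miss on all of them, yet the tool maker is invoked only once because the cached item is the function itself.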
Evaluation Highlights
  • +71.8 percentage points of accuracy (20.4% to 92.2%) on the BigBench Dyck Language task using GPT-3.5 Turbo as the tool user, compared to standard Chain-of-Thought prompting.
  • +38.0 percentage points of accuracy on the BigBench Tracking Shuffled Objects task using GPT-3.5 Turbo with generated tools, compared to CoT.
  • Achieves performance equivalent to GPT-4 on reasoning tasks while significantly reducing inference costs by delegating execution to GPT-3.5.
Breakthrough Assessment
8/10
Significantly shifts the paradigm from 'tool use' to 'tool creation,' offering a practical solution to the cost/performance trade-off in deploying LLMs for complex logic.