
OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning

Pan Lu, Bowen Chen, Sheng Liu, R. Thapa, Joseph Boen, James Zou
Stanford University
arXiv.org (2025)
Agent MM Reasoning Benchmark

📝 Paper Summary

Multi-call tool use with flexible plan; Agentic AI Frameworks
OctoTools is a training-free framework that enables LLMs to solve complex reasoning tasks by separating high-level planning from low-level execution and using standardized tool cards for easy extensibility.
Core Problem
Existing methods for augmenting LLMs with tools are often restricted to specialized domains or a limited set of tool types, or they require extensive training data or fine-tuning to learn tool usage.
Why it matters:
  • Current prompting methods fail to orchestrate complex reasoning steps involving visual understanding, domain knowledge, and calculation into a coherent chain.
  • Many frameworks require substantial training data, limiting adaptability to new domains, or are hard-coded for specific tools, restricting generality.
  • Providing too many tools without optimization can introduce noise and hurt performance in agentic systems.
Concrete Example: Solving a visual riddle might require fine-grained image understanding combined with text-based reasoning, while a math question needs precise computation. A standard LLM prompting approach often fails to coordinate these distinct processes, leading to hallucinated or logically disjointed answers.
Key Novelty
OctoTools: Planner-Executor with Tool Cards
  • Standardized 'Tool Cards' encapsulate heterogeneous tools (Python scripts, APIs) with metadata and usage constraints, allowing easy plug-and-play integration without re-engineering the agent.
  • Separates the 'Planner' (which sets sub-goals and selects tools) from the 'Executor' (which generates and runs executable code), reducing errors compared with having a single model do both at once.
  • Includes a lightweight 'Toolbox Optimizer' that identifies the most effective subset of tools for a specific task using a small validation set, avoiding the noise of irrelevant tools.
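The toolbox selection idea can be illustrated with a minimal sketch. The summary does not spell out the selection procedure, so this assumes one plausible greedy scheme: a candidate tool is kept only if adding it to the base toolset improves accuracy on the validation set. The names `optimize_toolbox`, `accuracy`, `toy_solve`, and the tool names are hypothetical, and `toy_solve` is a stand-in for a full agent run, not part of OctoTools.

```python
from typing import Callable, FrozenSet, Iterable, Sequence, Tuple

Example = Tuple[str, str]                      # (question, expected answer)
Solver = Callable[[str, FrozenSet[str]], str]  # hypothetical agent interface


def accuracy(solve: Solver, tools: FrozenSet[str], validation: Sequence[Example]) -> float:
    """Fraction of validation questions answered correctly with `tools` enabled."""
    if not validation:
        raise ValueError("validation set must not be empty")
    correct = sum(1 for question, answer in validation if solve(question, tools) == answer)
    return correct / len(validation)


def optimize_toolbox(
    solve: Solver,
    base_tools: Iterable[str],
    candidate_tools: Iterable[str],
    validation: Sequence[Example],
) -> FrozenSet[str]:
    """Assumed greedy rule: keep a candidate only if base + candidate beats base alone."""
    base = frozenset(base_tools)
    baseline = accuracy(solve, base, validation)
    selected = set(base)
    for tool in candidate_tools:
        if tool in base:
            continue
        if accuracy(solve, base | {tool}, validation) > baseline:
            selected.add(tool)
    return frozenset(selected)


def toy_solve(question: str, tools: FrozenSet[str]) -> str:
    """Toy stand-in for an agent: sums need the calculator; the noisy tool derails everything."""
    if "noisy_search" in tools:
        return "unknown"
    if "+" in question:
        if "calculator" not in tools:
            return "unknown"
        left, right = question.split("+")
        return str(int(left) + int(right))
    return "Paris" if question == "capital of France" else "unknown"


VALIDATION = [("2+3", "5"), ("10+7", "17"), ("capital of France", "Paris")]

selected = optimize_toolbox(toy_solve, ["generalist"], ["calculator", "noisy_search"], VALIDATION)
print(sorted(selected))  # ['calculator', 'generalist']
```

In this toy run the calculator raises validation accuracy from 1/3 to 3/3 and is kept, while the noisy tool lowers it to 0 and is dropped, which mirrors the motivation of filtering out irrelevant tools.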
Architecture
Figure 1: The conceptual framework of OctoTools, illustrating the flow from user query to final solution via the Planner and Executor.
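The query-to-solution flow can be sketched as a small loop. This is an illustrative sketch under stated assumptions, not the OctoTools API: `ToolCard`, `Step`, `rule_based_planner`, `execute`, and `solve` are hypothetical names, the planner is a fixed rule standing in for an LLM call, and the executor passes the sub-goal straight to the tool instead of generating code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ToolCard:
    """Hypothetical tool card: metadata the planner reads, plus the callable to run."""
    name: str
    description: str
    limitations: str
    run: Callable[[str], str]


@dataclass
class Step:
    """One planner decision together with the executor's result."""
    tool: str
    sub_goal: str
    result: str


def multiply(expression: str) -> str:
    """Toy tool: multiplies two integers written as 'a * b'."""
    left, right = expression.split("*")
    return str(int(left) * int(right))


def rule_based_planner(
    query: str, cards: Dict[str, ToolCard], history: List[Step]
) -> Optional[Tuple[str, str]]:
    """Stand-in for an LLM planner: returns (tool name, sub-goal), or None when done."""
    if not history and "calculator" in cards:
        return ("calculator", query)
    return None


def execute(card: ToolCard, sub_goal: str) -> str:
    """Stand-in for the executor: a real one would generate a command from the sub-goal."""
    return card.run(sub_goal)


def solve(
    query: str,
    cards: Dict[str, ToolCard],
    planner: Callable[[str, Dict[str, ToolCard], List[Step]], Optional[Tuple[str, str]]],
    max_steps: int = 5,
) -> List[Step]:
    """Alternate planning and execution until the planner stops or the step budget runs out."""
    history: List[Step] = []
    for _ in range(max_steps):
        action = planner(query, cards, history)
        if action is None:
            break
        tool_name, sub_goal = action
        result = execute(cards[tool_name], sub_goal)
        history.append(Step(tool_name, sub_goal, result))
    return history


CARDS = {
    "calculator": ToolCard(
        name="calculator",
        description="Multiplies two integers written as 'a * b'.",
        limitations="Integers only; no other operators.",
        run=multiply,
    )
}

trace = solve("12 * 7", CARDS, rule_based_planner)
print(trace[-1].result)  # 84
```

Keeping the planner and executor as separate callables means either can be swapped (for example, a different planning prompt or a sandboxed executor) without touching the tool cards, which is the kind of extensibility the summary attributes to the design.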
Evaluation Highlights
  • Achieves an average accuracy gain of 9.3% over zero-shot GPT-4o across 16 diverse reasoning benchmarks, including MathVista, MMLU-Pro, and MedQA.
  • Outperforms existing agentic frameworks (AutoGen, GPT-Functions, LangChain) by up to 10.6% when provided with the same set of tools.
  • Surpasses Chain-of-Thought (CoT) prompting baselines by an average of 7.7% across the evaluated tasks.
Breakthrough Assessment
8/10
Strong empirical gains across a very wide range of tasks (16 benchmarks) without any training. The modular tool card design addresses a major pain point in agent extensibility.