
ART: Automatic multi-step reasoning and tool-use for large language models

Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, Marco Tulio Ribeiro
University of Washington, Microsoft Research, University of California, Irvine, Allen Institute for Artificial Intelligence, Meta AI
arXiv (2023)
Agent Reasoning QA Benchmark

📝 Paper Summary

Multi-call tool use with flexible plan · Prompt-based with optimization
ART enables frozen LLMs to solve new tasks by retrieving demonstrations from related tasks that teach the model how to decompose problems and use external tools without task-specific training.
Core Problem
Existing Chain-of-Thought (CoT) and tool-use methods typically require hand-crafting task-specific prompts or fine-tuning, making them difficult to scale to new, unseen tasks.
Why it matters:
  • Current approaches struggle to generalize zero-shot to complex multi-step reasoning tasks without explicit human supervision.
  • Manually writing CoT prompts and tool-use scripts for every new task is labor-intensive and fragile.
  • LLMs have well-known weaknesses in arithmetic and in access to up-to-date knowledge that external tools can address, but integrating those tools usually requires task-specific engineering.
Concrete Example: In a physics QA task, a standard LLM might hallucinate a formula. ART retrieves a program from a related task (e.g., math word problems) that demonstrates using a search engine to find formulas and a Python calculator to compute values, allowing the LLM to correctly solve the physics problem zero-shot.
Key Novelty
Automatic Reasoning and Tool-use (ART)
  • Maintains a library of 'programs'—structured demonstrations of multi-step reasoning and tool use—for a small seed set of tasks.
  • When given a new task, retrieves programs from related seed tasks to construct a dynamic few-shot prompt.
  • Uses a structured query language that lets the frozen LLM pause generation, call external tools (e.g., search, code execution), and resume reasoning automatically (sketched below).
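A minimal sketch of this pause-and-resume loop, assuming a generic `llm` callable and a simplified `[tool] ... [end]` marker syntax; the stub tools, the marker format, and the `art_generate` helper are illustrative assumptions, not the paper's actual query language or code:

```python
import re

def search(query: str) -> str:
    """Stub search tool; a real deployment would call a search API."""
    return f"(search results for: {query})"

def run_code(src: str) -> str:
    """Stub code tool; a real deployment would execute sandboxed Python."""
    return f"(output of running: {src})"

TOOLS = {"search": search, "code": run_code}

# Assumed marker syntax: the LLM emits e.g. "[search] thermal energy formula [end]"
# whenever it wants to call a tool mid-reasoning.
TOOL_CALL = re.compile(r"\[(search|code)\](.*?)\[end\]", re.DOTALL)

def art_generate(llm, prompt: str, max_rounds: int = 8) -> str:
    """Pause-and-resume decoding: generate until a tool call appears, run the
    tool, append its output to the transcript, and let the LLM continue."""
    transcript = prompt
    for _ in range(max_rounds):
        # `llm` is any callable that continues the transcript and stops either
        # at the end of its answer or right before emitting "[end]".
        continuation = llm(transcript, stop=["[end]"])
        transcript += continuation
        call = TOOL_CALL.search(continuation + "[end]")
        if call is None:  # no tool call: the model produced its final answer
            return transcript
        tool, arg = call.group(1), call.group(2).strip()
        transcript += "[end]\n" + TOOLS[tool](arg) + "\n"  # resume with tool output
    return transcript
```

The key point is that the LLM stays frozen: all task adaptation happens in the retrieved demonstrations and in the tool outputs spliced back into the transcript.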
Architecture
Figure 1. The ART framework workflow: selecting demonstrations from a task library, generating reasoning steps with a frozen LLM, pausing for tool use, and incorporating optional human feedback.
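To make the demonstration-selection step of this workflow concrete, here is a hedged sketch of how programs might be retrieved from the task library to build the few-shot prompt; `TASK_LIBRARY`, `build_prompt`, and the bag-of-words similarity are illustrative assumptions rather than the paper's implementation (the paper also considers asking the LLM itself which seed tasks are related):

```python
from collections import Counter
from math import sqrt

def similarity(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity, a stand-in for the paper's
    task-selection strategies; chosen only to keep the sketch self-contained."""
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(wa[w] * wb[w] for w in wa)
    norm = sqrt(sum(v * v for v in wa.values())) * sqrt(sum(v * v for v in wb.values()))
    return dot / (norm + 1e-9)

# Task library: each seed task stores an instruction and demonstration "programs"
# (multi-step reasoning traces with tool calls), heavily abbreviated here.
TASK_LIBRARY = {
    "arithmetic": {
        "instruction": "Solve grade-school math word problems.",
        "programs": ["Input: ...\nQ1: [code] 24 * 7 [end]\n#1: 168\nAnswer: 168"],
    },
    "fact lookup": {
        "instruction": "Answer questions about real-world facts.",
        "programs": ["Input: ...\nQ1: [search] capital of Peru [end]\n#1: Lima\nAnswer: Lima"],
    },
}

def build_prompt(task_instruction: str, task_input: str, k: int = 2) -> str:
    """Retrieve programs from the k most similar seed tasks and prepend them
    as few-shot demonstrations for the new, unseen task."""
    ranked = sorted(
        TASK_LIBRARY.values(),
        key=lambda t: similarity(task_instruction, t["instruction"]),
        reverse=True,
    )
    demos = "\n\n".join(p for task in ranked[:k] for p in task["programs"])
    return f"{demos}\n\nInput: {task_input}\n"
```

The assembled prompt is then handed to the frozen LLM, which imitates the retrieved programs' decomposition and tool-call structure on the new input.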
Evaluation Highlights
  • Tool use yields a 12.3 percentage-point average improvement on unseen BigBench test tasks over ART without tools.
  • Matches or outperforms automatic Chain-of-Thought (Auto-CoT) prompting on 32 of 34 BigBench tasks and on all MMLU tasks, with an average improvement of over 22 percentage points.
  • With minimal human feedback (correcting 5 examples), surpasses the best published GPT-3 results by over 20 percentage points on select tasks.
Breakthrough Assessment
8/10
Significantly advances zero-shot tool use by removing the need for task-specific prompt engineering. The framework's extensibility via human feedback is a strong practical contribution.