Meta-Reasoning Improves Tool Use in Large Language Models

Lisa Alazraki, Marek Rei
Imperial College London
arXiv (2024)
Agent Reasoning Benchmark

📝 Paper Summary

Multi-call tool use with fixed plan; Tool-use post-training
Tecton improves LLM tool use by first gathering diverse candidate tools via a custom fine-tuned head, then using the frozen base model to meta-reason and select the best candidate.
Core Problem
Existing tool-augmented LLMs typically select tools via greedy decoding, which often misses the correct tool when it has slightly lower probability than the top choice.
Why it matters:
  • Math reasoning tasks require long chains of tool calls where errors compound, making brittle greedy selection a major failure point
  • Full fine-tuning for tools is computationally expensive and binds the model to a fixed toolset, while in-context learning is limited by context window size
  • Current methods fail to leverage the general reasoning capabilities of the base LLM to double-check or 'meta-reason' about specific tool choices made by specialized heads
Concrete Example: In a math problem, a model might greedily select a 'Subtract' tool when 'Divide' was the correct operation but had slightly lower probability. Tecton would capture 'Divide' in the top-k candidates and let the base model re-evaluate the context to select it.
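The gap between greedy and top-k selection in this example can be shown with a toy distribution. The sketch below is illustrative only: the tool names follow the example above, while the probabilities and the helper names `greedy_choice` and `top_k_candidates` are assumptions, not values or functions from the paper.

```python
import heapq

# Hypothetical probabilities from a tool-selection head (illustrative numbers,
# not taken from the paper). 'Divide' is the correct tool but ranks second.
tool_probs = {"Subtract": 0.41, "Divide": 0.38, "Add": 0.12, "Multiply": 0.09}


def greedy_choice(probs):
    """Return the single most probable tool, as greedy decoding would."""
    return max(probs, key=probs.get)


def top_k_candidates(probs, k):
    """Return the k most probable tools, highest probability first."""
    return heapq.nlargest(k, probs, key=probs.get)


greedy = greedy_choice(tool_probs)
candidates = top_k_candidates(tool_probs, k=2)

print("greedy:", greedy)
print("top-2 candidates:", candidates)
```

With these assumed numbers, greedy selection commits to 'Subtract', whereas the top-2 list still contains 'Divide', so a later selection step has a chance to recover it.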
Key Novelty
Tool selection via meta-reasoning (Tecton)
  • Splits tool use into two phases: a 'Reasoning' phase using a specialized, fine-tuned head to propose multiple candidate tools (top-k), and a 'Meta-Reasoning' phase using the frozen base LLM to select the best one
  • Treats tool selection as a multiple-choice meta-reasoning task for the generalist LLM, rather than a single-shot generation task for the specialist head
  • Introduces dynamic retrieval of tool demonstrations during the meta-reasoning phase to guide the frozen model's selection without retraining
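One way to picture the multiple-choice framing with retrieved demonstrations is the prompt-building sketch below. It is a minimal illustration under stated assumptions: the summary does not specify the retrieval method or the prompt format, so the word-overlap retriever, the demonstration pool, and the names `token_overlap`, `retrieve_demos`, and `build_prompt` are all hypothetical.

```python
from string import ascii_uppercase


def token_overlap(a, b):
    """Jaccard overlap between lowercase word sets.

    Assumption: a stand-in for whatever retriever is actually used.
    """
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve_demos(query, demo_pool, n):
    """Pick the n demonstrations whose questions overlap most with the query."""
    ranked = sorted(
        demo_pool,
        key=lambda d: token_overlap(query, d["question"]),
        reverse=True,
    )
    return ranked[:n]


def build_prompt(question, candidates, demos):
    """Format demonstrations plus lettered candidate tools as one prompt."""
    lines = [f"Q: {d['question']}\nTool: {d['tool']}\n" for d in demos]
    lines.append(f"Q: {question}")
    for letter, tool in zip(ascii_uppercase, candidates):
        lines.append(f"({letter}) {tool}")
    lines.append("Answer:")
    return "\n".join(lines)


# Hypothetical demonstration pool and query.
demo_pool = [
    {"question": "How many apples are left after eating 3 of 10", "tool": "Subtract"},
    {"question": "Split 12 cookies equally among 4 children", "tool": "Divide"},
    {"question": "What is the total of 5 and 7", "tool": "Add"},
]
question = "Split 20 cookies equally among 5 friends"

demos = retrieve_demos(question, demo_pool, n=1)
prompt = build_prompt(question, ["Subtract", "Divide"], demos)
print(prompt)
```

The resulting prompt would be passed to the frozen base model, which only has to pick a letter rather than generate a tool call from scratch.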
Architecture
Figure 1
The two-phase Tecton framework. Phase 1 (Reasoning): The model with a tuned head generates candidate tools from the top-k probabilities. Phase 2 (Meta-Reasoning): The tuned head is disabled, and the frozen base model selects the best tool from the candidates.
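The control flow of the two phases can be sketched as follows. This is a simplified illustration, not the authors' implementation: the tuned head is represented by a dictionary of logits, the frozen base model by a hypothetical scoring callable, and all numbers and function names are assumptions.

```python
from typing import Callable, Dict, List


def reasoning_phase(head_logits: Dict[str, float], k: int) -> List[str]:
    """Phase 1: keep the k tools the tuned head ranks highest.

    Ranking by logits gives the same order as ranking by softmax
    probabilities, so no normalization is needed here.
    """
    return sorted(head_logits, key=head_logits.get, reverse=True)[:k]


def meta_reasoning_phase(
    candidates: List[str], base_model_score: Callable[[str], float]
) -> str:
    """Phase 2: the frozen base model (tuned head disabled) picks one candidate.

    base_model_score is a hypothetical stand-in for however the base model
    rates each candidate, e.g. a log-likelihood of that option.
    """
    return max(candidates, key=base_model_score)


def select_tool(
    head_logits: Dict[str, float],
    base_model_score: Callable[[str], float],
    k: int = 3,
) -> str:
    """Run both phases and return the selected tool."""
    candidates = reasoning_phase(head_logits, k)
    return meta_reasoning_phase(candidates, base_model_score)


# Illustrative numbers only: the head slightly prefers the wrong tool,
# while the base model's scores favor 'Divide'.
head_logits = {"Subtract": 2.0, "Divide": 1.9, "Add": 0.5, "Multiply": 0.1}
base_scores = {"Subtract": -1.2, "Divide": -0.3, "Add": -2.0, "Multiply": -2.5}

chosen = select_tool(head_logits, base_scores.__getitem__, k=2)
print("selected tool:", chosen)
```

Keeping the two phases as separate functions mirrors the separation described above: the specialist only narrows the options, and the generalist makes the final choice among them.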
Evaluation Highlights
  • 20.6% accuracy on FuncQA Multi-Hop (Tecton-generate) vs. 10.1% for ToolkenGPT, roughly doubling performance on this challenging benchmark
  • Achieves gains of ~8.5 percentage points on average over ToolkenGPT across three out-of-distribution math datasets (ASDiv-XL, MAWPS-XL, SVAMP-XL)
  • Outperforms Trice by 7.2 percentage points on the in-distribution GSM8K-XL dataset
Breakthrough Assessment
7/10
Strong empirical gains on math reasoning, particularly out-of-distribution. The two-phase separation of candidate generation (specialist) and selection (generalist) is a clever architectural insight for tool use.