ToolQA: A Dataset for LLM Question Answering with External Tools

📝 Paper Summary

Benchmark datasets, Agentic AI, Tool-use evaluation

ToolQA is a benchmark designed to evaluate Large Language Models' ability to answer questions using external tools by ensuring reference data has minimal overlap with pre-training corpora.

Core Problem

Existing tool-use evaluations often fail to distinguish whether LLMs are truly using tools or simply recalling memorized facts from pre-training data.

Why it matters:

Evaluations biased by data contamination cannot accurately reflect a model's true reasoning and tool-use competency
LLMs suffer from hallucination and weak numerical reasoning when relying solely on internal weights
Distinguishing between memorization and actual problem-solving is critical for developing reliable agents

Concrete Example: If an LLM is asked about flight data, it might answer correctly using memorized historical schedules rather than querying a flight database tool. ToolQA prevents this by using recent or synthetic data (e.g., specific flight status on '01/09/22') that the model could not have memorized.

Key Novelty

ToolQA Benchmark Construction

Curates 8 reference corpora (text, tables, graphs) specifically selected to have minimal overlap with LLM pre-training data (e.g., recent logs, synthetic personal agendas)
Defines 13 specialized tools (SQL interpreter, graph loader, math calculator) required to extract answers from these corpora
Uses a template-based 'Human-Guided Question Generation' process where humans validate templates and algorithms instantiate them with specific values to ensure tool necessity
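
As a rough illustration of why such tools are necessary, the sketch below answers a question over a synthetic flight table by chaining an SQL-style tool with a calculator-style tool. The table schema, the flight rows, and the function names `sql_interpreter` and `calculator` are assumptions made for this example; they are not ToolQA's actual tool implementations.

```python
import sqlite3

def build_flight_db():
    """Create an in-memory table of made-up flight rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE flights (flight TEXT, date TEXT, origin TEXT, dep_delay INTEGER)"
    )
    rows = [
        ("AA100", "01/09/22", "JFK", 12),
        ("AA101", "01/09/22", "JFK", 45),
        ("UA200", "01/09/22", "ORD", 0),
    ]
    conn.executemany("INSERT INTO flights VALUES (?, ?, ?, ?)", rows)
    return conn

def sql_interpreter(conn, query, params=()):
    """Stand-in for an SQL tool: run a query and return all rows."""
    return conn.execute(query, params).fetchall()

def calculator(values):
    """Stand-in for a math tool: average a list of numbers."""
    return sum(values) / len(values)

conn = build_flight_db()
rows = sql_interpreter(
    conn,
    "SELECT dep_delay FROM flights WHERE origin = ? AND date = ? ORDER BY dep_delay",
    ("JFK", "01/09/22"),
)
delays = [r[0] for r in rows]
average_delay = calculator(delays)
print(average_delay)
```

Because the rows are synthetic, no amount of memorized knowledge yields the answer; the model has to issue the query and then do the arithmetic.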

Architecture

The three-phase dataset curation process for ToolQA.

Evaluation Highlights

Standard ChatGPT and Chain-of-Thought (CoT) prompting fail almost completely (about 5% success or lower) because they cannot access the required external knowledge
Tool-augmented ReAct outperforms baselines significantly on easy questions (43.15%) but struggles on hard questions (8.2%)
Hard questions involving complex reasoning and tool composition remain a major challenge for current state-of-the-art tool-use methods

Breakthrough Assessment

8/10

Addresses the critical issue of data contamination in tool-use evaluation. The rigorous construction process ensures models must use tools, providing a more faithful measure of agentic capability.

⚙️ Technical Details

Problem Definition

Setting: Open-ended question answering where answers must be derived from external reference corpora using a set of provided tools

Inputs: A tuple (question, reference corpora, list of available tools)

Outputs: The correct answer extracted or calculated from the reference corpora
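
The input tuple and output can be pictured as a small data structure. This is only a sketch: the class name `ToolQATask`, the `solve` function, and the toy agenda corpus are hypothetical and do not come from the ToolQA codebase.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class ToolQATask:
    question: str
    corpora: Dict[str, Any]               # corpus name -> loaded reference data
    tools: Dict[str, Callable[..., Any]]  # tool name -> callable operator

def solve(task: ToolQATask) -> str:
    """Toy solver: derive the answer by calling a tool on a corpus."""
    agenda = task.corpora["agenda"]
    return str(task.tools["lookup"](agenda, "Alice"))

task = ToolQATask(
    question="What is on Alice's agenda?",
    corpora={"agenda": {"Alice": "dentist at 3pm"}},  # synthetic data
    tools={"lookup": lambda corpus, key: corpus[key]},
)
answer = solve(task)
print(answer)
```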

Pipeline Flow

Reference Data Collection (gathering low-overlap corpora)
Human-Guided Question Generation (creating templates)
Programmatic Answer Generation (generating ground truth)
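
The second and third phases can be sketched as follows: a validated template is filled with values sampled from the reference data, and a code operator computes the ground-truth answer from the same data. The template wording, the table contents, and the helper names (`instantiate`, `answer_operator`, `generate_examples`) are illustrative assumptions, not the paper's actual templates or scripts.

```python
import random

# Hypothetical reference table standing in for a low-overlap corpus.
FLIGHTS = [
    {"flight": "AA100", "date": "01/09/22", "dep_delay": 12},
    {"flight": "UA200", "date": "01/10/22", "dep_delay": 0},
]

# A template a human would have validated before instantiation.
TEMPLATE = "What was the departure delay of flight {flight} on {date}?"

def instantiate(template, row):
    """Fill the template slots with concrete values from one row."""
    return template.format(flight=row["flight"], date=row["date"])

def answer_operator(table, flight, date):
    """Code operator mirroring the table lookup a tool call would perform."""
    for row in table:
        if row["flight"] == flight and row["date"] == date:
            return str(row["dep_delay"])
    raise KeyError((flight, date))

def generate_examples(table, template, n, seed=0):
    """Sample rows, instantiate questions, and compute answers programmatically."""
    rng = random.Random(seed)
    picks = rng.sample(table, k=min(n, len(table)))
    return [
        {
            "question": instantiate(template, r),
            "answer": answer_operator(table, r["flight"], r["date"]),
        }
        for r in picks
    ]

examples = generate_examples(FLIGHTS, TEMPLATE, n=2)
for ex in examples:
    print(ex["question"], "->", ex["answer"])
```

Computing answers from the data rather than annotating them by hand keeps each question and its ground truth consistent with the corpus by construction.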

System Modules

Reference Data Collector (Data Construction)

Selects corpora across 6 dimensions (temporal, spatial, social, scientific, mathematical, personal) to minimize pre-training overlap

Model or implementation: N/A (Process)

Template Generator (Data Construction)

Generates candidate question templates using ChatGPT, which are then manually validated

Model or implementation: ChatGPT (gpt-3.5-turbo)

Answer Generator (Data Construction)

Calculates ground truth answers by executing code operators corresponding to tools

Model or implementation: Python scripts / Tool Chains

Novel Architectural Elements

Three-phase dataset curation process, combining human-validated templates with automated instantiation, designed explicitly to minimize overlap with pre-training data
Programmatic generation of ground truth answers using tool operators rather than human annotation

Modeling

Base Model: Evaluated on ChatGPT (gpt-3.5-turbo) and text-davinci-003

Training Method: In-context learning / Prompting (ReAct, Chameleon, CoT)

Adaptation: None (Inference-only evaluation)

Trainable Parameters: 0 (Frozen models)

Compute: Not reported in the paper

Comparison to Prior Work

vs. API-Bank/ToolBench: ToolQA focuses on open-ended QA correctness rather than intermediate API trace accuracy
vs. Standard QA Datasets (e.g., Natural Questions): ToolQA is explicitly designed to be unanswerable via internal knowledge (memorization)

Limitations

Evaluation relies on exact match of normalized answers, which may penalize correct but differently formatted responses
Hard questions have very low success rates (<10%), indicating current LLMs struggle significantly with complex tool composition
The exact dataset size and number of templates per domain are not fully detailed in the main text

Reproducibility

Code: https://github.com/night-chen/ToolQA

Data and code are publicly available on GitHub. The paper details the specific tools (13 types) and the prompt strategies used (8 tool-level demonstrations).

📊 Experiments & Results

Evaluation Setup

Zero-shot or Few-shot (with tool demonstrations) Question Answering

Benchmarks:

ToolQA (Tool-augmented Question Answering) [New]

Metrics:

Success Rate (Exact match between normalized prediction and ground truth)
Statistical methodology: Not explicitly reported in the paper
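
A minimal sketch of the success-rate metric is shown below. The normalization rules (lowercasing, stripping punctuation and English articles, collapsing whitespace) follow a common QA convention and are an assumption here; the exact rules used by ToolQA may differ.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def success_rate(predictions, references):
    """Percentage of predictions that exactly match after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

preds = ["The JFK airport.", "12 minutes", "LAX"]
golds = ["jfk airport", "12", "lax"]
rate = success_rate(preds, golds)
print(round(rate, 2))
```

The second prediction ("12 minutes" against "12") is arguably correct but still counts as a failure, which is the formatting sensitivity noted under Limitations.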

Key Results

Performance comparisons on Easy vs. Hard questions in ToolQA show that methods relying on internal knowledge fail, while tool-augmented methods succeed on simple tasks but struggle with complex reasoning.
Benchmark	Metric	Baseline	This Paper	Δ
ToolQA (Easy Questions)	Success Rate (%)	5	43.15	+38.15
ToolQA (Hard Questions)	Success Rate (%)	2	8.2	+6.2
ToolQA (Easy Questions)	Success Rate (%)	10.6	43.15	+32.55
ToolQA (Hard Questions)	Success Rate (%)	1.9	8.2	+6.3
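
The Δ column is the difference between the two success rates in each row. The short check below recomputes it from the values listed in the table; the baseline methods are left unnamed because the table does not label them.

```python
# (difficulty, baseline success rate, "This Paper" success rate), copied from the table.
rows = [
    ("Easy", 5.0, 43.15),
    ("Hard", 2.0, 8.2),
    ("Easy", 10.6, 43.15),
    ("Hard", 1.9, 8.2),
]

# Round to two decimals to hide floating-point noise.
deltas = [round(ours - base, 2) for _, base, ours in rows]
print(deltas)
```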

Main Takeaways

Standard LLMs (ChatGPT) and Chain-of-Thought cannot answer ToolQA questions, confirming the dataset successfully minimizes memorization/internal knowledge leaks.
Tool-augmented LLMs (ReAct) significantly outperform standard LLMs on easy questions, which require accessing a single piece of information.
Performance drops drastically for all models on 'Hard' questions, highlighting a gap in current models' ability to perform complex reasoning and multi-step tool composition.
ReAct outperforms Chameleon because it utilizes execution feedback to refine its next actions, whereas Chameleon plans without intermediate feedback.
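
The contrast in the last takeaway can be illustrated with a toy example. This sketch does not call an LLM: a hard-coded repair step (uppercasing the key) stands in for the model's reasoning, and the `lookup` tool, the table, and both function names are hypothetical.

```python
TABLE = {"AA100": 12, "UA200": 0}  # synthetic flight -> delay data

def lookup(table, key):
    """Toy tool: return the value, or an error string the caller can react to."""
    if key not in table:
        return f"error: unknown key {key!r}; valid keys: {sorted(table)}"
    return table[key]

def plan_first(keys):
    """Plan-then-execute sketch: the plan is fixed up front and run blindly."""
    return [lookup(TABLE, k) for k in keys]

def feedback_loop(first_key, max_steps=3):
    """Feedback-driven sketch: inspect each observation before the next action."""
    key = first_key
    trace = []
    for _ in range(max_steps):
        observation = lookup(TABLE, key)
        trace.append((key, observation))
        if not str(observation).startswith("error"):
            return observation, trace
        # Stand-in for a reasoning step: repair the action using the error.
        key = key.upper()
    return None, trace

blind_results = plan_first(["aa100"])
answer, trace = feedback_loop("aa100")
print(blind_results)
print(answer, trace)
```

The fixed plan ends with an error observation, while the feedback loop recovers on its second step, which mirrors the explanation given for ReAct's advantage.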

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and their pre-training data cutoffs
Familiarity with tool-augmented LLM frameworks (ReAct, Chameleon)
Basic concepts of question answering (QA) benchmarks

Key Terms

ReAct: Reasoning and Acting, a prompting strategy where LLMs generate reasoning traces and tool actions in an interleaved manner

Chain-of-Thought (CoT): A prompting method that encourages LLMs to generate intermediate reasoning steps before producing a final answer

Chameleon: A tool-augmented LLM method that uses a controller to compose tools for solving subtasks

Hallucination: When an LLM generates plausible but incorrect or ungrounded information

Reference Corpora: External datasets (text, tables, graphs) provided in ToolQA that contain the ground-truth information needed to answer questions

Programmatic Answer Generation: The process of generating ground-truth answers by running code (operators) that simulates correct tool usage on the reference data

ToolQA: The specific benchmark dataset introduced in this paper

GSM8K: A dataset of grade school math word problems, used here as a source for mathematical reference data