GAIA: a benchmark for General AI Assistants

📝 Paper Summary

Benchmark datasets | Metrics and evaluation | Analysis

GAIA is a benchmark of conceptually simple but hard-to-solve questions requiring reasoning, tool use, and multi-modality, on which humans excel but current advanced AI assistants fail significantly.

Core Problem

Current LLM benchmarks (like MMLU) are becoming saturated or target tasks difficult for humans (e.g., law exams), yet models still fail at conceptually simple real-world assistant tasks requiring multi-step planning and tool use.

Why it matters:

Evaluating open-ended generation is difficult and prone to bias when using model-based evaluation.
Existing benchmarks are prone to memorization and gameability.
There is a discrepancy between LLMs passing professional exams and failing basic assistant tasks like finding specific information on the web.
Tasks difficult for humans are not necessarily difficult for AI systems, and vice versa.

Concrete Example: Question: 'What was the actual enrollment count of the clinical trial on H. pylori in acne vulgaris patients from Jan-May 2018 as listed on the NIH website?' A model must browse to the NIH site, search for the specific trial, filter by date, and extract the count (90). GPT-4 fails this, while humans succeed easily.

Key Novelty

GAIA (General AI Assistants benchmark)

Focuses on questions that are conceptually simple for humans (92% success) but hard for AI (0-30% success), reversing the trend of seeking 'superhuman' difficulty benchmarks.
Questions require fundamental abilities: reasoning, multi-modality, web browsing, and tool proficiency, rather than just specialized knowledge.
Answers are factual, concise, and unambiguous (numbers, strings, or lists), enabling fast, robust, and exact-match automatic evaluation without model-based judges.

Evaluation Highlights

Human respondents achieve 92% accuracy on average, whereas GPT-4 equipped with plugins achieves only 15% on average.
GPT-4 with plugins scores about 30% on the easiest tasks (Level 1) but 0% on the hardest (Level 3).
Web search alone (by humans) is slower and less effective for complex queries than a competent AI assistant could theoretically be, highlighting the potential utility of solving GAIA.

Breakthrough Assessment

9/10

Proposes a fundamental shift in evaluation philosophy (hard for AI, easy for humans) that exposes the 'stupidity' of current SOTA models on basic tasks. The clear, unambiguous evaluation metric solves a major pain point in agentic evaluation.

⚙️ Technical Details

Problem Definition

Setting: Open-ended question answering requiring multi-step reasoning, tool use, and multi-modality handling.

Inputs: A natural language question, optionally accompanied by a file (image, spreadsheet, PDF, audio).

Outputs: A single factual answer (string, number, or comma-separated list).
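
As a rough illustration of the setting above, one task can be represented as a small record. This is a minimal sketch: the class name GaiaTask and its field names are hypothetical and are not the schema of the released dataset. The sample question and answer are the NIH example quoted earlier in this summary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GaiaTask:
    """Hypothetical container for one question (field names are illustrative)."""

    question: str                          # natural language question
    expected_answer: str                   # string, number, or comma-separated list
    attachment_path: Optional[str] = None  # optional image, spreadsheet, PDF, or audio file

    def has_attachment(self) -> bool:
        """Return True when the question ships with a file."""
        return self.attachment_path is not None


# The concrete example from this summary; it has no attached file.
example = GaiaTask(
    question=(
        "What was the actual enrollment count of the clinical trial on "
        "H. pylori in acne vulgaris patients from Jan-May 2018 as listed "
        "on the NIH website?"
    ),
    expected_answer="90",
)
```

Keeping the answer as a plain string mirrors the output description above, where a number or a list is still delivered as short text.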

Pipeline Flow

Annotator creates question based on source of truth
Annotator provides answer and meta-data (steps, tools)
Two independent annotators validate the question and answer
Model is prompted with the question plus any accompanying file (evidence)
Model executes steps (potentially using tools)
Model outputs strictly formatted final answer
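
The last step of the flow above, pulling a strictly formatted answer out of free-form model output, might look like the sketch below. It assumes the model was instructed to end its response with a line starting with "FINAL ANSWER:". That marker and the function name are assumptions made for illustration; this summary does not state the exact template.

```python
from typing import Optional

FINAL_ANSWER_MARKER = "FINAL ANSWER:"  # assumed marker, for illustration only


def extract_final_answer(
    model_output: str, marker: str = FINAL_ANSWER_MARKER
) -> Optional[str]:
    """Return the text after the last marker, or None if the marker is absent."""
    position = model_output.rfind(marker)
    if position == -1:
        return None  # the model did not follow the requested format
    remainder = model_output[position + len(marker):].strip()
    lines = remainder.splitlines()
    # Keep only the first line after the marker; trailing chatter is ignored.
    return lines[0].strip() if lines else ""
```

Using the last occurrence of the marker lets the model reason freely, and even revise itself, before committing to an answer.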

System Modules

Question Creator (Dataset Creation)

Drafts unambiguous questions based on sources of truth (web, files)

Model or implementation: Human Annotators

Validation (Dataset Creation)

Verifies unambiguity and correctness of questions

Model or implementation: Human Annotators (x2)

Evaluation Prompt

Standardizes model output format for automated scoring

Model or implementation: Prefix Prompt

Novel Architectural Elements

Reliance on 'Proof of Work' concept for evaluation: tasks are hard to generate/solve but easy to verify via simple factual answers.
Design philosophy targeting tasks easy for humans (92%) but hard for AI, contrasting with 'superhuman' benchmarks.

Comparison to Prior Work

vs. MMLU/GSM8k: GAIA targets open-ended real-world tasks requiring tool use/web browsing, not just static knowledge or math.
vs. AgentBench: GAIA operates in the open world (live web) rather than simulated/closed environments.
vs. ToolQA: GAIA questions are hand-crafted and diverse to prevent gameability, rather than templated.
vs. Human Eval [general]: GAIA allows automatic exact-match scoring due to unambiguous factoid answers, avoiding the cost/subjectivity of human judges.

Limitations

Missing evaluation of the reasoning trace; only the final answer is scored.
Reliance on the live web means questions may 'decay' if websites change or disappear (though robust sources were preferred).
Lack of linguistic diversity; restricted to English language and English-speaking web content.
Evaluation of closed-source assistants (GPT-4) is not fully reproducible due to changing plugins and model versions.

Reproducibility

Code: https://huggingface.co/gaia-benchmark

The benchmark is publicly available (https://huggingface.co/gaia-benchmark): 166 validation questions are released with answers and 300 test questions without answers. A scoring function is provided. Because evaluation relies on closed-source models (GPT-4) and a changing web environment, exact reproduction of the baselines becomes difficult over time.

📊 Experiments & Results

Evaluation Setup

Zero-shot prompting of AI assistants with questions and potential file attachments. Models expected to use tools (browser, code interpreter) to find answers.
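
A zero-shot evaluation loop of this kind can be sketched as follows. Everything here is illustrative: the agent is any callable that maps a question to an answer string, the function name is hypothetical, and the comparison is a bare string match rather than the benchmark's own scoring function.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# (level, question, ground-truth answer); toy structure for this sketch.
TaskRow = Tuple[int, str, str]


def success_rate_by_level(
    tasks: Iterable[TaskRow], agent: Callable[[str], str]
) -> Dict[int, float]:
    """Ask the agent each question once and report percent correct per level."""
    totals: Dict[int, int] = defaultdict(int)
    correct: Dict[int, int] = defaultdict(int)
    for level, question, truth in tasks:
        totals[level] += 1
        # Simplified comparison; a real scorer would normalize more carefully.
        if agent(question).strip() == truth.strip():
            correct[level] += 1
    return {lvl: 100.0 * correct[lvl] / totals[lvl] for lvl in sorted(totals)}


# Made-up toy tasks and a stub agent, not real benchmark data.
toy_tasks = [(1, "2 + 2?", "4"), (1, "Capital of France?", "Paris"), (2, "Hard one?", "x")]
stub_answers = {"2 + 2?": "4", "Capital of France?": "Lyon"}
rates = success_rate_by_level(toy_tasks, lambda q: stub_answers.get(q, ""))
```

Reporting per level rather than one pooled number matches how the results in this summary are broken down.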

Benchmarks:

GAIA (General Assistant Questions) [New]

Metrics:

Success Rate (Exact Match against ground truth)
Statistical methodology: Not explicitly reported in the paper
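
An exact-match check "up to normalization" could be approximated as in the sketch below. The normalization rules are assumptions chosen for illustration (numeric comparison after stripping "$" and "%", case and punctuation folding for strings, element-wise comparison for comma-separated lists). They are not a copy of the benchmark's released scoring function.

```python
import re
import string
from typing import Optional


def _as_number(text: str) -> Optional[float]:
    """Parse text as a float after dropping '$' and '%', else return None."""
    cleaned = text.strip().replace("$", "").replace("%", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def _normalize_text(text: str) -> str:
    """Lowercase, remove ASCII punctuation, and collapse whitespace."""
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def _match_item(prediction: str, truth: str) -> bool:
    truth_number = _as_number(truth)
    if truth_number is not None:
        # Numeric ground truth: compare as numbers so "90" matches "90.0".
        return _as_number(prediction) == truth_number
    return _normalize_text(prediction) == _normalize_text(truth)


def exact_match(prediction: str, ground_truth: str) -> bool:
    """Score one answer; comma-separated ground truths are compared item by item."""
    if "," in ground_truth:
        truth_items = ground_truth.split(",")
        pred_items = prediction.split(",")
        return len(pred_items) == len(truth_items) and all(
            _match_item(p, t) for p, t in zip(pred_items, truth_items)
        )
    return _match_item(prediction, ground_truth)
```

Because a comma always signals a list in this sketch, numeric answers are assumed to be written without thousands separators.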

Key Results

Human performance significantly outperforms all AI systems across all difficulty levels.
Benchmark	Metric	Baseline	This Paper	Δ
GAIA (Level 1)	Success Rate (%)	92	30.3	-61.7
GAIA (Level 2)	Success Rate (%)	93	9.8	-83.2
GAIA (Level 3)	Success Rate (%)	92	0.0	-92.0

Augmenting GPT-4 with plugins improves performance over standard GPT-4 and AutoGPT.
Benchmark	Metric	Baseline	This Paper	Δ
GAIA (All Levels)	Success Rate (%)	9.3	14.9	+5.6
GAIA (Level 2)	Success Rate (%)	1.5	9.8	+8.3
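
The Δ column is the "This Paper" value minus the "Baseline" value, in percentage points. The short check below recomputes it from the figures listed above. The variable and function names are illustrative only.

```python
# (benchmark row, baseline, this paper), copied from the rows above.
rows = [
    ("GAIA (Level 1)", 92.0, 30.3),
    ("GAIA (Level 2)", 93.0, 9.8),
    ("GAIA (Level 3)", 92.0, 0.0),
    ("GAIA (All Levels)", 9.3, 14.9),
    ("GAIA (Level 2), plugin comparison", 1.5, 9.8),
]


def delta(baseline: float, result: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(result - baseline, 1)


deltas = [delta(baseline, result) for _, baseline, result in rows]

for (name, _, _), value in zip(rows, deltas):
    print(f"{name}: {value:+.1f}")
```

A negative value means the system in the "This Paper" column trails the baseline, and a positive value means it improves on it.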

Main Takeaways

Tool augmentation (web browsing, code interpretation) is critical: GPT-4 with plugins outperforms base GPT-4, unlocking new capabilities.
Current 'autonomous' agents like AutoGPT perform poorly compared to manually guided plugin use, struggling with Level 2 tasks.
The benchmark effectively stratifies difficulty: models degrade sharply from Level 1 to Level 3, while humans maintain ~92% accuracy throughout.
Web search engines alone are insufficient baselines for complex queries, as answers often require synthesizing information from multiple pages or files.

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Large Language Models (LLMs) and their limitations
Understanding of tool-augmented LLMs (plugins, web browsing, code interpreters)
Knowledge of current AI benchmarking landscape (MMLU, GSM8k)

Key Terms

t-AGI: A system that beats most human experts who are given time t to perform a task.

Level 1: Questions requiring no tools or at most one tool and no more than 5 steps.

Level 2: Questions generally involving between 5 and 10 steps and requiring the combination of different tools.

Level 3: Questions for a near-perfect general assistant, requiring arbitrarily long sequences of actions and tool use.

Exact Match: Evaluation metric where the model's output must precisely match the ground truth (up to normalization).

Advanced Data Analysis: A GPT-4 mode allowing code execution and file reading (formerly Code Interpreter).

AutoGPT: An open-source application attempting to make GPT-4 fully autonomous.
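
The Level 1 to Level 3 definitions above can be read as rough thresholds on the number of steps and tools. The function below is only a heuristic reading of those definitions, with hypothetical names. Per the pipeline described earlier, the step and tool metadata come from human annotators, and this summary does not say that levels are assigned by a rule like this.

```python
def approximate_level(num_steps: int, num_tools: int) -> int:
    """Heuristic mapping from annotated steps and tools to a difficulty level."""
    if num_steps < 0 or num_tools < 0:
        raise ValueError("step and tool counts must be non-negative")
    # Level 1: no tools or at most one tool, and no more than 5 steps.
    if num_tools <= 1 and num_steps <= 5:
        return 1
    # Level 2: up to about 10 steps, typically combining different tools.
    if num_steps <= 10:
        return 2
    # Level 3: arbitrarily long sequences of actions and tool use.
    return 3
```

For example, a question needing 7 steps with a browser and a spreadsheet reader would fall in Level 2 under this reading.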