
GAIA: a benchmark for General AI Assistants

G. Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom
Fundamental AI Research, Meta; HuggingFace; AutoGPT; GenAI, Meta
arXiv.org (2023)
Benchmark Agent MM Reasoning

📝 Paper Summary

Benchmark datasets, Metrics and evaluation, Analysis
GAIA is a benchmark of questions that are conceptually simple for humans yet hard for AI systems to solve, requiring reasoning, tool use, and multi-modality; humans excel on them, while current advanced AI assistants fail significantly.
Core Problem
Current LLM benchmarks (like MMLU) are becoming saturated or target tasks difficult for humans (e.g., law exams), yet models still fail at conceptually simple real-world assistant tasks requiring multi-step planning and tool use.
Why it matters:
  • Evaluating open-ended generation is difficult and prone to bias when using model-based evaluation.
  • Existing benchmarks are prone to memorization and gameability.
  • There is a discrepancy between LLMs passing professional exams and failing basic assistant tasks like finding specific information on the web.
  • Tasks difficult for humans are not necessarily difficult for AI systems, and vice versa.
Concrete Example: Question: 'What was the actual enrollment count of the clinical trial on H. pylori in acne vulgaris patients from Jan-May 2018 as listed on the NIH website?' A model must browse to the NIH site, search for the specific trial, filter by date, and extract the count (90). GPT-4 fails this, while humans succeed easily.
Key Novelty
GAIA (General AI Assistants benchmark)
  • Focuses on questions that are conceptually simple for humans (92% success) but hard for AI (0-30% success), reversing the trend of seeking 'superhuman' difficulty benchmarks.
  • Questions require fundamental abilities: reasoning, multi-modality, web browsing, and tool proficiency, rather than just specialized knowledge.
  • Answers are factual, concise, and unambiguous (numbers, strings, or lists), enabling fast, robust, and exact-match automatic evaluation without model-based judges.
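The exact-match idea above can be sketched in a few lines of Python. This is an illustrative scorer written for this summary, not the benchmark's official implementation: the function names (`is_correct`, `_parse_number`, `_normalize_string`) and the specific normalization rules (ignoring `$`, `%`, thousands separators, case, and punctuation; splitting lists on `,` or `;`) are assumptions about what a reasonable quasi-exact match might look like.

```python
import re
import string


def _is_float(text):
    """Return True if the text parses directly as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def _parse_number(text):
    """Parse a numeric answer, ignoring '$', '%', and ','; None if not numeric."""
    cleaned = text.strip()
    for ch in ("$", "%", ","):
        cleaned = cleaned.replace(ch, "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def _normalize_string(text):
    """Lowercase, drop punctuation, and collapse whitespace (assumed rules)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def _match_item(prediction, truth):
    """Compare one answer item as a number if the truth is numeric, else as a string."""
    if _is_float(truth):
        return _parse_number(prediction) == float(truth)
    return _normalize_string(prediction) == _normalize_string(truth)


def is_correct(prediction, truth):
    """Score a prediction against a ground truth that is a number, string, or list."""
    if not _is_float(truth) and re.search(r"[,;]", truth):
        # Ground truth is a list: require the same length and item-wise matches.
        pred_items = re.split(r"[,;]", prediction)
        truth_items = re.split(r"[,;]", truth)
        if len(pred_items) != len(truth_items):
            return False
        return all(_match_item(p, t) for p, t in zip(pred_items, truth_items))
    return _match_item(prediction, truth)
```

Because the check is deterministic string and number comparison, scoring needs no judge model. The trade-off is that a verbose answer such as "about 90" is marked wrong, so the assistant must be prompted to answer in the expected short format.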
Evaluation Highlights
  • Human respondents achieve 92% accuracy on average, whereas GPT-4 equipped with plugins achieves only 15% on average.
  • Even with plugins, GPT-4 reaches only about 30% on the easiest tasks (Level 1) and 0% on the hardest (Level 3).
  • A human relying on web search alone is slower and less effective on complex queries than a competent AI assistant could in principle be, which highlights the potential utility of solving GAIA.
Breakthrough Assessment
9/10
Proposes a fundamental shift in evaluation philosophy (hard for AI, easy for humans) that exposes the 'stupidity' of current SOTA models on basic tasks. The clear, unambiguous evaluation metric solves a major pain point in agentic evaluation.