Evaluation Setup
AI assistants are prompted zero-shot with a question and, where applicable, an attached file. Models are expected to use tools (web browser, code interpreter) to find the answer.
Benchmarks:
- GAIA (General AI Assistants) [New]
Metrics:
- Success Rate (Exact Match against ground truth)
- Statistical methodology: Not explicitly reported in the paper
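
As a rough illustration of the metric, the sketch below computes a per-level success rate by exact match against the ground truth. The `normalize` helper (lowercasing and whitespace collapsing) and the `(level, prediction, ground_truth)` record layout are assumptions made for this example; the paper's actual answer normalization may differ.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization, not the paper's)."""
    return " ".join(answer.strip().lower().split())


def success_rate_by_level(records):
    """Return {level: percentage of records whose prediction exactly matches the truth}.

    Each record is a (level, prediction, ground_truth) tuple.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for level, prediction, truth in records:
        totals[level] += 1
        if normalize(prediction) == normalize(truth):
            hits[level] += 1
    return {level: 100.0 * hits[level] / totals[level] for level in totals}


# Made-up records for illustration only; these are not GAIA questions.
records = [
    (1, "Paris", "paris"),
    (1, "42", "41"),
    (2, " blue  whale ", "Blue whale"),
    (2, "7", "seven"),
    (3, "unknown", "1987"),
]
rates = success_rate_by_level(records)
```

Note that exact match gives no partial credit: in the made-up data, "7" does not match "seven", so any leniency has to live in the normalization step.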
Key Results
Human performance significantly outperforms all AI systems across all difficulty levels.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GAIA (Level 1) | Success Rate (%) | 92 | 30.3 | -61.7 |
| GAIA (Level 2) | Success Rate (%) | 93 | 9.8 | -83.2 |
| GAIA (Level 3) | Success Rate (%) | 92 | 0.0 | -92.0 |

Augmenting GPT-4 with plugins improves performance over standard GPT-4 and AutoGPT.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GAIA (All Levels) | Success Rate (%) | 9.3 | 14.9 | +5.6 |
| GAIA (Level 2) | Success Rate (%) | 1.5 | 9.8 | +8.3 |
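
The Δ column is the reported score minus the baseline, in percentage points. A minimal sketch that recomputes it from the values listed in the results above; the row labels and the `delta` helper are names chosen for this example only.

```python
# (label, baseline, this_paper) taken from the Key Results rows.
rows = [
    ("Level 1, human baseline", 92.0, 30.3),
    ("Level 2, human baseline", 93.0, 9.8),
    ("Level 3, human baseline", 92.0, 0.0),
    ("All levels, model baseline", 9.3, 14.9),
    ("Level 2, model baseline", 1.5, 9.8),
]


def delta(baseline: float, score: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(score - baseline, 1)


deltas = [delta(baseline, score) for _, baseline, score in rows]

for (label, _, _), d in zip(rows, deltas):
    print(f"{label}: {d:+.1f}")
```

A negative value means the reported system falls short of the baseline (here, human annotators), while a positive value means it exceeds the baseline.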
Main Takeaways
- Tool augmentation (web browsing, code interpretation) is critical: GPT-4 with plugins outperforms both standard GPT-4 and AutoGPT, raising the overall success rate from a 9.3% baseline to 14.9%.
- Current 'autonomous' agents like AutoGPT perform poorly compared to manually guided plugin use and struggle in particular with Level 2 tasks.
- The benchmark effectively stratifies difficulty: the best model's success rate drops from 30.3% on Level 1 to 9.8% on Level 2 and 0.0% on Level 3, while humans stay at 92% to 93% on every level.
- Web search engines alone are insufficient baselines for complex queries, as answers often require synthesizing information from multiple pages or files.