From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

📝 Paper Summary

LLM and Agentic AI Benchmarks · Multi-Agent Collaboration Protocols · Agentic RAG · AI Agent Frameworks

This survey unifies the fragmented landscape of autonomous AI agents by categorizing over 60 benchmarks, reviewing collaboration protocols, and mapping applications from scientific discovery to software engineering.

Core Problem

The rapid evolution of LLMs into autonomous agents has created a fragmented landscape of evaluation benchmarks, frameworks, and protocols that lacks a unified taxonomy.

Why it matters:

Current benchmarks are often outdated or narrow; state-of-the-art models fail on newer reasoning tasks (e.g., <10% on Humanity's Last Exam)
Researchers lack a consolidated view of how agent frameworks (planning, reflection) integrate with specific domain applications like healthcare or materials science
Security vulnerabilities and collaboration failures in multi-agent systems are under-explored in existing literature

Concrete Example: While models achieve >90% on traditional benchmarks like MMLU, they score less than 10% on the Humanity's Last Exam (HLE) benchmark, revealing a massive gap in expert-level reasoning that older metrics hide.

Key Novelty

Unified Taxonomy of Agentic AI (Benchmarks, Protocols, Applications)

Side-by-side comparison of benchmarks from 2019 to 2025, categorized by domain (reasoning, coding, embodied, etc.)
Review of agent-to-agent collaboration protocols including ACP, MCP, and A2A
Detailed mapping of real-world agent applications across scientific domains (e.g., biomedical research, chemical reasoning)

Architecture

Survey Structure Diagram

Evaluation Highlights

State-of-the-art systems achieve only ~7% accuracy on standard ENIGMAEVAL puzzles, highlighting failures in multimodal reasoning
On Humanity's Last Exam (HLE), advanced models like DeepSeek-R1 and Claude 3.5 Sonnet score below 10% and also exhibit significant calibration errors
Agent-as-a-Judge framework reduces evaluation costs to ~2.29% of human costs while achieving 90% alignment with human judgments

Breakthrough Assessment

7/10

A comprehensive and timely survey that organizes a chaotic field. While it doesn't propose a new model, its taxonomy and consolidation of 2024-2025 benchmarks provide a critical roadmap for researchers.

⚙️ Technical Details

Problem Definition

Setting: Survey and taxonomy construction for Large Language Model (LLM) agents and their evaluation ecosystems

Inputs: Literature from 2019–2025 covering benchmarks, frameworks, and protocols

Outputs: Taxonomy of over 60 benchmarks, review of collaboration protocols, and analysis of open challenges

Pipeline Flow

Review of Related Works (Software Engineering, Multi-Agent Systems)
Benchmark Analysis (Tabular comparison of ~60 datasets)
Framework & Application Review (Agent architectures, domain usage)
Protocol Survey (ACP, MCP, A2A)
Challenges Discussion (Security, Reliability)

System Modules

Benchmark Taxonomy (Analysis)

Categorize evaluation datasets

Model or implementation: N/A (Survey)

Protocol Survey (Analysis)

Review communication standards

Model or implementation: N/A (Survey)

Comparison to Prior Work

vs. Wang et al.: This paper covers general benchmarks and broad applications beyond just Software Engineering
vs. Singh et al.: While Singh focuses on RAG, this paper includes protocols and broad reasoning benchmarks
vs. Yan et al.: Extends beyond agent communication to include a large taxonomy of evaluation benchmarks
vs. Yehudai et al.: Provides a more unified treatment combining benchmarks, frameworks, applications, and protocols in one document

Limitations

As a survey, it does not propose a new model or algorithm
The fast pace of the field means some benchmarks might be superseded quickly
Analysis relies on reported results from other papers rather than running new unified experiments

Reproducibility

Not applicable (Survey paper). No code or models released by the authors themselves, though they reference many public benchmarks.

📊 Experiments & Results

Evaluation Setup

Comparative analysis of existing benchmarks and their reported results on SOTA models

Benchmarks:

Humanity's Last Exam (HLE) (Expert-level academic reasoning)
ENIGMAEVAL (Multimodal puzzle solving)
Agent-as-a-Judge (DevAI) (Automated AI development evaluation)
SimpleQA (Factual accuracy on short questions)

Metrics:

Accuracy
Pass rate
Alignment with human judgment
F1 score

Statistical methodology: Not explicitly reported in the paper
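
The metrics listed above can be illustrated with a minimal sketch. It assumes binary pass/fail verdicts and treats "alignment with human judgment" as the fraction of items on which an automated judge and a human agree; that reading, the function names, and the toy data are assumptions for illustration, not definitions taken from the survey.

```python
from typing import Sequence


def agreement(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Fraction of positions where two verdict sequences match.

    With gold labels as `b` this is accuracy; with human verdicts
    as `b` it is an agreement (alignment) rate.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def f1(pred: Sequence[bool], gold: Sequence[bool]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical verdicts on five tasks (True = requirement satisfied).
human = [True, True, False, True, False]
judge = [True, False, False, True, True]

alignment = agreement(judge, human)  # 3 of 5 verdicts match
judge_f1 = f1(judge, human)          # precision = recall = 2/3
```

Pass rate follows the same pattern: it is the share of tasks whose verdict is True.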

Key Results

The survey aggregates performance metrics from various benchmarks to demonstrate the current limitations of LLMs. Accuracy and alignment values are percentages.

Benchmark	Metric	Baseline	This Paper	Δ
ENIGMAEVAL	Accuracy	0	7	+7
Humanity's Last Exam (HLE)	Accuracy	90	10	-80
Agent-as-a-Judge (DevAI)	Alignment with Human Judgment	70	90	+20
Agent-as-a-Judge (DevAI)	Cost ($)	1297.50	30.58	-1266.92
SimpleQA	Accuracy	28.9	42.7	+13.8

Note: in the HLE row, the baseline of 90 is the accuracy models reach on MMLU (>90%), and 10 is the upper bound on their HLE accuracy (<10%).
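
As a quick arithmetic check, the Δ column and the relative cost of Agent-as-a-Judge can be recomputed from the table values. This is only a sketch over the numbers printed above; the variable names are illustrative.

```python
# (benchmark, metric, baseline, reported) copied from the table above.
rows = [
    ("ENIGMAEVAL", "Accuracy", 0.0, 7.0),
    ("Humanity's Last Exam (HLE)", "Accuracy", 90.0, 10.0),
    ("Agent-as-a-Judge (DevAI)", "Alignment with Human Judgment", 70.0, 90.0),
    ("Agent-as-a-Judge (DevAI)", "Cost ($)", 1297.50, 30.58),
    ("SimpleQA", "Accuracy", 28.9, 42.7),
]

# Delta = reported value minus baseline, rounded to two decimals.
deltas = [round(reported - baseline, 2) for _, _, baseline, reported in rows]

# Agent-as-a-Judge cost as a fraction of the human evaluation cost.
human_cost, agent_cost = rows[3][2], rows[3][3]
cost_ratio = agent_cost / human_cost

for (name, metric, _, _), delta in zip(rows, deltas):
    print(f"{name:32s} {metric:32s} {delta:+.2f}")
print(f"Agent cost / human cost = {cost_ratio:.2%}")
```

The dollar figures in the table give a ratio of roughly 2.4%, which is in the same range as the ~2.29% quoted in the highlights; the summary does not state on what basis the quoted figure was computed.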

Main Takeaways

Gap between General and Expert Reasoning: While models exceed 90% on MMLU, they score below 10% on expert-curated benchmarks like HLE and about 7% on multimodal puzzles like ENIGMAEVAL.
Evaluation Innovation: New frameworks like Agent-as-a-Judge are critical, offering 90% human alignment at ~2% of the cost, addressing the scalability bottleneck of human oversight.
Fragility of Tool Use: Benchmarks like ComplexFuncBench and SimpleQA reveal that models still struggle with multi-step function calling (premature termination) and factual grounding (hallucination).
Emergence of Protocols: The field is moving toward standardized protocols (ACP, MCP) to manage the increasing complexity of multi-agent collaboration.
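
To make the protocol point concrete, here is a minimal sketch of what a tool invocation looks like on the wire in MCP, which frames its messages as JSON-RPC 2.0. The summary itself does not describe the message format; the `tools/call` method and the `name`/`arguments` parameters follow the public MCP specification, while the helper `make_tool_call`, the tool name `search_papers`, and its arguments are hypothetical.

```python
import json


def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 request asking an MCP server to run a tool."""
    message = {
        "jsonrpc": "2.0",        # JSON-RPC protocol version
        "id": request_id,        # lets the client match the response
        "method": "tools/call",  # MCP method for invoking a tool
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical tool and arguments, for illustration only.
request = make_tool_call(1, "search_papers", {"query": "agent benchmarks"})
decoded = json.loads(request)
print(request)
```

A real client would send this string over a transport such as stdio or HTTP and wait for a response carrying the same `id`.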

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM capabilities (reasoning, generation)
Familiarity with autonomous agent concepts (planning, tool use, memory)
Basic knowledge of evaluation metrics (accuracy, F1, pass@k)
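
For readers less familiar with pass@k, the sketch below uses the commonly cited unbiased estimator: with n samples per problem of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The survey summary does not say which estimator the individual benchmarks use, so treat this as background rather than the paper's method.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without
    replacement from n generated samples (c of them correct) passes."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples for one problem, 3 of them correct.
p1 = pass_at_k(n=10, c=3, k=1)  # 1 - 7/10
p5 = pass_at_k(n=10, c=3, k=5)  # 1 - 21/252
```

Benchmark-level pass@k is then the mean of this value over all problems.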

Key Terms

Agentic RAG: Retrieval-Augmented Generation systems enhanced with autonomous agents that perform reflection, planning, and multi-step reasoning

MMLU: Massive Multitask Language Understanding, a benchmark testing zero-shot and few-shot performance across 57 subjects

MCP: Model Context Protocol, a standard for connecting AI assistants to the systems where data lives

ACP: Agent Communication Protocol—mechanisms allowing distinct agents to exchange messages and coordinate

Humanity's Last Exam (HLE): A 2025 benchmark with 3,000 expert-level questions designed to be resistant to simple retrieval, on which SOTA models score below 10%

Agent-as-a-Judge: An evaluation framework where an AI agent evaluates the outputs of other agents, offering granular feedback at a lower cost than human evaluation

ProcessBench: A benchmark for detecting errors in the reasoning steps of mathematical problem solving

Fact Grounding: The ability of an LLM to base its responses strictly on provided source documents, minimizing hallucination

Hallucination: When an LLM generates plausible-sounding but factually incorrect information