The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems

📝 Paper Summary

AI Governance and Transparency · AI Agent Evaluation · Deployed Agentic Systems

The 2025 AI Agent Index systematically documents 30 high-impact deployed agentic systems across six categories to reveal critical gaps in transparency, safety practices, and evaluation standards.

Core Problem

Despite rapid deployment and economic investment in agentic AI, the ecosystem remains opaque, with little public information available to researchers and policymakers regarding system capabilities, development processes, and safety guardrails.

Why it matters:

Policymakers lack data on who is developing impactful systems and what risks they pose, hindering effective regulation
Researchers struggle to track rapid evolution in the agent ecosystem due to inconsistent documentation
Highly capable agents present unique risks (e.g., direct harm via tool use) that are distinct from those of chat-based systems, yet safety practices remain obscure

Concrete Example: Chatbots typically cause harm only when users act on their outputs, whereas agentic systems can execute actions directly, such as autonomously hacking websites. Yet most developers share little information about the guardrails that prevent these specifically autonomous risks.

Key Novelty

The 2025 AI Agent Index

Systematically annotates 30 state-of-the-art deployed agents across 45 distinct fields covering legal, technical, autonomy, ecosystem, evaluation, and safety dimensions
Introduces rigorous inclusion criteria combining agency definitions (autonomy, goal complexity, environmental interaction, generality) with real-world impact metrics (search volume, market cap)
Reveals ecosystem-wide trends by analyzing transparency levels and development practices across three distinct agent types: chat applications, browser agents, and enterprise workflows

Evaluation Highlights

Indexed 30 highly agentic products selected from 95 candidates based on strict agency and impact criteria
Annotated 45 distinct information fields per system, revealing that most developers share minimal information on safety and societal impact
Recorded a 23% response rate from companies contacted for verification, lower than for the previous year's index

Breakthrough Assessment

9/10

A critical resource for the field. While not a technical architecture paper, it establishes the standard for documenting and analyzing the rapidly growing landscape of deployed AI agents.

⚙️ Technical Details

Problem Definition

Setting: Systematic documentation and analysis of deployed agentic AI systems

Inputs: Publicly available information (documentation, blogs, demos) and developer correspondence

Outputs: Structured annotations across 6 categories for 30 agentic systems
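The output described above can be pictured as one record per system, grouped by the six annotation categories. The following is a minimal sketch, not the Index's actual schema: the class name `AgentAnnotation`, its methods, and any field names passed to it are hypothetical, and a value of `None` is assumed to mean "no public information found".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# The six annotation categories named in the summary.
CATEGORIES = ("legal", "technical", "autonomy", "ecosystem", "evaluation", "safety")


@dataclass
class AgentAnnotation:
    """Hypothetical record: category -> {field name -> value, or None if undisclosed}."""

    system_name: str
    fields: Dict[str, Dict[str, Optional[str]]] = field(
        default_factory=lambda: {c: {} for c in CATEGORIES}
    )

    def set_field(self, category: str, name: str, value: Optional[str]) -> None:
        """Store one annotated field under a known category."""
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        self.fields[category][name] = value

    def field_count(self) -> int:
        """Total number of annotated fields across all categories."""
        return sum(len(entries) for entries in self.fields.values())

    def undisclosed(self) -> List[Tuple[str, str]]:
        """(category, field) pairs for which no public information was recorded."""
        return [
            (category, name)
            for category, entries in self.fields.items()
            for name, value in entries.items()
            if value is None
        ]
```

In a full record, `field_count()` would be expected to reach the 45 fields the Index annotates per system; the sketch does not enforce that number.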

Pipeline Flow

Candidate Identification (LLM search + expert consultation)
Screening (Agency + Impact + Practicality criteria)
Annotation (Manual review of 45 fields by experts)
Verification (Developer correspondence + GPT-5.2 screening)
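The four stages listed above form a simple filter-then-enrich chain. The sketch below shows only that control flow; in the paper the stages are largely manual, so every callable here (`passes_screening`, `annotate`, `verify`) is a hypothetical placeholder for human work, and the function name `run_index_pipeline` is an assumption made for illustration.

```python
from typing import Callable, Dict, Iterable


def run_index_pipeline(
    candidates: Iterable[str],
    passes_screening: Callable[[str], bool],
    annotate: Callable[[str], dict],
    verify: Callable[[str, dict], dict],
) -> Dict[str, dict]:
    """Chain screening, annotation, and verification over identified candidates.

    `candidates` stands for the output of the candidate identification stage.
    """
    index: Dict[str, dict] = {}
    for name in candidates:
        # Stage 2: drop candidates that fail the inclusion criteria.
        if not passes_screening(name):
            continue
        # Stage 3: produce a draft annotation for the surviving system.
        draft = annotate(name)
        # Stage 4: fold in corrections from verification.
        index[name] = verify(name, draft)
    return index


if __name__ == "__main__":
    # Toy stand-ins for the manual stages; names and values are illustrative only.
    toy_index = run_index_pipeline(
        ["agent_a", "agent_b"],
        passes_screening=lambda name: name != "agent_b",
        annotate=lambda name: {"name": name, "verified": False},
        verify=lambda name, draft: {**draft, "verified": True},
    )
    print(toy_index)
```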

System Modules

Candidate Identification (Selection)

Surface potential agents for inclusion

Model or implementation: LLM-based research queries

Screening (Selection)

Filter candidates based on strict inclusion criteria

Model or implementation: Manual review against criteria
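Although screening is done by manual review, the inclusion logic (agency, impact, and practicality) can be written down as a predicate to make the structure explicit. This is a sketch under stated assumptions: the thresholds, the `publicly_documented` stand-in for practicality, and the choices to require all four agency properties and to accept either impact signal are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the summary does not state the actual cutoffs.
MIN_MONTHLY_SEARCH_VOLUME = 10_000
MIN_MARKET_CAP_USD = 1_000_000_000.0


@dataclass
class Candidate:
    name: str
    # The four agency properties named in the summary.
    autonomy: bool
    goal_complexity: bool
    environmental_interaction: bool
    generality: bool
    # Impact proxies named in the summary (units are assumptions).
    monthly_search_volume: int
    market_cap_usd: float
    # Hypothetical stand-in for the "practicality" criterion.
    publicly_documented: bool


def meets_agency(c: Candidate) -> bool:
    """Assumes all four agency properties must hold."""
    return all(
        (c.autonomy, c.goal_complexity, c.environmental_interaction, c.generality)
    )


def meets_impact(c: Candidate) -> bool:
    """Assumes either impact signal is sufficient."""
    return (
        c.monthly_search_volume >= MIN_MONTHLY_SEARCH_VOLUME
        or c.market_cap_usd >= MIN_MARKET_CAP_USD
    )


def passes_screening(c: Candidate) -> bool:
    """Agency AND impact AND practicality, mirroring the screening stage."""
    return meets_agency(c) and meets_impact(c) and c.publicly_documented
```

Writing the criteria this way also makes the limitation about open-source projects concrete: a system that satisfies every agency property can still be excluded purely by the impact test.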

Annotation

Extract detailed information across 6 categories

Model or implementation: Human subject matter experts

Verification

Validate annotations and solicit developer feedback

Model or implementation: Human outreach + GPT-5.2 automated screening

Novel Architectural Elements

Integration of automated LLM-based screening (GPT-5.2) to validate human annotations for large-scale qualitative datasets

Comparison to Prior Work

vs. 2024 AI Agent Index: Significantly stricter inclusion criteria focusing on high-impact deployed systems (30 systems vs. a larger set), deeper annotation (45 fields), and new categories (ecosystem interaction)
vs. Foundation Model Transparency Index: Focuses specifically on agentic systems and their unique properties (autonomy, tools) rather than base model features
vs. Princeton Holistic Agentic Leaderboard: Focuses on qualitative documentation of features, safety, and transparency rather than quantitative performance benchmarking
vs. AIAgentList.com: Provides deep, verified annotations and safety analysis for a curated set rather than a broad unverified list

Limitations

Relies primarily on public information, which may be incomplete or outdated
Low response rate (23%) from developers limits verification of non-public details
Selection criteria may bias against open-source projects with lower search volume or funding
Rapid evolution of the field means the Index captures a snapshot in time (cutoff Dec 31, 2025)

Reproducibility

Code: https://aiagentindex.mit.edu

The full Index is available at https://aiagentindex.mit.edu. The methodology, inclusion criteria, and field definitions are fully documented in the paper. The dataset itself is the primary artifact.

📊 Experiments & Results

Evaluation Setup

Qualitative analysis and structured annotation of 30 deployed AI agents

Benchmarks:

Transparency Assessment (Evaluation of public disclosure) [New]

Metrics:

Presence of safety documentation
Disclosure of evaluation methods
Transparency regarding limitations
Developer response rate
Statistical methodology: Not explicitly reported in the paper
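Since the paper reports no explicit statistical methodology, the presence-style metrics above are most naturally summarized as simple proportions. The sketch below shows one way to compute them; the function name `disclosure_rates`, the metric keys, and the toy data are hypothetical, and a missing key is assumed to mean "not disclosed".

```python
from typing import Dict


def disclosure_rates(annotations: Dict[str, Dict[str, bool]]) -> Dict[str, float]:
    """Fraction of systems that disclose each metric.

    `annotations` maps system name -> {metric name -> disclosed?}.
    A metric missing from a system's record is counted as not disclosed.
    """
    systems = list(annotations.values())
    if not systems:
        return {}
    metrics = sorted({metric for record in systems for metric in record})
    return {
        metric: sum(record.get(metric, False) for record in systems) / len(systems)
        for metric in metrics
    }


if __name__ == "__main__":
    # Toy data for illustration only; not results from the Index.
    toy = {
        "system_a": {"safety_documentation": True, "evaluation_methods": True},
        "system_b": {"safety_documentation": False, "evaluation_methods": True},
        "system_c": {"safety_documentation": False, "evaluation_methods": False},
        "system_d": {"safety_documentation": False},
    }
    for metric, rate in disclosure_rates(toy).items():
        print(f"{metric}: {rate:.0%}")
```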

Experiment Figures

The rapid increase in research papers mentioning 'AI Agent' or 'Agentic AI' from 2020 to 2025

Main Takeaways

Transparency varies significantly: Developers are generally opaque about safety testing, evaluations, and societal impact assessments.
Safety documentation is scarce: Most developers share little information about specific guardrails or fail-safe mechanisms for autonomous agents.
Three dominant paradigms emerged: Chat applications with tools, Browser-based agents, and Enterprise workflow agents.
Verification challenges: Only 4 out of 30 companies provided substantive feedback, indicating a reluctance to engage with external transparency efforts.

📚 Prerequisite Knowledge

Prerequisites

Understanding of AI agents vs. standard LLMs
Familiarity with AI governance concepts (transparency, model cards, safety evaluation)
Basic knowledge of agent interaction paradigms (tool use, browser agents)

Key Terms

Agentic AI: Systems capable of pursuing complex goals with limited human oversight, often using tools and planning

Browser-based agents: Agents that primarily interact with web browsers or computer interfaces to perform tasks

Enterprise workflow agents: Business platforms allowing creation of agents to automate specific professional workflows

Autonomy Level 2: A level where the user and agent collaboratively plan, delegate, and execute, but the agent performs the majority of tasks independently

Goal complexity: The ability to pursue high-level objectives through long-term planning and sub-goal decomposition

Environmental interaction: The ability to directly change the state of the world through tools/APIs (e.g., writing files, sending emails)

Generality: The ability to handle under-specified instructions and adapt to new tasks rather than performing a single narrow function

Transparency Index: A framework for assessing how much information developers publicly disclose about their systems