General Agent Evaluation

📝 Paper Summary

Agentic Evaluation, General-purpose Agents, Benchmark Integration

Exgentic introduces a Unified Protocol and framework to evaluate general-purpose agents across diverse environments without domain-specific engineering, revealing that underlying LLM quality drives performance more than agent architecture.

Core Problem

Current agent benchmarks impose bespoke communication protocols and implicit domain assumptions, preventing the fair evaluation of general-purpose agents that lack pre-engineered integration.

Why it matters:

Real-world settings require agents to deploy scalably across heterogeneous environments without manual customization for each new domain
Existing benchmarks (like SWE-Bench) rely on specific integration hacks (e.g., pre-cloned repos) that obscure true agent capabilities
Current consolidation efforts (BrowserGym, Harbor) enforce single modalities (web or CLI), testing only a diminished version of the agent

Concrete Example: SWE-Bench assumes that a human integrator has already cloned the repository and will handle submission. A general agent that attempts the task without this knowledge fails because it does not know that it must output a patch file in a specific format, or that the repository is already present at a specific path.

Key Novelty

Unified Protocol and Exgentic Framework

Introduces a mediation layer that standardizes communication between any agent and any benchmark using a canonical (Task, Context, Actions) representation
Decouples evaluation from domain-specific protocols by translating benchmark-specific signals (like specialized tool calls) into a generic format agents can ingest
Establishes the first Open General Agent Leaderboard, evaluating 5 agents across 6 diverse environments without environment-specific tuning

Architecture

Conceptual diagram comparing pairwise integration (A), single-protocol consolidation (B), and the Unified Protocol (C).

Evaluation Highlights

General-purpose agents demonstrate cross-domain generalization comparable to domain-specific baselines without tuning
Agent performance is primarily dictated by the underlying Language Model (e.g., GPT-4 vs. Claude 3.5) rather than the agentic scaffold (ReAct vs. Solo)
Evaluation of 5 agent architectures across 6 benchmarks (SWE-Bench, τ-Bench, AppWorld, etc.) totaled $22K in API costs

Breakthrough Assessment

8/10

Significant infrastructure contribution. Solves the fragmentation problem in agent evaluation by providing a unified protocol, enabling the first true 'general agent' comparison across radically different domains.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of general-purpose agents A on a set of diverse benchmark environments E, where A must solve tasks t ∈ E without prior knowledge of E's semantics or protocols.

Inputs: Task description T, Context C, and Set of Actions A (canonicalized via Unified Protocol)

Outputs: Sequence of actions a_i leading to a final answer or environment state change
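As a rough illustration of the inputs above, the canonical (Task, Context, Actions) triple could be modeled as plain data classes. This is a minimal sketch, not the framework's actual schema: the class names `Action` and `Observation`, the field names, and the example values are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Action:
    """One callable the agent may invoke (hypothetical schema)."""
    name: str
    description: str
    # Maps argument name to a type label; purely illustrative.
    parameters: dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Observation:
    """Canonical (Task, Context, Actions) triple handed to an agent."""
    task: str
    context: dict[str, Any]
    actions: tuple[Action, ...]


def action_names(obs: Observation) -> list[str]:
    """Return the names of the actions available in this observation."""
    return [a.name for a in obs.actions]


# Hypothetical example loosely modeled on a bug-fixing task.
example = Observation(
    task="Fix the failing unit test in the repository.",
    context={"repo_path": "/workspace/repo"},
    actions=(
        Action("run_shell", "Run a shell command", {"command": "string"}),
        Action("submit_patch", "Submit a unified diff", {"diff": "string"}),
    ),
)

print(action_names(example))  # ['run_shell', 'submit_patch']
```

Carrying facts such as the repository path in the context field is one way to make explicit the assumptions that a benchmark otherwise leaves implicit.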

Pipeline Flow

Benchmark Adapter (translates specific env to Unified Protocol)
Orchestrator (manages session loop)
Agent Adapter (translates Unified Protocol to specific Agent API)
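The three-stage flow above can be sketched as a small session loop. This is a toy sketch under stated assumptions: `CountingBenchmark`, `GreedyAgent`, and `run_session` are hypothetical names invented for illustration, and the dictionary layout of the observation is assumed rather than taken from the framework.

```python
from __future__ import annotations


class CountingBenchmark:
    """Toy benchmark adapter: the task succeeds when a counter hits a target."""

    def __init__(self, target: int) -> None:
        self.target = target
        self.count = 0

    def _observe(self) -> dict:
        # Translate internal state into a generic Task / Context / Actions view.
        return {
            "task": "Raise the counter to the target, then finish.",
            "context": {"count": self.count, "target": self.target},
            "actions": ["increment", "finish"],
        }

    def reset(self) -> dict:
        self.count = 0
        return self._observe()

    def step(self, action: str) -> tuple[dict, bool]:
        if action == "increment":
            self.count += 1
        done = action == "finish"
        return self._observe(), done

    def success(self) -> bool:
        return self.count == self.target


class GreedyAgent:
    """Toy agent adapter: increments until the target is reached."""

    def act(self, observation: dict) -> str:
        ctx = observation["context"]
        return "finish" if ctx["count"] >= ctx["target"] else "increment"


def run_session(benchmark, agent, max_steps: int = 20) -> dict:
    """Orchestrator: shuttle observations and actions, enforcing a step limit."""
    observation = benchmark.reset()
    steps, done = 0, False
    while not done and steps < max_steps:
        action = agent.act(observation)
        observation, done = benchmark.step(action)
        steps += 1
    return {"success": done and benchmark.success(), "steps": steps}


result = run_session(CountingBenchmark(target=3), GreedyAgent())
print(result)  # {'success': True, 'steps': 4}
```

In this sketch the orchestrator never inspects benchmark internals or agent internals; it only passes the generic observation one way and the chosen action the other.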

System Modules

Benchmark Adapter (Translation Layer)

Converts benchmark-specific state/goals into generic Task, Context, and Actions

Model or implementation: Deterministic code (Python)

Orchestrator

Manages the interaction loop, enforcing limits and facilitating data flow

Model or implementation: Python Runtime

Agent Adapter (Translation Layer)

Maps Unified Protocol signals to the agent's native API (MCP, OpenAI Tools, etc.)

Model or implementation: Deterministic code (Python)
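To make the Agent Adapter's role concrete, the sketch below renders one canonical action in the JSON shape commonly used for OpenAI-style function tools. The input dictionary layout and the function name `to_openai_tool` are assumptions for illustration; the framework's real adapter code may differ.

```python
from __future__ import annotations


def to_openai_tool(action: dict) -> dict:
    """Render a canonical action (hypothetical layout) as a function-tool spec.

    `action["parameters"]` is assumed to map argument names to JSON Schema
    type names such as "string" or "integer".
    """
    properties = {
        name: {"type": type_name}
        for name, type_name in action["parameters"].items()
    }
    return {
        "type": "function",
        "function": {
            "name": action["name"],
            "description": action["description"],
            "parameters": {
                "type": "object",
                "properties": properties,
                # For simplicity, this sketch treats every argument as required.
                "required": sorted(properties),
            },
        },
    }


tool = to_openai_tool(
    {
        "name": "run_shell",
        "description": "Run a shell command",
        "parameters": {"command": "string"},
    }
)
print(tool["function"]["name"])  # run_shell
```

An adapter for a different agent API would consume the same canonical action and emit that API's own tool description, which is what keeps the translation deterministic.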

Novel Architectural Elements

Unified Protocol 'Narrow Waist' design: A canonical intermediate representation (Task, Context, Actions) that decouples M agents from N benchmarks, reducing integration complexity from M*N to M+N.
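The M*N versus M+N claim can be checked with simple arithmetic. The sketch below uses the leaderboard's 5 agents and 6 benchmarks; the function names are illustrative only.

```python
def pairwise_integrations(num_agents: int, num_benchmarks: int) -> int:
    """Bespoke integrations needed when every agent is wired to every benchmark."""
    return num_agents * num_benchmarks


def narrow_waist_adapters(num_agents: int, num_benchmarks: int) -> int:
    """Adapters needed when each side only targets the shared protocol."""
    return num_agents + num_benchmarks


print(pairwise_integrations(5, 6))  # 30
print(narrow_waist_adapters(5, 6))  # 11
```

Adding a seventh benchmark would require five new pairwise integrations but only one new adapter.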

Modeling

Base Model: Evaluated on GPT-4o (GPT 5.2 in paper text), Claude 3 Opus (4.5 in paper text), Gemini 1.5 Pro (Gemini 3 Pro in paper text). Note: The paper uses names such as 'GPT 5.2', which may be placeholders or forward-looking labels; the context implies current frontier models.

Comparison to Prior Work

vs. BrowserGym: Exgentic supports ANY interface (CLI, API, Web) via Unified Protocol, not just web interactions
vs. Harbor: Exgentic abstracts the interface so agents don't need to know they are in a CLI, enabling tool-use agents to work on CLI tasks via adapters
vs. AutoGen [not cited in paper]: AutoGen focuses on multi-agent orchestration; Exgentic focuses on the evaluation interface between agents and environments

Limitations

Evaluation costs are high ($22K for the leaderboard runs)
Success depends heavily on the quality of the Benchmark Adapter's translation of implicit assumptions
Unified Protocol may mask specific nuances of highly specialized environments if not mapped correctly

Reproducibility

Code: https://www.exgentic.ai

The code is publicly available (https://www.exgentic.ai). The framework, leaderboards, and protocol definitions are released. The paper explicitly lists the cost ($22K) and setup for the leaderboard.

📊 Experiments & Results

Evaluation Setup

Benchmarking 5 agent architectures across 6 diverse environments using 3 frontier LLMs.

Benchmarks:

BrowseComp+ (Deep research / Information seeking)
τ-Bench (Tau-Bench) (Customer service / Policy compliance)
SWE-Bench Verified (Software engineering / Bug fixing)
AppWorld (Digital user assistance / Day-to-day tasks)

Metrics:

Success Rate
Cost per Task
Average Steps
Statistical methodology: Not explicitly reported in the paper
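The three metrics above could be aggregated from per-task records roughly as follows. This is a sketch with made-up numbers: the `TaskRun` record, its field names, and the sample values are hypothetical and are not results from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskRun:
    """Outcome of one agent attempt on one task (hypothetical record)."""
    success: bool
    cost_usd: float
    steps: int


def summarize(runs: list[TaskRun]) -> dict[str, float]:
    """Compute success rate, mean cost per task, and mean steps per task."""
    if not runs:
        raise ValueError("at least one run is required")
    n = len(runs)
    return {
        "success_rate": sum(r.success for r in runs) / n,
        "cost_per_task": sum(r.cost_usd for r in runs) / n,
        "average_steps": sum(r.steps for r in runs) / n,
    }


# Illustrative values only; not taken from the leaderboard.
runs = [
    TaskRun(True, 0.5, 10),
    TaskRun(False, 1.5, 20),
    TaskRun(True, 1.0, 12),
    TaskRun(True, 1.0, 6),
]
print(summarize(runs))
# {'success_rate': 0.75, 'cost_per_task': 1.0, 'average_steps': 12.0}
```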

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
SWE-Bench Verified	Success Rate	Not reported in the paper	Not reported in the paper	Not reported in the paper
Overall Leaderboard	Total Cost (USD)	Not reported in the paper	22,000	Not reported in the paper

General agents show that performance is driven more by the underlying model than the agent framework.
Cost efficiency analysis reveals large variances between agent implementations for similar tasks.

Experiment Figures

Likely a spider chart (inferred from the description of 'multidimensional analysis' and the leaderboards) comparing agent performance across different benchmarks.

Main Takeaways

General agents can generalize across diverse environments (coding, web, customer service) without environment-specific tuning.
Performance is primarily determined by the underlying LLM capability rather than the specific agentic scaffold (e.g., ReAct vs. SmolAgent).
Different agent scaffolds exhibit comparable performance levels but vary significantly in cost per task.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Agentic workflows (ReAct, Tool-use)
Familiarity with standard agent benchmarks (SWE-Bench, WebArena/BrowserGym)
Knowledge of LLM APIs (OpenAI, Anthropic, MCP)

Key Terms

Unified Protocol: A standard interface defining Task, Context, and Actions to mediate between diverse agents and benchmarks

MCP: Model Context Protocol—a standard for connecting AI assistants to systems and data

Scaffold: The software engineering framework wrapping an LLM to enable agentic behaviors (memory, tool use, planning)

Exgentic: The proposed evaluation harness that implements the Unified Protocol

Tool shortlisting: A technique to filter the available action space to a manageable subset for the LLM

Zero-shot generalization: The ability of an agent to perform tasks in an unseen environment without domain-specific fine-tuning or prompt engineering

ReAct: Reason+Act—a prompting paradigm where models generate reasoning traces before executing actions

τ-Bench: A benchmark evaluating customer service agents in retail/airline domains, focusing on policy compliance

SWE-Bench Verified: A subset of SWE-Bench containing human-validated software engineering tasks (bug fixes)

BrowserGym: A framework consolidating web-based agent benchmarks

AppWorld: A benchmark for day-to-day digital user-assistance tasks involving multiple apps