Agent-as-a-Judge: Evaluate Agents with Agents

Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber

Meta AI, King Abdullah University of Science and Technology

arXiv (2024)

Agent, Benchmark, RL

📝 Paper Summary

Agentic system evaluation, Code generation agents, Automated AI development

Agent-as-a-Judge uses agentic systems to evaluate other agents by analyzing full intermediate trajectories, providing richer feedback than final-outcome metrics or static LLM judges.

Core Problem

Current evaluations for agentic systems rely on final outcomes (ignoring step-by-step logic) or expensive human labor, failing to provide the intermediate feedback necessary for self-improvement.

Why it matters:

Benchmarks like SWE-Bench rely on final resolve rates, missing internal process failures that affect performance
Human evaluation provides rich feedback but is prohibitively expensive and unscalable for rapid agent iteration
Standard LLM-as-a-Judge approaches lack the tooling (file reading, code execution) to verify complex, multi-step agent trajectories

Concrete Example: In a coding task, a developer agent might fail because of a small dependency error mid-process. A standard 'pass/fail' metric only reports failure at the end, while Agent-as-a-Judge can identify the specific file and step where the dependency was missed.

Key Novelty

Agent-as-a-Judge Framework & DevAI Benchmark

Extends LLM-as-a-Judge by equipping the evaluator with agentic tools (graph construction, file execution, locating code) to verify intermediate steps, not just final text
Introduces DevAI, a benchmark of 55 realistic AI development tasks with 365 hierarchical requirements, designed to test full development cycles rather than isolated snippets

Architecture

Figure 6: The modular architecture of the Agent-as-a-Judge system.

Evaluation Highlights

Agent-as-a-Judge aligns with human consensus 90% of the time, significantly outperforming LLM-as-a-Judge (70%) on the DevAI benchmark
Reduces evaluation time by 97.72% and cost by 97.64% compared to a panel of three human experts
Leading agents (GPT-Pilot, OpenHands) only satisfy ~29% of requirements in DevAI, indicating the benchmark provides a significant challenge

Breakthrough Assessment

8/10

Strong proof-of-concept for scalable, high-quality agent evaluation. The release of DevAI and the cost/performance analysis against human judges make it a valuable contribution to the agentic workflow ecosystem.

⚙️ Technical Details

Problem Definition

Setting: Automated evaluation of code-generating agents on multi-step AI development tasks

Inputs: Agent trajectories (thought processes, actions), generated workspaces (code files), and hierarchical task requirements

Outputs: Binary verification of requirement satisfaction and detailed feedback on specific failures
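
The inputs and outputs listed above could be represented with a few small record types. This is a minimal sketch: every class name, field name, and file path below (`Requirement`, `TrajectoryStep`, `Judgment`, `src/data_loader.py`) is hypothetical and chosen for illustration, not taken from the released code.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    """One entry of a task's hierarchical requirement list (hypothetical schema)."""
    req_id: int
    criteria: str
    prerequisites: list[int] = field(default_factory=list)


@dataclass
class TrajectoryStep:
    """A single logged step of the developer agent (hypothetical schema)."""
    step: int
    thought: str
    action: str


@dataclass
class Judgment:
    """Judge output: a binary verdict plus free-text feedback."""
    req_id: int
    satisfied: bool
    feedback: str


def satisfied_fraction(judgments: list[Judgment]) -> float:
    """Return the fraction of requirements judged satisfied."""
    if not judgments:
        return 0.0
    return sum(j.satisfied for j in judgments) / len(judgments)


# Illustrative verdicts for a two-requirement task.
judgments = [
    Judgment(0, True, "Dataset loading code found in src/data_loader.py."),
    Judgment(1, False, "No saved model file in the workspace."),
]
print(satisfied_fraction(judgments))  # 0.5
```

Keeping the feedback string next to the binary verdict mirrors the two outputs described above: a pass/fail decision per requirement and an explanation of the specific failure.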

Pipeline Flow

Graph Construction (analyzes project structure)
Information Gathering (Locate → Read → Retrieve)
Verification (Ask)
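
The three stages above can be sketched end to end as follows. This is an illustration under strong assumptions, not the authors' implementation: each stage is replaced by a simple heuristic (file-name matching and a substring check stand in for the LLM calls), and all function names are hypothetical.

```python
import tempfile
from pathlib import Path


def build_graph(workspace: Path) -> list[Path]:
    """Stage 1: index the project structure (here, a sorted list of files)."""
    return sorted(p for p in workspace.rglob("*") if p.is_file())


def locate(files: list[Path], requirement: str) -> list[Path]:
    """Stage 2a: keep files whose name is mentioned in the requirement."""
    words = {w.strip(".,") for w in requirement.split()}
    return [f for f in files if f.name in words]


def read(files: list[Path]) -> str:
    """Stage 2b: load the contents of the located files."""
    return "\n".join(f.read_text() for f in files)


def retrieve(trajectory: list[str], files: list[Path]) -> list[str]:
    """Stage 2c: keep trajectory entries that mention a located file."""
    names = {f.name for f in files}
    return [step for step in trajectory if any(n in step for n in names)]


def ask(evidence: str, must_contain: str) -> bool:
    """Stage 3: substring check standing in for the LLM verdict."""
    return must_contain in evidence


def judge(workspace: Path, trajectory: list[str],
          requirement: str, must_contain: str) -> bool:
    """Run Graph -> Locate -> Read -> Retrieve -> Ask for one requirement."""
    files = build_graph(workspace)
    located = locate(files, requirement)
    evidence = "\n".join([read(located), *retrieve(trajectory, located)])
    return ask(evidence, must_contain)


# Build a throwaway workspace so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    (ws / "train.py").write_text("for epoch in range(10):\n    pass\n")
    (ws / "README.md").write_text("demo project\n")
    log = ["created README.md", "wrote train.py with a 10-epoch loop"]
    verdict = judge(ws, log,
                    "The training loop is implemented in train.py.",
                    "for epoch")

print(verdict)  # True
```

The point of the sketch is the ordering: evidence is gathered from both the workspace and the trajectory before any verdict is issued.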

System Modules

Graph

Constructs a dependency graph of the project structure, files, and modules

Model or implementation: gpt-4o-2024-05-13
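
To make the idea of a project dependency graph concrete, here is a small static-analysis sketch using Python's standard `ast` module. It is a stand-in technique, not the module described above (which the summary lists as backed by gpt-4o); the function name `import_graph` and the sample sources are assumptions for illustration.

```python
import ast


def import_graph(sources: dict[str, str]) -> dict[str, set[str]]:
    """Map each project module to the set of project modules it imports."""
    project_modules = set(sources)
    graph: dict[str, set[str]] = {}
    for name, code in sources.items():
        deps: set[str] = set()
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.Import):
                deps.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                deps.add(node.module.split(".")[0])
        # Drop third-party and standard-library imports.
        graph[name] = deps & project_modules
    return graph


# Hypothetical three-file workspace, given as module name -> source text.
sources = {
    "train": "import model\nimport numpy as np\n",
    "model": "from utils import seed\n",
    "utils": "import os\n",
}
graph = import_graph(sources)
print(graph)
```

Here `train` depends on `model`, `model` depends on `utils`, and `utils` has no in-project dependencies; a judge could use such a graph to decide which files to inspect for a given requirement.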

Locate (Information Retrieval)

Identifies specific folders or files relevant to a requirement

Model or implementation: gpt-4o-2024-05-13

Read (Information Retrieval)

Parses and understands multimodal data (code, images, PDFs) from identified files

Model or implementation: gpt-4o-2024-05-13

Retrieve (Information Retrieval)

Extracts relevant segments from the agent's historical trajectory (logs/actions)

Model or implementation: gpt-4o-2024-05-13
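
One simple way to picture trajectory retrieval is word-overlap ranking between the requirement and each logged step. The sketch below is an assumption-laden illustration (the function name, the scoring rule, and the sample log are all hypothetical), not the retrieval method used in the paper.

```python
def retrieve(trajectory: list[str], requirement: str, top_k: int = 2) -> list[str]:
    """Return up to top_k trajectory entries sharing the most words with the requirement."""
    query = set(requirement.lower().split())
    scored = [
        (len(query & set(step.lower().split())), index, step)
        for index, step in enumerate(trajectory)
    ]
    # Keep only entries with some overlap; best score first, ties by log order.
    hits = sorted((t for t in scored if t[0] > 0), key=lambda t: (-t[0], t[1]))
    return [step for _, _, step in hits[:top_k]]


trajectory = [
    "pip install torch",
    "wrote src/train.py",
    "saved model to results/model.pt",
    "plotted loss curve",
]
relevant = retrieve(trajectory, "save the trained model to results/model.pt")
print(relevant)  # ['saved model to results/model.pt']
```

Filtering the log this way keeps the context passed to the verification step short, which matters when trajectories are long.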

Ask

Determines if a requirement is satisfied based on gathered context

Model or implementation: gpt-4o-2024-05-13

Novel Architectural Elements

Modular decomposition of the 'Judge' into specialized sub-agents (Graph, Locate, Read, Retrieve, Ask) rather than a single prompt
Integration of workspace analysis (files, code execution results) with trajectory analysis (logs) for verification

Modeling

Base Model: gpt-4o-2024-05-13

Compute: Agent-as-a-Judge cost $30.58 in API calls for the full benchmark evaluation (vs $1297.50 for humans). Average time 118.43 minutes.
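
The cost figures above can be checked against the reported 97.64% cost reduction with a few lines of arithmetic. The human evaluation time is not stated in this summary; the value computed below is only what the reported 97.72% time reduction would imply if 118.43 minutes is the judge's total evaluation time, and it is approximate because the percentage is rounded.

```python
human_cost_usd = 1297.50
judge_cost_usd = 30.58
judge_minutes = 118.43
reported_time_saving = 0.9772  # 97.72%, as reported in the highlights

# Relative cost reduction of the agent judge versus the human panel.
cost_saving_pct = (1 - judge_cost_usd / human_cost_usd) * 100

# Human time implied by the reported time reduction (derived, not reported here).
implied_human_hours = judge_minutes / (1 - reported_time_saving) / 60

print(f"cost saving: {cost_saving_pct:.2f}%")                 # cost saving: 97.64%
print(f"implied human time: {implied_human_hours:.1f} h")     # implied human time: 86.6 h
```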

Comparison to Prior Work

vs. LLM-as-a-Judge: Adds active information gathering tools (reading files, navigating directories) instead of passive text ingestion
vs. SWE-Bench Evaluation: Evaluates intermediate steps and hierarchical requirements via agentic inspection, not just final pass/fail tests
vs. MLE-Bench [not cited in paper]: Focuses on step-by-step development process verification rather than just final competition submission performance

Limitations

Planning module (for the judge itself) was found to be unstable/detrimental and excluded from the final best configuration
Memory module propagated errors from previous judgments, negatively affecting performance
Search module underutilized because generated workspaces were too small (hundreds of lines) to require complex search

Reproducibility

Code: publicly available at https://github.com/metauto-ai/agent-as-a-judge

Dataset: DevAI is available at https://huggingface.co/devai-benchmark

Baselines: the code generation agents evaluated (MetaGPT, GPT-Pilot, OpenHands) are open-source.

📊 Experiments & Results

Evaluation Setup

Evaluation of 3 AI developer agents (MetaGPT, GPT-Pilot, OpenHands) on the DevAI benchmark using human, LLM, and agent judges

Benchmarks:

DevAI (Automated AI Development) [New]

Metrics:

Judge Shift (deviation from human consensus)
Alignment Rate (% match with human consensus)
Precision-Recall Curves

Statistical methodology: The consensus vote of 3 human experts is used as ground truth; disagreement rates among the experts are also analyzed
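
The consensus and the two headline metrics above can be sketched as follows. This is a minimal illustration under assumptions: consensus is taken as a per-requirement majority vote, and Judge Shift is assumed to be the absolute gap, in percentage points, between the judge's and the consensus's satisfied-requirement rates. The function names and the toy votes are hypothetical.

```python
def majority_vote(votes: list[list[bool]]) -> list[bool]:
    """Per-requirement consensus; each inner list holds one evaluator's verdicts."""
    n_evaluators = len(votes)
    return [sum(column) * 2 > n_evaluators for column in zip(*votes)]


def alignment_rate(judge: list[bool], consensus: list[bool]) -> float:
    """Percentage of requirements where the judge matches the consensus."""
    matches = sum(j == c for j, c in zip(judge, consensus))
    return 100 * matches / len(consensus)


def judge_shift(judge: list[bool], consensus: list[bool]) -> float:
    """Absolute gap in satisfied-requirement rate (assumed definition)."""
    judge_rate = 100 * sum(judge) / len(judge)
    consensus_rate = 100 * sum(consensus) / len(consensus)
    return abs(judge_rate - consensus_rate)


# Toy data: three human evaluators and one AI judge over five requirements.
humans = [
    [True, True, False, False, True],
    [True, False, False, True, True],
    [True, True, False, False, False],
]
consensus = majority_vote(humans)
ai_judge = [True, True, False, True, True]

print(consensus)                            # [True, True, False, False, True]
print(alignment_rate(ai_judge, consensus))  # 80.0
print(judge_shift(ai_judge, consensus))     # 20.0
```

The two metrics answer different questions: Alignment Rate counts per-requirement agreement, while Judge Shift (as assumed here) only compares aggregate pass rates, so a judge can have a small shift and still disagree on individual requirements.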

Key Results

Comparison of judge reliability against human consensus for evaluating the OpenHands agent.

Benchmark	Metric	Baseline	This Paper	Δ
DevAI (OpenHands evaluation)	Alignment Rate (%)	60.38	90.44	+30.06
DevAI (OpenHands evaluation)	Alignment Rate (%)	70.76	92.07	+21.31
DevAI (OpenHands evaluation)	Judge Shift (lower is better)	11.12	4.15	-6.97

Ablation study on Agent-as-a-Judge components, showing the value of active file interactions.

Benchmark	Metric	Baseline	This Paper	Δ
DevAI	Alignment Rate (%)	65.03	90.44	+25.41
DevAI	Alignment Rate (%)	82.24	90.44	+8.20

Experiment Figures

Error rates of individual human evaluators vs. majority vote vs. Agent-as-a-Judge relative to the Human Consensus ground truth.

Main Takeaways

Agent-as-a-Judge is practically as reliable as a human evaluator (90% alignment) but costs ~2.3% of the price and time.
The 'Locate' and 'Read' modules are critical for performance, confirming that direct file access is necessary for accurate code evaluation.
Current top agentic developers (OpenHands, GPT-Pilot) struggle with real-world tasks, completing only ~29% of requirements in DevAI.
Majority voting among humans corrects significant individual errors (individual error rates ~23% drop to ~6% with consensus).

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM-based agents (planning, tool use)
Familiarity with code generation benchmarks (SWE-Bench, HumanEval)
Basic knowledge of 'LLM-as-a-Judge' evaluation paradigms

Key Terms

LLM-as-a-Judge: Using a Large Language Model to evaluate the quality of outputs from other models, typically for text generation

DevAI: A new benchmark dataset introduced in this paper containing 55 comprehensive AI development tasks with hierarchical requirements

Trajectory: The sequence of thoughts, actions, and observations an agent generates while solving a task

Judge Shift: A metric measuring the deviation of an AI judge's evaluation from the consensus of human judges

Alignment Rate: The percentage of time an AI judge's decision matches the human consensus decision

DAG: Directed Acyclic Graph, a structure used here to model dependencies between task requirements

Pass@1: A metric measuring the percentage of problems solved with a single attempt

SWE-Bench: A benchmark for evaluating large language models on real-world software engineering issues from GitHub

HumanEval: A benchmark dataset of Python coding problems used to evaluate code generation models
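
As an illustration of the DAG term above, requirement dependencies can be processed in topological order with Python's standard `graphlib` module. The rule used here, that a requirement only counts as met when all of its prerequisites are met, is an assumption about how the dependencies might be applied; the task and verdicts are hypothetical.

```python
from graphlib import TopologicalSorter

# Requirement id -> ids of its prerequisite requirements (hypothetical task).
prereqs = {0: [], 1: [0], 2: [0], 3: [1, 2]}

# Verdicts obtained by checking each requirement in isolation.
independent = {0: True, 1: False, 2: True, 3: True}

# Walk the DAG so that prerequisites are always resolved first.
effective: dict[int, bool] = {}
for req in TopologicalSorter(prereqs).static_order():
    effective[req] = independent[req] and all(effective[p] for p in prereqs[req])

print(effective[3])  # False
```

Requirement 3 passes in isolation but fails once dependencies are considered, because its prerequisite 1 was not satisfied.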