LLM-as-a-Judge: Using a Large Language Model to evaluate the quality of outputs from other models, typically for text generation
DevAI: A new benchmark dataset introduced in this paper containing 55 comprehensive AI development tasks with hierarchical requirements
Trajectory: The sequence of thoughts, actions, and observations an agent generates while solving a task
Judge Shift: A metric measuring the deviation of an AI judge's evaluation from the consensus of human judges
Alignment Rate: The percentage of decisions in which an AI judge's verdict matches the human consensus verdict
DAG: Directed Acyclic Graph, a structure used here to model dependencies between task requirements
Pass@1: A metric measuring the percentage of problems solved on the first attempt
SWE-Bench: A benchmark for evaluating large language models on real-world software engineering issues from GitHub
HumanEval: A benchmark dataset of Python coding problems used to evaluate code generation models
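
The two judge-quality metrics defined above (Alignment Rate and Judge Shift) can be made concrete with a small sketch. This is an illustration under stated assumptions, not the paper's implementation: the function names `alignment_rate` and `judge_shift` are hypothetical, each verdict is assumed to be a boolean (requirement satisfied or not), and Judge Shift is assumed here to be the absolute gap, in percentage points, between the share of requirements the AI judge marks as satisfied and the share the human consensus marks as satisfied.

```python
from typing import Sequence


def _check(judge: Sequence[bool], consensus: Sequence[bool]) -> None:
    """Validate that both verdict lists are non-empty and the same length."""
    if len(judge) != len(consensus):
        raise ValueError("judge and consensus must have the same length")
    if len(judge) == 0:
        raise ValueError("at least one verdict is required")


def alignment_rate(judge: Sequence[bool], consensus: Sequence[bool]) -> float:
    """Percentage of verdicts on which the AI judge agrees with the human consensus."""
    _check(judge, consensus)
    matches = sum(j == c for j, c in zip(judge, consensus))
    return 100.0 * matches / len(judge)


def judge_shift(judge: Sequence[bool], consensus: Sequence[bool]) -> float:
    """Assumed definition: absolute difference, in percentage points, between
    the judge's satisfaction rate and the consensus satisfaction rate."""
    _check(judge, consensus)
    judge_rate = 100.0 * sum(judge) / len(judge)
    consensus_rate = 100.0 * sum(consensus) / len(consensus)
    return abs(judge_rate - consensus_rate)


# Hypothetical verdicts for four requirements (illustrative data only).
ai_verdicts = [True, True, False, True]
human_consensus = [True, False, False, True]

print(alignment_rate(ai_verdicts, human_consensus))  # 75.0
print(judge_shift(ai_verdicts, human_consensus))     # 25.0
```

Note that the two numbers capture different things: under this reading, a judge could have a Judge Shift of zero while still disagreeing with the consensus on individual requirements, which is why the Alignment Rate is computed per verdict.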