Agentic Benchmark: An evaluation suite where AI agents interact with tools and environments (e.g., coding, browsing) to solve multi-step tasks
Task Validity: The condition that a task should be solvable if and only if the agent possesses the specific target capability (no shortcuts, no impossible tasks)
Outcome Validity: The condition that the automatic evaluation result (e.g., test pass) accurately reflects whether the task was actually completed successfully
Fuzz Testing: A software testing technique that inputs invalid, unexpected, or random data into a program to find bugs or verify correctness
Metric Hacking: When an agent optimizes for the evaluation metric (e.g., score) without actually achieving the intended task goal
Unit Testing: Testing individual components of software (e.g., functions) in isolation
E2E Testing: End-to-End Testing—simulating complete user scenarios to validate the system as a whole