Agent Scaffold: The code wrapping an LLM that defines its tools, system prompts, memory management, and control flow (how it loops and acts)
Pareto frontier: The set of non-dominated solutions, i.e., those for which no metric (e.g., accuracy) can be improved without worsening another (e.g., cost)
Orchestration: The automated management of computer systems and software; here, managing hundreds of VMs to run agent benchmarks in parallel
Rollout: A single complete execution of an agent attempting to solve a specific benchmark task from start to finish
Docent: A specific tool used in this paper for automated log analysis, using LLMs to check transcripts against rubrics for errors or specific behaviors
LiteLLM: A library that provides a unified interface for calling different LLM providers (OpenAI, Anthropic, etc.), handling API differences automatically
Instruction Violation: When an agent fails to follow specific constraints set in the prompt (e.g., 'return a blank string if unsure') even if it tries to solve the task
Weave: A logging and telemetry tool for LLM applications used here to capture execution traces
Inference-time compute: The computational effort spent while generating a response (e.g., reasoning tokens in o1/o3 models), as opposed to compute spent during training
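
The Pareto frontier entry can be made concrete with a small sketch. It assumes two metrics only, accuracy (higher is better) and cost (lower is better), and the agent names and numbers are invented for illustration; they are not results from the paper.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str        # hypothetical agent label
    accuracy: float  # higher is better
    cost: float      # lower is better (e.g., dollars per run)


def dominates(a: Result, b: Result) -> bool:
    """True if a is at least as good as b on both metrics and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Keep only the results that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Made-up example data, for illustration only.
results = [
    Result("agent_a", 0.60, 10.0),
    Result("agent_b", 0.55, 12.0),  # dominated by agent_a: less accurate and more expensive
    Result("agent_c", 0.70, 25.0),
    Result("agent_d", 0.40, 2.0),
]

frontier = pareto_frontier(results)
print([r.name for r in frontier])
```

Here agent_b drops out because agent_a is both cheaper and more accurate; the remaining three each trade accuracy against cost, so none of them can be improved on one metric without giving up the other.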