OpenTelemetry: A standardized framework and format for generating and collecting telemetry data (traces, metrics, logs) from software, used here to structure agent logs
Agentic Workflow: A system where an LLM dynamically selects tools and plans steps to solve a problem, often involving loops and multi-step reasoning
Trace: A chronological record of the execution steps taken by an agent, including inputs, outputs, tool calls, and system responses
Span: A single operation within a trace, such as a specific tool call or an LLM generation step
SWE-Bench: A benchmark for evaluating LLMs on real-world software engineering issues from GitHub
GAIA: General AI Assistants benchmark, a dataset of real-world questions that require reasoning, tool use, and multimodal understanding
Joint Accuracy: A metric that counts a prediction as correct only if the model identifies BOTH the correct step location AND the correct error category
Hallucination: When an LLM generates content that is factually incorrect or ungrounded; in this context, the term also covers tool-related hallucinations, where an agent invents tool outputs
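
The Joint Accuracy entry above can be illustrated with a minimal Python sketch. The `Annotation` type, its field names, the step identifiers, and the category labels are all hypothetical, and the sketch assumes predictions and references are already paired one-to-one; how the source pairs and aggregates them is not stated here.

```python
from typing import NamedTuple, Sequence


class Annotation(NamedTuple):
    """One annotated error (hypothetical structure, for illustration only)."""
    step: str      # identifier of the step (e.g. a span) where the error occurs
    category: str  # error category label


def joint_accuracy(predictions: Sequence[Annotation],
                   references: Sequence[Annotation]) -> float:
    """Fraction of paired predictions whose step AND category both match."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be paired one-to-one")
    if not references:
        return 0.0
    # A prediction scores only when both fields agree with the reference.
    correct = sum(
        pred.step == ref.step and pred.category == ref.category
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Illustrative data only; identifiers and labels are made up.
references = [
    Annotation("span_3", "tool_hallucination"),
    Annotation("span_7", "formatting_error"),
    Annotation("span_2", "tool_hallucination"),
    Annotation("span_9", "planning_error"),
]
predictions = [
    Annotation("span_3", "tool_hallucination"),  # both correct
    Annotation("span_7", "planning_error"),      # right step, wrong category
    Annotation("span_4", "tool_hallucination"),  # wrong step, right category
    Annotation("span_9", "planning_error"),      # both correct
]

score = joint_accuracy(predictions, references)
print(f"joint accuracy = {score:.2f}")
```

Here two of the four predictions match on both fields, so the partially correct ones (right step only, or right category only) contribute nothing to the score.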