OTel: OpenTelemetry, an open-source observability framework for generating, collecting, and exporting telemetry data such as traces, metrics, and logs
Agentic System Behavioral Benchmarking: A proposed evaluation method that analyzes an agent's execution patterns, decision-making, and interactions rather than only its final outputs
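Graph Edit Distance (GED): A measure of dissimilarity between two graphs, defined as the minimum cost of the node and edge insertions, deletions, and substitutions needed to transform one graph into the other; used here to quantify how much an agent's execution path differs between runs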
MSE: Mean Squared Error, the average of the squared differences between produced and reference values; used here to measure accuracy deviations in numerical outputs
LangGraph: A library for building stateful, multi-actor applications with LLMs, used to structure the agent's workflow
Coefficient of Variation (CV): A statistical measure of dispersion (standard deviation divided by the mean), used to quantify variability across multiple agent runs
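
The two numerical measures defined above (MSE and CV) can be illustrated with a minimal Python sketch. The function names and sample inputs are hypothetical, and the use of the population standard deviation is an assumption; the definitions above do not say whether the population or sample form is intended.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return standard deviation divided by the mean for a set of run-level measurements."""
    mu = mean(values)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    # Assumption: population standard deviation; swap in statistics.stdev
    # if the sample form is the one intended.
    return pstdev(values) / mu


def mean_squared_error(produced, reference):
    """Return the average squared difference between paired numerical outputs."""
    if len(produced) != len(reference):
        raise ValueError("inputs must have the same length")
    return mean((p - r) ** 2 for p, r in zip(produced, reference))
```

For example, `coefficient_of_variation` could be applied to the latencies or token counts of repeated runs of the same task, and `mean_squared_error` to an agent's numerical answers against known reference values.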