Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems

📝 Paper Summary

Agentic system evaluation
Observability for LLM agents

The paper proposes replacing outcome-based black-box evaluation with Agentic System Behavioral Benchmarking, utilizing a new observability taxonomy to analyze non-deterministic execution flows, decision-making processes, and variability in agentic systems.

Core Problem

Traditional black-box benchmarking evaluates only final outputs, failing to capture the non-deterministic execution flows, stochastic decision-making, and intermediate reasoning errors inherent in complex multi-agent systems.

Why it matters:

Minor input phrasing changes can radically alter execution paths (flow variability) even if the final output is correct, hiding potential fragility.
Practitioners struggle with root cause diagnosis; 77% of surveyed users report difficulties, and 80% identify non-deterministic execution as a major challenge.
Existing tools focus on LLM-level metrics (tokens, latency) rather than agentic behaviors like tool selection, workflow adherence, and multi-agent coordination.

Concrete Example: A calculator agent processing '((6+2)×[8−3×2])' might execute correctly on one run but fail or take a convoluted path on another because LLM planning is stochastic. Outcome-only accuracy metrics such as MSE miss this internal instability: the Graph Edit Distance between execution traces of the same input shows a 63% coefficient of variation.

Key Novelty

Agentic System Behavioral Benchmarking & ABBench

Introduces a standardized taxonomy for agent observability that extends OpenTelemetry to track agent-specific entities (Workflows, Tools, Agents, Organizations) rather than just LLM calls.
Proposes a 'White-Box' analytics approach that evaluates the *process* (execution graph structure, tool usage patterns) alongside the *outcome*.
Defines specific metrics for variability, such as Graph Edit Distance (GED) between execution traces of identical inputs, to quantify non-determinism.
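As a rough illustration of the flow-variability idea, the sketch below compares execution traces of the same input. It is not the paper's implementation. It assumes each trace is an ordered list of step names, treats every distinct name as one graph node (so repeated steps collapse into a single node), and matches nodes by label with unit edit costs, in which case the edit distance reduces to counting node and edge insertions and deletions. The function names and the sample runs are hypothetical.

```python
from itertools import combinations


def trace_to_graph(steps):
    """Turn an ordered list of step names into (nodes, edges) of a directed graph."""
    nodes = set(steps)
    edges = set(zip(steps, steps[1:]))  # consecutive steps become directed edges
    return nodes, edges


def labeled_ged(g1, g2):
    """Unit-cost edit distance when nodes are matched by label (an assumption):
    the number of nodes and edges present in exactly one of the two graphs."""
    (nodes1, edges1), (nodes2, edges2) = g1, g2
    return len(nodes1 ^ nodes2) + len(edges1 ^ edges2)


def pairwise_ged(traces):
    """Distance for every unordered pair of runs of the same input."""
    graphs = [trace_to_graph(t) for t in traces]
    return [labeled_ged(a, b) for a, b in combinations(graphs, 2)]


# Hypothetical traces from three runs on one input.
runs = [
    ["plan", "add", "multiply", "subtract", "answer"],
    ["plan", "multiply", "subtract", "add", "answer"],
    ["plan", "add", "multiply", "subtract", "answer"],
]
distances = pairwise_ged(runs)
print(distances)  # [6, 0, 6]
```

Runs 1 and 3 follow the same path, so their distance is 0. Run 2 visits the same tools in a different order, which shows up as six differing edges even though the node sets are identical. General GED without label matching is much more expensive to compute.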

Architecture

Entity Relationship Diagram of the proposed Agentic System Observability model.

Evaluation Highlights

User study (n=38) confirms 80% of practitioners view non-deterministic flows as a major challenge, validating the problem definition.
Experiments on a calculator agent reveal 63% coefficient of variation in execution flow structure (Graph Edit Distance) for identical mathematical inputs across 5 runs.
Natural language variability causes significant instability: 19% coefficient of variation in accuracy (MSE) when processing identical math problems phrased in natural language.

Breakthrough Assessment

7/10

Strong methodological contribution proposing a necessary shift from outcome-based to behavioral evaluation. The taxonomy and dataset (ABBench) are valuable, though the specific algorithmic innovations are less emphasized than the framework itself.

⚙️ Technical Details

Problem Definition

Setting: Evaluation and observability of non-deterministic agentic systems

Inputs: Agent runtime logs (traces), natural language inputs, system configurations

Outputs: Analytics insights: Flow variability metrics (GED), accuracy stability (MSE variance), cost/latency statistics, and behavioral recommendations

Pipeline Flow

Instrumentation (OTel extension) → Log/Trace Collection
Trace Analysis (Graph construction) → Metric Computation
Benchmarking (Comparison against baselines/stability goals)

System Modules

Instrumentation Layer

Captures runtime events for agent entities (Start/End/Fail) and binds them to OTel spans

Model or implementation: OpenTelemetry extension
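The lifecycle capture described for this layer can be sketched without any tracing library. The example below is a stand-in, not the OpenTelemetry API and not the paper's extension: `Span`, `Recorder`, and the entity type strings are hypothetical names used only to show how Start/End/Fail events might be attached to nested spans.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical span record; a real system would use OTel spans instead."""
    name: str
    entity_type: str  # e.g. "workflow", "agent", "tool"
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    events: list = field(default_factory=list)


class Recorder:
    """Collects spans and records start/end/fail events for each entity."""

    def __init__(self):
        self.spans = []
        self._stack = []  # currently open spans, innermost last

    @contextmanager
    def entity(self, entity_type, name):
        parent_id = self._stack[-1].span_id if self._stack else None
        span = Span(name=name, entity_type=entity_type, parent_id=parent_id)
        self.spans.append(span)
        self._stack.append(span)
        span.events.append(("start", time.time()))
        try:
            yield span
        except Exception:
            span.events.append(("fail", time.time()))
            raise
        else:
            span.events.append(("end", time.time()))
        finally:
            self._stack.pop()


rec = Recorder()
with rec.entity("workflow", "calculate"):
    with rec.entity("tool", "add"):
        pass
    try:
        with rec.entity("tool", "divide"):
            raise ZeroDivisionError("division by zero")
    except ZeroDivisionError:
        pass  # the failure is already recorded on the span

for span in rec.spans:
    print(span.entity_type, span.name, [kind for kind, _ in span.events])
```

The parent pointer on each span is what later allows a flat list of records to be turned back into an execution graph.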

Analytics Engine

Processes raw traces to reconstruct execution graphs and compute variability metrics

Model or implementation: Custom Python analytics logic

Novel Architectural Elements

Extension of OTel semantic conventions specifically for Agentic Systems (introducing 'GenAI Events' for entity lifecycle tracking)
Hierarchical Task Flow analysis that reconstructs directed acyclic graphs of subtasks from flat traces to measure structural divergence
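One plausible way to rebuild a task graph from flat trace records is sketched below. The record fields (`span_id`, `parent_id`, `name`, `start`) and the two edge kinds are assumptions made for illustration; the paper's actual schema and reconstruction logic may differ.

```python
from collections import defaultdict


def build_task_graph(spans):
    """Return (roots, edges) reconstructed from a flat list of span records.

    Edges are (source_name, target_name, kind) tuples, where kind is
    "contains" (parent to child) or "next" (sibling to the following sibling).
    """
    by_id = {s["span_id"]: s for s in spans}
    children = defaultdict(list)
    roots = []
    # Visit spans in start-time order so siblings are chronologically sorted.
    for span in sorted(spans, key=lambda s: s["start"]):
        parent_id = span.get("parent_id")
        if parent_id in by_id:
            children[parent_id].append(span["span_id"])
        else:
            roots.append(span["name"])  # no known parent: treat as a root task

    edges = []
    for parent_id, kids in children.items():
        parent_name = by_id[parent_id]["name"]
        for kid in kids:
            edges.append((parent_name, by_id[kid]["name"], "contains"))
        for first, second in zip(kids, kids[1:]):
            edges.append((by_id[first]["name"], by_id[second]["name"], "next"))
    return roots, edges


# Hypothetical flat trace, deliberately out of order.
flat_trace = [
    {"span_id": "c", "parent_id": "a", "name": "tool:multiply", "start": 2.0},
    {"span_id": "a", "parent_id": None, "name": "workflow:calculate", "start": 0.0},
    {"span_id": "b", "parent_id": "a", "name": "tool:add", "start": 1.0},
    {"span_id": "d", "parent_id": "a", "name": "tool:subtract", "start": 3.0},
]
roots, edges = build_task_graph(flat_trace)
for edge in edges:
    print(edge)
```

Because "contains" edges point from parent to child and "next" edges follow start times, the result stays acyclic as long as the parent pointers themselves do not loop. Graphs built this way for two runs of the same input can then be compared structurally.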

Modeling

Base Model: OpenAI GPT-4o (used for the Calculator agent experiments)

Compute: Experiments used OpenAI API (GPT-4o). Specific compute hardware not reported (inference only).

Comparison to Prior Work

vs. OpenLLMetry: Focuses on higher-level Agentic entities (Workflows, Tools, Organizations) rather than just LLM/VectorDB calls
vs. LangSmith: Introduces quantitative behavioral metrics like Graph Edit Distance for flow variability, rather than just pass/fail evaluation or qualitative inspection
vs. AgentBench [not cited in paper]: Focuses on observability/internal process analytics rather than just end-to-end task performance scores

Limitations

The approach relies on instrumentation; uninstrumented or closed-source black-box agents cannot be analyzed to this depth
Calculating Graph Edit Distance (GED) can be computationally expensive for very complex, long-running agent traces
Experiments were limited to a specific Calculator agent domain; generalization to other agent types (e.g., coding, browsing) is discussed but not empirically proven in the paper

Reproducibility

Code: https://github.com/genai-analytics/publications

The repository is publicly available and includes the 'Agentic Calculator' implementation, the ABBench dataset, and experimentation scripts. The OTel extensions are described only conceptually.

📊 Experiments & Results

Evaluation Setup

Repeated execution of a non-deterministic 'Agentic Calculator' on mathematical and natural language tasks to measure variability.

Benchmarks:

ABBench (Agent Analytics Behavioral Benchmark) (Mathematical reasoning and tool use) [New]

Metrics:

Mean Squared Error (MSE) for accuracy
Graph Edit Distance (GED) for flow variability
Coefficient of Variation (CV) for cost, time, and tokens
Statistical methodology: Not explicitly reported in the paper
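The MSE and CV metrics listed above can be computed with the standard library alone. The sketch below is illustrative: the paper does not state whether it uses the population or the sample standard deviation, so the population form is an assumption here, and all numbers are made up rather than taken from ABBench.

```python
from statistics import mean, pstdev


def mse(predictions, targets):
    """Mean squared error between numeric outputs and the expected values."""
    return mean([(p - t) ** 2 for p, t in zip(predictions, targets)])


def coefficient_of_variation(values):
    """Standard deviation divided by the mean, expressed as a percentage.

    Uses the population standard deviation (an assumption).
    """
    mu = mean(values)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * pstdev(values) / mu


# Hypothetical outputs of five runs on '((6+2)×[8−3×2])', whose value is 16.
outputs = [16.0, 16.0, 14.0, 16.0, 16.0]
accuracy_error = mse(outputs, [16.0] * len(outputs))

# Hypothetical per-run latencies in seconds.
latencies = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
latency_cv = coefficient_of_variation(latencies)

print(f"MSE: {accuracy_error:.2f}")
print(f"Latency CV: {latency_cv:.1f}%")
```

The same `coefficient_of_variation` helper could be applied to per-run token counts, costs, or Graph Edit Distance values, since CV is unit-free and comparable across quantities measured on different scales.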

Key Results

Experiments quantify the high variability in agent execution flows even for deterministic mathematical tasks when processed by agentic systems.

Benchmark	Metric	Baseline	This Paper	Δ
Pure Math Dataset	Coefficient of Variation, Flow/GED (%)	0	63	+63
Natural Language Math Dataset	Coefficient of Variation, Accuracy/MSE (%)	0	19	+19

User study results highlight the gap between practitioner needs and current tool capabilities.

Benchmark	Metric	Baseline	This Paper	Δ
Practitioner Survey (n=38)	Agreement on 'Non-deterministic flow is a major challenge' (%)	0	80	+80

Experiment Figures

A hierarchical task flow graph visualization for the calculator agent.

Main Takeaways

Agentic systems exhibit high non-determinism (63% flow variability) even for clear mathematical tasks, necessitating tools that visualize and measure execution paths, not just outputs.
Natural language inputs significantly degrade consistency (19% accuracy variability) compared to structured inputs, confirming the impact of prompt sensitivity.
Practitioners overwhelmingly (76%) prioritize 'agentic flow understanding' over simple metrics, validating the proposed white-box analytics approach.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Agentic AI architectures (Agents, Tools, Workflows)
Familiarity with OpenTelemetry (OTel) observability standards
Knowledge of benchmarking methodologies

Key Terms

OTel: OpenTelemetry—an open-source observability framework for generating and collecting telemetry data like traces, metrics, and logs

Agentic System Behavioral Benchmarking: A proposed evaluation method focusing on analyzing execution patterns, decision-making, and interactions rather than just final outputs

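Graph Edit Distance (GED): A measure of dissimilarity between two graphs, defined as the minimum total cost of node and edge insertions, deletions, and substitutions needed to transform one graph into the other; used here to quantify how much an agent's execution path differs between runs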

MSE: Mean Squared Error—used here to measure accuracy deviations in numerical outputs

LangGraph: A library for building stateful, multi-actor applications with LLMs, used to structure the agent's workflow

Coefficient of Variation (CV): A statistical measure of relative dispersion (standard deviation divided by the mean, often expressed as a percentage), used to quantify variability across multiple agent runs