Process-Centric Analysis of Agentic Software Systems

📝 Paper Summary

Agentic software systems, Agent evaluation methodologies, Software engineering agents

The paper introduces Graphectory, a graph-based representation of agent trajectories, to analyze how agents solve problems rather than just if they succeed, enabling real-time detection and correction of inefficient strategies.

Core Problem

Current agent evaluation is outcome-centric (success/failure), which masks the recurring inefficiencies, chaotic behaviors, and missing validation in agent trajectories, whether those trajectories end in success or failure.

Why it matters:

Outcome-centric metrics fail to explain how agents reason, plan, or adapt strategies, preventing systematic improvements.
Agents often succeed by chance despite inefficient processes (e.g., editing files line by line instead of applying a patch), and outcome metrics score the efficient and inefficient runs identically.
Without process visibility, it is difficult to distinguish systematic reasoning from stochastic luck or to intervene when agents get stuck in loops.

Concrete Example: Two agents fix the same bug (django-10973). SWE-agentDev succeeds but takes 15 steps with repetitive edits and weak validation. SWE-agentDSK-V3 also succeeds in 9 steps but skips validation entirely. Outcome metrics rate them equal (both 'Success'), hiding the risky no-validation strategy of the second agent.

Key Novelty

Graphectory and Langutory: Graph-based Trajectory Representation

Encodes linear agent logs into a graph (Graphectory) where nodes are actions and edges capture both temporal sequence and structural navigation (e.g., file hierarchy).
Abstracts this graph into a string sequence (Langutory) representing logical phases (Localization, Patching, Validation), enabling regex-like pattern mining for strategy analysis.
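
The Graphectory encoding described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Action` and `Graphectory` classes, the tool names, and the rule that a structural edge links an action on a directory to actions on paths inside it are all assumptions made for the example. Repeated actions are mapped to the same node so that loops become visible as edges pointing back to earlier nodes.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Action:
    tool: str    # hypothetical tool name, e.g. "open" or "edit"
    target: str  # path the action operates on


@dataclass
class Graphectory:
    nodes: list = field(default_factory=list)     # distinct actions, first-seen order
    temporal: list = field(default_factory=list)  # (src, dst) node indices, one per step
    structural: set = field(default_factory=set)  # (parent, child) node indices


def build_graphectory(log):
    """Turn a linear action log into nodes plus temporal and structural edges."""
    g = Graphectory()
    index = {}
    prev = None
    for action in log:
        if action not in index:
            index[action] = len(g.nodes)
            g.nodes.append(action)
        cur = index[action]
        if prev is not None:
            g.temporal.append((prev, cur))  # chronological order of execution
        prev = cur
    # Assumed structural rule: a target that is an ancestor path of another
    # target subsumes it (directory -> file).
    for a, i in index.items():
        for b, j in index.items():
            if i != j and PurePosixPath(a.target) in PurePosixPath(b.target).parents:
                g.structural.add((i, j))
    return g
```

In this sketch a log such as list `src`, open `src/app.py`, edit it, open it again, edit it again produces three nodes, and the second open appears as a temporal edge from the edit node back to the open node.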

Architecture

Visual comparison of Graphectory and Langutory for two agents (SWE-agentDev vs SWE-agentDSK-V3) solving the same issue.

Evaluation Highlights

Online monitoring with intervention improved resolution rates by 11.9% on average (up to 23.5%) across problematic instances.
Intervention repaired trajectories in 94.1% of consistent failure cases (86 instances), turning chaotic loops into valid workflows.
Analysis of 4000 trajectories reveals that stronger LLMs (e.g., Claude 4) use more complex structures (higher node/edge counts) reflecting deeper exploration.

Breakthrough Assessment

8/10

Significant shift from outcome-based to process-based evaluation. The Graphectory abstraction is a powerful tool for debugging agents, and the demonstrated ability to fix agents in real-time is highly impactful.

⚙️ Technical Details

Problem Definition

Setting: Software engineering tasks (issue repair) performed by autonomous agents

Inputs: GitHub issue description and codebase

Outputs: Patch file to resolve the issue and a trajectory of actions taken

Pipeline Flow

Agent Execution -> Trajectory Log Generation
Graphectory Construction (Nodes=Actions, Edges=Temporal/Structural)
Phase Labeling (Map actions to L, P, V phases)
Langutory Abstraction (Compress graph to phase string)
Online Monitor (Analyze Langutory/Graphectory for patterns)
Intervention (If issue detected -> Rollback/Message -> Agent)

System Modules

Trajectory Logger

Captures linear sequence of agent actions, observations, and reasoning

Model or implementation: N/A (Logging infrastructure)

Graphectory Builder (Analysis Engine)

Converts logs into graph structure, adding structural edges based on file/directory relationships

Model or implementation: Rule-based algorithm

Phase Labeler (Analysis Engine)

Annotates each node with a logical phase (Localization, Patching, Validation)

Model or implementation: Algorithm 1 (Heuristic + Map)
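
A map-plus-heuristic labeler of this kind might look like the sketch below. It is not the paper's Algorithm 1: the tool names in `PHASE_BY_TOOL`, the filename heuristic, and the `"?"` label for unmapped tools are hypothetical choices made for illustration.

```python
# Hypothetical tool-to-phase map: L = Localization, P = Patching, V = Validation.
PHASE_BY_TOOL = {
    "find_file": "L",
    "search_dir": "L",
    "open": "L",
    "edit": "P",
    "create": "P",
    "pytest": "V",
    "python": "V",
}


def label_phase(tool, argument=""):
    """Return the phase letter for one action; "?" marks an unmapped tool."""
    filename = argument.rsplit("/", 1)[-1]
    # Assumed heuristic: writing a test or reproduction script is validation
    # work, even though the tool itself is an editing tool.
    if tool in {"create", "edit"} and filename.startswith(("test_", "reproduce")):
        return "V"
    return PHASE_BY_TOOL.get(tool, "?")


def label_trajectory(steps):
    """steps: iterable of (tool, argument) pairs in execution order."""
    return [label_phase(tool, argument) for tool, argument in steps]
```

Keeping unmapped tools visible as `"?"` rather than defaulting them to a phase makes gaps in the map easy to spot, which matters because the approach depends on that mapping being accurate.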

Online Monitor & Intervener

Real-time analysis of partial Graphectory to flag inefficiencies and send diagnostic messages/rollbacks

Model or implementation: Heuristic-based (H1-H4)
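
The summary does not spell out heuristics H1-H4, so the sketch below uses two stand-in checks to show the general shape of an online monitor: one flags an action repeated too often in a recent window, the other flags a patch that has gone unvalidated for many steps. The function name, thresholds, and messages are hypothetical, and rollback is not modeled.

```python
from collections import Counter


def check_partial(phases, actions, max_repeats=3, window=6):
    """Inspect a partial trajectory; return a diagnostic message or None.

    phases:  phase letters so far, e.g. ["L", "P", "P"]
    actions: action strings so far, aligned with phases
    """
    # Stand-in check 1: the same action dominates the recent window.
    recent = actions[-window:]
    if recent:
        action, count = Counter(recent).most_common(1)[0]
        if count >= max_repeats:
            return (f"Repeated {action!r} {count} times in the last "
                    f"{len(recent)} steps; try a different approach.")
    # Stand-in check 2: a patch was made and no validation has followed.
    if "P" in phases:
        since_patch = phases[phases.index("P"):]
        if "V" not in since_patch and len(since_patch) > window:
            return "Patch applied without validation; run a test or reproduction script."
    return None
```

A monitor like this would be called after every agent step on the trajectory so far, and any returned message would be sent back to the agent as guidance.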

Novel Architectural Elements

Online feedback loop using graph-based process metrics to trigger interventions (rollback/guidance) during agent execution
Dual-edge graph representation (Graphectory) combining temporal execution flow with structural problem-space navigation

Modeling

Base Model: Evaluated on DeepSeek-V3, DeepSeek-R1, Devstral-small-2505, Claude Sonnet 3.5 (referred to as Sonnet 4 in paper text)

Comparison to Prior Work

vs. Outcome-centric: Graphectory evaluates the *process* (efficiency, strategy) not just the result
vs. Manual Taxonomies: Graphectory automates the analysis, scaling to thousands of trajectories and enabling real-time monitoring (a distinct contribution, though the paper does not cite manual taxonomies as a comparison)

Limitations

Relies on accurate mapping of tools to logical phases, which requires domain knowledge/maintenance
Analysis overhead is low, but the method requires access to internal trajectory steps, not just the final output
Heuristics for intervention (H1-H4) are manually defined and may need tuning for different domains
Evaluation focused on software engineering domain; generalization to other agentic tasks (e.g., web navigation) discussed but not empirically tested

📊 Experiments & Results

Evaluation Setup

Automated repair of GitHub issues using agents

Benchmarks:

SWE-bench Verified (Real-world software issue resolution)

Metrics:

Node Count (NC)
Temporal Edge Count (TEC)
Loop Count (LC)
Average Loop Length (ALL)
Structural Edge Count (SEC)
Structural Breadth (SB)
Resolution Rate (Success %)
Statistical methodology: Not explicitly reported in the paper
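
The summary lists these metrics without giving formulas, so the sketch below computes four of them under stated assumptions: NC counts distinct actions, TEC counts distinct temporal edges, a loop is a temporal edge that returns to a node first seen no later than its source, and loop length is the number of nodes spanned in first-seen order. SEC and SB are omitted because they need structural information that a plain step list does not carry. The function name and input format are hypothetical.

```python
def process_metrics(steps):
    """steps: hashable action identifiers in execution order."""
    first_seen = {}
    order = []  # node index visited at each step
    for step in steps:
        order.append(first_seen.setdefault(step, len(first_seen)))

    temporal_edges = set(zip(order, order[1:]))
    # Assumed loop definition: an edge that points back to an earlier
    # (or the same) node in first-seen order.
    back_edges = [(u, v) for u, v in temporal_edges if v <= u]
    loop_lengths = [u - v + 1 for u, v in back_edges]

    return {
        "NC": len(first_seen),
        "TEC": len(temporal_edges),
        "LC": len(back_edges),
        "ALL": sum(loop_lengths) / len(loop_lengths) if loop_lengths else 0.0,
    }
```

For a step list A, B, C, B, C, D this yields four nodes, four distinct temporal edges, and one loop of length two (the return from C to B).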

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Online monitoring and intervention significantly improves agent performance on problematic instances.
SWE-bench Verified (Problematic Instances)	Resolution Rate Improvement (%)	0	11.9	+11.9
SWE-bench Verified (Consistent Failures)	Trajectory Repair Rate (%)	0	94.1	+94.1
Process metric analysis reveals that unresolved issues consistently exhibit more complex and inefficient graph structures.
SWE-bench Verified	Graph Complexity (Qualitative)	Lower	Higher	Positive

Experiment Figures

Linear trajectory traces of two agents solving django-10973, highlighting the difference between a thorough process and a lucky guess.

Main Takeaways

Complexity correlates with difficulty: As problem difficulty increases, agents produce larger graphs (more nodes/edges) and shift strategies more frequently.
Strategic adaptation: Agents often deviate from the 'golden' plan (Localization-Patching-Validation); successful agents on hard problems use iterative debugging loops, while failing ones get stuck in chaotic backtracking.
Efficiency gap: Even successful agents are often inefficient (e.g., redundant edits, skipping validation), which Graphectory detects but outcome metrics miss.
Intervention viability: Real-time graph analysis adds negligible overhead (<10ms) but can salvage failing trajectories by forcing strategy corrections.

📚 Prerequisite Knowledge

Prerequisites

Understanding of agentic workflows (ReAct)
Software engineering lifecycle (Localization, Patching, Validation)
Graph theory basics (nodes, edges, cycles)

Key Terms

Graphectory: A directed cyclic graph representing an agent's trajectory, where nodes are actions and edges represent temporal or structural relationships

Langutory: A human-readable string abstraction of Graphectory (e.g., 'L5P1V') representing the sequence of logical phases (Localization, Patching, Validation) and their lengths
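
Because a Langutory is a plain string, ordinary regular expressions are enough to mine it. The sketch below is illustrative only: it assumes a phase letter with no trailing count (like the final V in 'L5P1V') means a run of length 1, and the helper names and the "unvalidated patch" pattern are hypothetical.

```python
import re

TOKEN = re.compile(r"([LPV])(\d*)")


def parse_langutory(text):
    """'L5P1V' -> [('L', 5), ('P', 1), ('V', 1)]; a missing count is read as 1."""
    return [(phase, int(count) if count else 1) for phase, count in TOKEN.findall(text)]


def encode_langutory(phases):
    """Run-length encode phase letters, always writing the count explicitly."""
    runs = []
    for phase in phases:
        if runs and runs[-1][0] == phase:
            runs[-1][1] += 1
        else:
            runs.append([phase, 1])
    return "".join(f"{phase}{count}" for phase, count in runs)


def ends_unvalidated(text):
    """True if some Patching phase is never followed by a Validation phase."""
    transitions = "".join(phase for phase, _ in parse_langutory(text))
    return re.search(r"P(?!.*V)", transitions) is not None
```

With these helpers, a trajectory that localizes, patches, and stops (for example 'L4P2') is flagged by `ends_unvalidated`, while 'L5P1V' is not.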

SWE-agent: An open-source agentic system designed to solve GitHub issues using a command-line interface

OpenHands: An open-source platform for developing and evaluating software engineering agents

Phase Transition Sequence: The sequence of distinct logical phases an agent moves through (e.g., Localization -> Patching -> Validation)

Structural Edge: An edge in Graphectory connecting actions that operate on subsuming entities (e.g., directory -> file -> block), capturing navigation depth

Temporal Edge: An edge in Graphectory connecting actions in chronological order of execution

ReAct: Reasoning and Acting, a paradigm in which agents generate reasoning traces before executing actions

SWE-bench Verified: A benchmark dataset of resolved GitHub issues used to evaluate software engineering agents

Back Edge: A graph edge pointing to a previously visited node or phase, indicating loops, repetition, or backtracking in the agent's strategy
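
At the phase level, back edges can be found with a simple ordering check. The sketch below assumes the order Localization < Patching < Validation and treats any transition to an earlier phase as a back edge; the function name and input format are hypothetical.

```python
PHASE_ORDER = {"L": 0, "P": 1, "V": 2}


def phase_back_edges(transitions):
    """transitions: phase transition sequence such as "LPVPV".

    Returns the (from, to) pairs that move to an earlier phase.
    """
    return [
        (src, dst)
        for src, dst in zip(transitions, transitions[1:])
        if PHASE_ORDER[dst] < PHASE_ORDER[src]
    ]
```

A sequence like "LPVPV" contains one back edge, V to P, which corresponds to an agent returning to patching after a validation step.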