TraceSIR: A Multi-Agent Framework for Structured Analysis and Reporting of Agentic Execution Traces

📝 Paper Summary

Agentic System Evaluation Automated Debugging

TraceSIR automates the diagnosis of agentic systems by using three specialized agents to structure long execution traces, identify root causes, and generate aggregated analysis reports.

Core Problem

Manual inspection of agent execution traces is unscalable due to length and complexity, while feeding raw traces to LLMs causes context overflow and hallucinations.

Why it matters:

Single tasks can generate thousands of tool invocations and tokens, making manual root cause analysis prohibitively difficult for humans
Outcome-based evaluation (pass/fail) discards critical behavioral data needed to fix logic errors or infinite loops
Existing automated methods struggle with context limits, often producing meaningless or hallucinatory analysis when processing raw traces

Concrete Example: A coding agent might fail a task after 50 tool calls. An outcome-based metric just reports '0/1', while a raw-trace analysis might crash the context window. TraceSIR structures the trace to pinpoint that the agent hallucinated a file path at step 32.

Key Novelty

Multi-Agent Structured Trace Analysis

Introduces 'TraceFormat' to abstract raw execution logs into a structured, compressed representation (Thought, Action, Observation) that preserves causality while reducing token count
Decomposes analysis into three roles: StructureAgent (compressing traces), InsightAgent (diagnosing individual errors), and ReportAgent (aggregating patterns across multiple cases)

Architecture

The TraceSIR framework workflow: Input Traces -> StructureAgent -> InsightAgent -> ReportAgent -> Final Report.

Evaluation Highlights

+9.7% average improvement in report quality (human evaluation) compared to ClaudeCode across three benchmarks
+7.5% average improvement in report quality (LLM-as-a-judge) compared to ClaudeCode
+26.0% relative improvement in Agentic Coding scenarios when using GLM-5 as the backbone

Breakthrough Assessment

8/10

Addresses a critical bottleneck in agent development (debugging long traces) with a practical, open-source multi-agent framework. Strong empirical gains in report utility.

⚙️ Technical Details

Problem Definition

Setting: Automated generation of diagnostic reports from sets of failed agent execution traces

Inputs: A collection of execution traces (JSON/ZIP), each containing message sequences M = {m1, ..., mK}

Outputs: A comprehensive markdown analysis report identifying errors, root causes, and optimization suggestions

Pipeline Flow

StructureAgent (Parses and compresses raw traces)
InsightAgent (Analyzes individual structured traces for errors)
ReportAgent (Aggregates insights and writes final report)

System Modules

StructureAgent

Compress raw execution traces into TraceFormat to fit context limits

Model or implementation: GLM-5 or Claude-4.6 (Backbone)

InsightAgent

Perform fine-grained diagnosis on a single trace instance

Model or implementation: GLM-5 or Claude-4.6 (Backbone)

ReportAgent

Synthesize diagnostics across multiple cases into a coherent report

Model or implementation: GLM-5 or Claude-4.6 (Backbone)

Novel Architectural Elements

Separation of concerns between structural compression (StructureAgent) and semantic reasoning (InsightAgent) to handle long context
TraceFormat abstraction layer that aligns Thought-Action-Observation for machine readability

Modeling

Base Model: Evaluated with GLM-5 and Claude-4.6 backbones

Limitations

Dependency on the underlying LLM's reasoning capability for root cause analysis
Evaluation focuses on report quality/utility rather than direct improvement of the diagnosed agents
Requires access to full execution traces (intermediate messages), which may not be available for black-box APIs

Reproducibility

Code: https://github.com/SHU-XUN/TraceSIR

📊 Experiments & Results

Evaluation Setup

Report generation quality assessment using both human experts and LLM-as-a-judge

Benchmarks:

TraceBench (Agentic Failure Analysis) [New]

Metrics:

ReportEval Score (0-100)
Overall Structure (OS)
Error Analysis (EA)
Root Cause Analysis (RCA)
Optimization Analysis (OA)
Overall Impact (OI)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Human evaluation results showing TraceSIR's improvement over ClaudeCode across three domain scenarios.
TraceBench (Deep Research)	ReportEval Score	Normalized Base	Base + 10.0%	+10.0%
TraceBench (Function Calling)	ReportEval Score	Normalized Base	Base + 13.0%	+13.0%
TraceBench (Agentic Coding)	ReportEval Score	Normalized Base	Base + 5.0%	+5.0%
LLM-as-a-judge evaluation results showing consistent trends with human evaluation.
TraceBench (Agentic Coding)	ReportEval Score	Normalized Base	Base + 26.0%	+26.0%
TraceBench (Average)	ReportEval Score	Normalized Base	Base + 7.5%	+7.5%

Experiment Figures

Comparison of analysis on raw vs. structured traces, highlighting hallucination in raw trace analysis.

Main Takeaways

TraceSIR consistently produces more coherent and actionable reports than ClaudeCode, validated by both human experts (+9.7%) and LLM judges (+7.5%)
The framework shows robust performance across different backbones (GLM-5 and Claude-4.6), improving diagnostic quality even with weaker models
Largest gains are observed in complex settings like Agentic Coding, suggesting structured trace analysis is particularly beneficial for logic-heavy tasks
The multi-agent approach effectively handles long contexts that typically cause hallucinations in single-model analysis

📚 Prerequisite Knowledge

Prerequisites

Understanding of Agentic Systems (LLMs + Tools)
Familiarity with OpenAI message format (role/content)
Context Window limitations in LLMs

Key Terms

TraceFormat: A structured abstraction introduced by this paper that parses raw messages into Thought, Action, and Observation columns to reduce redundancy

RCA: Root Cause Analysis—identifying the fundamental reason for a failure rather than just the symptom

LLM-as-a-judge: Using a strong Language Model to evaluate the quality of outputs (in this case, analysis reports) instead of human annotators

Agentic Coding: Agents that write and execute code to solve software engineering tasks (e.g., SWE-bench)

Context Window: The maximum amount of text (tokens) an LLM can process at once; exceeding it leads to errors or forgetting