
TraceSIR: A Multi-Agent Framework for Structured Analysis and Reporting of Agentic Execution Traces

Shu-Xun Yang, Cunxiang Wang, Haoke Zhang, Wenbo Yu, Lindong Wu, Jiayi Gui, Dayong Yang, Yukuo Cen, Zhuoer Feng, Bosi Wen, Yidong Wang, Lucen Zhong, Jiamin Ren, Linfeng Zhang, Jie Tang
Beijing Institute of Technology, Zhipu AI, Tsinghua University, Shanghai Jiao Tong University
arXiv (2026)
Agent Benchmark

📝 Paper Summary

Agentic System Evaluation, Automated Debugging
TraceSIR automates the diagnosis of agentic systems by using three specialized agents to structure long execution traces, identify root causes, and generate aggregated analysis reports.
Core Problem
Manual inspection of agent execution traces does not scale because the traces are long and complex, while feeding raw traces to LLMs causes context overflow and hallucinations.
Why it matters:
  • Single tasks can generate thousands of tool invocations and tokens, making manual root cause analysis prohibitively difficult for humans
  • Outcome-based evaluation (pass/fail) discards critical behavioral data needed to fix logic errors or infinite loops
  • Existing automated methods struggle with context limits, often producing meaningless or hallucinated analysis when processing raw traces
Concrete Example: A coding agent might fail a task after 50 tool calls. An outcome-based metric reports only '0/1', while feeding the raw trace to an LLM might overflow its context window. TraceSIR structures the trace so the analysis can pinpoint that the agent hallucinated a file path at step 32.
Key Novelty
Multi-Agent Structured Trace Analysis
  • Introduces 'TraceFormat' to abstract raw execution logs into a structured, compressed representation (Thought, Action, Observation) that preserves causality while reducing token count
  • Decomposes analysis into three roles: StructureAgent (compressing traces), InsightAgent (diagnosing individual errors), and ReportAgent (aggregating patterns across multiple cases)
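To make the idea of a structured, compressed step record concrete, here is a minimal sketch. It is not the paper's TraceFormat schema: the `Step` fields beyond Thought, Action, and Observation, the `truncate` helper, the `structure_trace` function, the raw event keys, and the 200-character limit are all assumptions chosen for illustration. Simple truncation stands in for whatever compression StructureAgent actually performs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    """One hypothetical structured record: Thought, Action, Observation."""
    index: int        # position in the trace, kept so step order is preserved
    thought: str
    action: str
    observation: str


def truncate(text: str, limit: int) -> str:
    """Keep the head of a long field and note how many characters were dropped."""
    if len(text) <= limit:
        return text
    return text[:limit] + f" ...[{len(text) - limit} chars omitted]"


def structure_trace(raw_events: List[Dict[str, str]], limit: int = 200) -> List[Step]:
    """Map raw log events (assumed to be dicts) onto compact Step records."""
    steps = []
    for i, event in enumerate(raw_events, start=1):
        steps.append(Step(
            index=i,
            thought=truncate(event.get("thought", ""), limit),
            action=event.get("action", ""),  # actions are kept verbatim
            observation=truncate(event.get("observation", ""), limit),
        ))
    return steps
```

Keeping the step index and the action untouched while shortening only the bulky free-text fields is one simple way to reduce token count without losing the order of events.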
Architecture
Figure 2
The TraceSIR framework workflow: Input Traces -> StructureAgent -> InsightAgent -> ReportAgent -> Final Report.
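The workflow above can be sketched as three stages composed in sequence. This is only an illustration of the data flow, under the assumption that each agent can be modeled as a function: `run_pipeline` and the three `*_stub` functions are hypothetical names, and the stubs use trivial rules where the real agents would call an LLM.

```python
from collections import Counter
from typing import Callable, Dict, List

Trace = List[Dict[str, str]]


def run_pipeline(
    traces: List[Trace],
    structure: Callable[[Trace], Trace],
    insight: Callable[[Trace], str],
    report: Callable[[List[str]], Dict[str, int]],
) -> Dict[str, int]:
    """Input Traces -> structure -> insight (per trace) -> report (across traces)."""
    structured = [structure(t) for t in traces]
    insights = [insight(s) for s in structured]
    return report(insights)


def structure_stub(trace: Trace) -> Trace:
    """Stand-in for StructureAgent: keep the action, shorten the observation."""
    return [
        {"action": s.get("action", ""), "observation": s.get("observation", "")[:80]}
        for s in trace
    ]


def insight_stub(trace: Trace) -> str:
    """Stand-in for InsightAgent: label a trace by a crude keyword rule."""
    for step in trace:
        if "error" in step["observation"].lower():
            return "tool_error"
    return "no_error_found"


def report_stub(labels: List[str]) -> Dict[str, int]:
    """Stand-in for ReportAgent: count how often each label occurs."""
    return dict(Counter(labels))
```

Passing the stages in as callables keeps the per-trace diagnosis separate from the cross-trace aggregation, which mirrors the division of labor between InsightAgent and ReportAgent.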
Evaluation Highlights
  • +9.7% average improvement in report quality (human evaluation) compared to ClaudeCode across three benchmarks
  • +7.5% average improvement in report quality (LLM-as-a-judge) compared to ClaudeCode
  • +26.0% relative improvement in Agentic Coding scenarios when using GLM-5 as the backbone
Breakthrough Assessment
8/10
Addresses a critical bottleneck in agent development (debugging long traces) with a practical, open-source multi-agent framework. Strong empirical gains in report utility.