
Interpreting Agentic Systems: Beyond Model Explanations to System-Level Accountability

Judy Zhu, Dhari Gandhi, Himanshu Joshi, Ahmad Rezaie Mianroodi, Sedef Akinli Kocak, Dhanesh Ramachandran
Vector Institute for Artificial Intelligence; University of Texas at Austin; Dalhousie University
arXiv (2026)
Agent Reasoning Memory

📝 Paper Summary

Agentic AI, AI Safety, Mechanistic Interpretability
Traditional interpretability methods fail for agentic systems because they cannot capture temporal dynamics and inter-agent dependencies, necessitating new system-level oversight mechanisms embedded across the agent lifecycle.
Core Problem
Existing interpretability techniques (like SHAP or LIME) were designed for static, single-model predictions and fail to explain the dynamic, compounding decisions of autonomous multi-agent systems.
Why it matters:
  • Agentic systems are entering safety-critical domains (healthcare, finance) where opacity can lead to undetectable cascading errors
  • Autonomy allows agents to modify decision boundaries through interaction, making static 'black box' explanations obsolete
  • Lack of visibility into inter-agent communication and memory usage makes it impossible to trace the root cause of goal misalignment or infinite loops
Concrete Example: SHAP values assume input features are substitutable to calculate marginal contributions. In an agentic system, if the 'Perception' module is removed, the 'Reasoning' module cannot function at all. Because these components are functionally irreplaceable (complementary) rather than additive, SHAP cannot meaningfully attribute credit for a system failure.
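The complementarity argument can be illustrated with a toy calculation. The sketch below is not from the paper: it computes exact Shapley values (the quantity SHAP approximates) for a hypothetical two-module pipeline in which the system succeeds only when both modules are present. The names `shapley_values` and `pipeline_works` and the 0/1 payoff are assumptions chosen for illustration.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal
    contribution over every ordering in which players can join."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: totals[p] / len(orderings) for p in players}


def pipeline_works(coalition):
    # Hypothetical toy system: it succeeds (payoff 1.0) only when BOTH
    # modules are present; either module alone contributes nothing.
    return 1.0 if {"perception", "reasoning"} <= coalition else 0.0


phi = shapley_values(["perception", "reasoning"], pipeline_works)
print(phi)  # {'perception': 0.5, 'reasoning': 0.5}
```

Under these assumptions each module receives exactly half the credit, purely by symmetry. The split would be identical no matter which module actually caused a failure, which is one way to read the claim that additive attribution says little about strictly complementary components.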
Key Novelty
Shift from Model-Centric to System-Centric Interpretability
  • Redefines the unit of analysis from individual model weights to the emergent behavior of interacting agents (perception, reasoning, memory, orchestration)
  • Identifies that opacity in agentic systems is a first-order failure mode caused by temporal uncertainty and distributed planning, not just model complexity
Architecture
Figure 1: Comparison of 'AI Agent' architecture vs. 'Agentic System' architecture
Breakthrough Assessment
7/10
A strong foundational position paper that clearly articulates why current interpretability methods are mathematically and conceptually ill-suited for agentic workflows, though it does not yet propose a specific algorithmic solution.