AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?

📝 Paper Summary

Multi-agent, Self-evolving, Agentic reasoning

AgenTracer is an automated framework that trains a lightweight model to pinpoint the specific agent and step responsible for multi-agent system failures using counterfactual replay and fault injection.

Core Problem

Multi-agent systems are fragile and prone to failure, but identifying exactly which agent or step caused an error in long, verbose trajectories is difficult for current LLMs.

Why it matters:

System debugging is currently a manual, labor-intensive process due to the complexity of multi-agent interactions and tool invocations
State-of-the-art reasoning models such as DeepSeek-R1 and GPT-4 perform poorly at this task (step-level accuracy is often below 10%), which blocks automated self-correction
Existing failure attribution benchmarks are too small (approx. 200 samples), limiting systematic evaluation and improvement

Concrete Example: In a document analysis task where the final answer is 'North', Qwen3-8B blames a Coder agent for a script error at Step 6. However, the true root cause was the Web Surfer agent downloading an outdated file at Step 2, which only caused the script to crash later. AgenTracer correctly identifies Step 2.

Key Novelty

AgenTracer (Automated Failure Attribution Framework)

Systematically replaces agent actions with oracle guidance (counterfactual replay) to find the exact step where failure becomes inevitable
Synthetically generates training data by corrupting successful trajectories (fault injection), creating pairs of 'failed trajectory' and 'known root cause'
Trains a lightweight model (AgenTracer-8B) using Reinforcement Learning with a multi-granular reward that scores both agent identification and temporal proximity to the error step
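
The counterfactual replay idea can be illustrated with a minimal sketch. It is not the paper's implementation: `Step`, `find_decisive_error`, the toy oracle, and the toy replay function are hypothetical names introduced here, and the toy run loosely mirrors the document analysis example above, using 0-based step indices and invented `Orchestrator` and `FileSurfer` agents. A real pipeline would re-simulate the multi-agent system after each oracle substitution instead of calling a hand-written predicate.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Step:
    agent: str   # which agent acted at this step
    action: str  # what it did (message, tool call, ...)


def find_decisive_error(
    trajectory: Sequence[Step],
    oracle_fix: Callable[[Sequence[Step], int], Step],
    replay_succeeds: Callable[[Sequence[Step]], bool],
) -> Optional[Tuple[str, int]]:
    """Return (agent, step index) of the earliest step whose correction
    turns the failed run into a success, or None if no single fix works."""
    for t in range(len(trajectory)):
        # Keep steps 0..t-1, swap in the oracle's action at step t, then
        # let the (hypothetical) simulator re-run everything after it.
        patched_prefix = list(trajectory[:t]) + [oracle_fix(trajectory, t)]
        if replay_succeeds(patched_prefix):
            return trajectory[t].agent, t
    return None


# --- Toy stand-ins for the oracle and the simulator (assumptions) ---
FAULT_STEP = 2
GOOD_ACTION = "download current file"


def toy_oracle_fix(trajectory: Sequence[Step], t: int) -> Step:
    # The toy oracle only knows how to repair the faulty download.
    if t == FAULT_STEP:
        return Step(trajectory[t].agent, GOOD_ACTION)
    return trajectory[t]


def toy_replay_succeeds(prefix: Sequence[Step]) -> bool:
    # Re-simulation repeats the original behaviour, so the run succeeds
    # only once the prefix already contains the corrected download.
    return len(prefix) > FAULT_STEP and prefix[FAULT_STEP].action == GOOD_ACTION


failed_run = [
    Step("Orchestrator", "plan task"),
    Step("WebSurfer", "search for report"),
    Step("WebSurfer", "download outdated file"),
    Step("FileSurfer", "read file"),
    Step("Coder", "write analysis script"),
    Step("Orchestrator", "request execution"),
    Step("Coder", "script crashes"),
]

decisive = find_decisive_error(failed_run, toy_oracle_fix, toy_replay_succeeds)
print(decisive)  # ('WebSurfer', 2)
```

Scanning from the first step forward is what makes the result the earliest sufficient correction, which is why the crash at the final step is never blamed in this toy run.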

Architecture

The complete pipeline: from collecting successful/failed trajectories, to annotation via counterfactual replay (for failures) and fault injection (for successes), to training AgenTracer-8B via RL.
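
The fault-injection half of the pipeline can be sketched in the same spirit. Everything below is an illustrative assumption rather than the paper's code: a step is reduced to an `(agent, value)` pair, "success" means the values sum to a target, and `inject_fault`, `toy_corrupt`, and `toy_succeeds` are hypothetical helpers. The point is only the labeling logic: corrupt one step of a successful run, keep the sample if the outcome flips to failure, and record the corrupted step as the known root cause.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Step = Tuple[str, int]  # (agent name, value the agent contributes)


def inject_fault(
    trajectory: Sequence[Step],
    corrupt: Callable[[Step], Step],
    replay_succeeds: Callable[[Sequence[Step]], bool],
    rng: random.Random,
    max_tries: int = 10,
) -> Optional[Tuple[List[Step], str, int]]:
    """Corrupt one step of a successful run.

    Returns (failed_run, responsible_agent, step_index), or None if no
    sampled corruption made the run fail within max_tries attempts."""
    for _ in range(max_tries):
        t = rng.randrange(len(trajectory))
        mutated = list(trajectory)
        mutated[t] = corrupt(mutated[t])
        # Keep the sample only if the corruption actually flips the outcome.
        if not replay_succeeds(mutated):
            return mutated, trajectory[t][0], t
    return None


# --- Toy environment (assumption): success means the values sum to 10 ---
TARGET = 10


def toy_succeeds(steps: Sequence[Step]) -> bool:
    return sum(value for _, value in steps) == TARGET


def toy_corrupt(step: Step) -> Step:
    agent, value = step
    return agent, value + 1  # an off-by-one style perturbation


successful_run = [("Planner", 2), ("Solver", 5), ("Checker", 3)]
sample = inject_fault(successful_run, toy_corrupt, toy_succeeds, random.Random(0))
```

Because the corrupted position is chosen by the generator, the `(agent, step)` label comes for free, which is what makes this route cheaper than annotating real failures.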

Evaluation Highlights

Outperforms giant proprietary models like Gemini-2.5-Pro (+18.18%) and Claude-4-Sonnet (+12.21%) on the Who&When benchmark
Boosts the performance of existing multi-agent systems (e.g., MetaGPT, MaAS) by 4.8% to 14.2% when used to provide corrective feedback
Achieves 69.62% agent-level accuracy on automated benchmarks compared to 58.73% for the base Qwen3-8B model

Breakthrough Assessment

8/10

Significant advancement in automated debugging for agents. The method addresses the 'credit assignment' problem in agentic systems and enables self-correcting loops in settings where earlier critique methods (Self-Refine, CRITIC) failed.

⚙️ Technical Details

Problem Definition

Setting: Given a failed multi-agent trajectory τ consisting of states and actions from N agents, identify the decisive error pair (i*, t*)

Inputs: A failed trajectory log τ (states, actions, tool outputs)

Outputs: The failure-responsible agent i* and the decisive error step t*

Pipeline Flow

Input Trajectory Log -> AgenTracer-8B -> Reasoning Trace -> Output (Agent ID | Step ID)
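
The last stage of this flow, turning the model's text into an (agent, step) pair, might look like the sketch below. The `<think>...</think><answer>...</answer>` wrapper is the format described in this summary; the assumption that the answer body is literally `agent | step` with a non-negative integer step, and the name `parse_attribution`, are hypothetical choices made for illustration.

```python
import re
from typing import Optional, Tuple

# Whole-output pattern: a reasoning trace followed by a single answer block.
_OUTPUT_RE = re.compile(
    r"\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*",
    re.DOTALL,
)


def parse_attribution(output: str) -> Optional[Tuple[str, int]]:
    """Return (agent_id, step_id), or None if the output is malformed."""
    match = _OUTPUT_RE.fullmatch(output)
    if match is None:
        return None
    # Assumed answer layout: "<agent name> | <step number>".
    agent, separator, step_text = match.group("answer").partition("|")
    agent, step_text = agent.strip(), step_text.strip()
    if not separator or not agent:
        return None
    if re.fullmatch(r"[0-9]+", step_text) is None:
        return None
    return agent, int(step_text)


example = "<think>The file was stale before the script ran.</think><answer>WebSurfer | 2</answer>"
print(parse_attribution(example))  # ('WebSurfer', 2)
```

Returning `None` for malformed outputs keeps the format check and the attribution check separate, so a caller can penalize bad formatting without guessing at a prediction.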

System Modules

AgenTracer-8B

Analyze the trajectory to locate the error

Model or implementation: Qwen3-8B fine-tuned with RL

Novel Architectural Elements

Integration of counterfactual replay logic into the data generation pipeline rather than the inference model itself
Multi-granular reward function combining discrete agent-level accuracy with continuous Gaussian-kernel step-level accuracy

Modeling

Base Model: Qwen3-8B

Training Method: Group Relative Policy Optimization (GRPO) with online RL
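
For readers unfamiliar with GRPO, the group-relative part can be sketched as follows. This uses the commonly described formulation, in which each rollout's reward is standardized against the other rollouts sampled for the same prompt; whether AgenTracer applies exactly this normalization is not stated in this summary, and the function name and `eps` constant are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of 8 rollouts, with made-up rewards in [0, 1].
group_rewards = [1.0, 0.8, 0.5, 0.0, 1.0, 0.8, 0.5, 0.0]
advantages = group_relative_advantages(group_rewards)
```

Rollouts that score above the group mean receive positive advantages and the rest negative ones, so no separate value network is needed to provide a baseline.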

Objective Functions:

Purpose: Optimize policy to maximize multi-granular reward.

Formally: L_RL = -E[min(ρ * A, clip(ρ, 1-B, 1+B) * A)], where A is the advantage, ρ is the policy ratio, and B is the clipping parameter.

Purpose: Incentivize correct formatting.

Formally: I_format = 1 if output follows <think>...</think><answer>...</answer> structure, else 0.

Purpose: Reward correct agent identification.

Formally: r_agent = 1 if predicted agent matches ground truth, else 0.

Purpose: Reward temporal proximity to error step.

Formally: r_step = exp(-(predicted_step - true_step)^2 / (2σ^2)), where σ controls how quickly the reward decays with distance from the true step.
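
The terms above can be put together in a short sketch. The individual pieces (`r_agent`, the Gaussian `r_step`, and the clipped surrogate) follow the formulas listed here; how they are combined is an assumption, since this summary does not give the exact expression. The sketch gates the reward on the format indicator and mixes the two accuracy terms with a weight `lam`, using `lam = 0.5` and `sigma = 1.0` as illustrative defaults. All function names are hypothetical.

```python
import math


def agent_reward(pred_agent: str, true_agent: str) -> float:
    """r_agent: 1 if the predicted agent matches the ground truth, else 0."""
    return 1.0 if pred_agent == true_agent else 0.0


def step_reward(pred_step: int, true_step: int, sigma: float = 1.0) -> float:
    """r_step: Gaussian kernel over the distance to the true error step."""
    return math.exp(-((pred_step - true_step) ** 2) / (2.0 * sigma ** 2))


def total_reward(
    format_ok: bool,
    pred_agent: str,
    pred_step: int,
    true_agent: str,
    true_step: int,
    lam: float = 0.5,
    sigma: float = 1.0,
) -> float:
    """Assumed combination: I_format * (lam * r_agent + (1 - lam) * r_step)."""
    if not format_ok:
        return 0.0
    return lam * agent_reward(pred_agent, true_agent) + (1.0 - lam) * step_reward(
        pred_step, true_step, sigma
    )


def clipped_surrogate(ratio: float, advantage: float, b: float) -> float:
    """Per-sample term of L_RL: -min(rho * A, clip(rho, 1-b, 1+b) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - b), 1.0 + b)
    return -min(ratio * advantage, clipped_ratio * advantage)


exact = total_reward(True, "WebSurfer", 2, "WebSurfer", 2)
near_miss = total_reward(True, "WebSurfer", 3, "WebSurfer", 2)
wrong_agent_far = total_reward(True, "Coder", 6, "WebSurfer", 2)
```

Under these assumptions a prediction that names the right agent but is one step off still earns most of the reward, while blaming the wrong agent four steps away earns almost none, which is the graded signal a purely binary reward would not provide.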

Training Data:

TracerTraj-2.5K: 2,000+ annotated trajectories
Includes real failures annotated via counterfactual replay
Includes synthetic failures created via programmatic fault injection on successful runs
Covering 6 frameworks (MetaGPT, AutoGen, etc.) and 6 datasets (GAIA, GSM8K, etc.)

Key Hyperparameters:

batch_size: 32
learning_rate: 1e-6
rollout_number: 8
lambda_reward_weight: 0.5
sigma_reward_decay: 1
dynamic_clipping_parameter_Bs: Decays based on training steps

Compute: Trained on 8 NVIDIA H100 GPUs

Comparison to Prior Work

vs. MAST/Who&When: AgenTracer provides automated training data generation at scale (2.5k vs 200 samples) and trains a dedicated model
vs. DeepSeek-R1/GPT-4: Uses specialized RL fine-tuning for attribution rather than zero-shot prompting, achieving higher accuracy with smaller parameter count
vs. CollabUIAgents [not cited in paper]: Implicitly achieves credit assignment via attribution rather than binary scalar rewards

Limitations

Evaluation relies heavily on the quality of the 'Oracle' used for counterfactual replay (DeepSeek-R1)
Requires re-simulation of trajectories for data generation, which can be computationally expensive
Performance depends on the diversity of the underlying multi-agent frameworks used in training data

Reproducibility

Code: https://bingreeky.github.io/atracer/

Project page at https://bingreeky.github.io/atracer/ contains code and data. The paper details the specific baselines, RL platform (verl), and data collection methods.

📊 Experiments & Results

Evaluation Setup

Attribute the failure (Agent and Step) in a provided trajectory log

Benchmarks:

Who&When (handcrafted): failure attribution on Magentic-One based systems
Who&When (automated): failure attribution on AG2 based systems
TracerTraj (test split): failure attribution across Code, Math, and Agentic domains [New]

Metrics:

Agent-level Accuracy (identifying faulty agent)
Step-level Accuracy (identifying faulty step)
Statistical methodology: Not explicitly reported in the paper
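
As a concrete reading of the two metrics, the sketch below computes both as percentages over a list of predictions. It assumes that step-level accuracy counts an exact match on the step index regardless of the predicted agent; the summary does not spell this out, and the function name and sample data are hypothetical.

```python
from typing import Sequence, Tuple

Attribution = Tuple[str, int]  # (agent id, step index)


def attribution_accuracy(
    predictions: Sequence[Attribution],
    ground_truth: Sequence[Attribution],
) -> Tuple[float, float]:
    """Return (agent-level accuracy, step-level accuracy) in percent."""
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("need equally sized, non-empty prediction and label lists")
    n = len(ground_truth)
    agent_hits = sum(p[0] == g[0] for p, g in zip(predictions, ground_truth))
    # Assumption: a step counts as correct only on an exact index match.
    step_hits = sum(p[1] == g[1] for p, g in zip(predictions, ground_truth))
    return 100.0 * agent_hits / n, 100.0 * step_hits / n


preds = [("WebSurfer", 2), ("Coder", 6), ("Coder", 4)]
golds = [("WebSurfer", 2), ("WebSurfer", 2), ("Coder", 5)]
agent_acc, step_acc = attribution_accuracy(preds, golds)
```

In this made-up sample two of three agents and one of three steps are correct, which matches the pattern in the reported results where step-level accuracy sits well below agent-level accuracy.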

Key Results

Comparative analysis on the Who&When benchmark demonstrates AgenTracer-8B's superiority over much larger models in both agent identification and step localization.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ
Who&When (handcrafted)	Agent-level Accuracy (w/ Ground Truth)	56.90	69.10	+12.20
Who&When (handcrafted)	Step-level Accuracy (w/ Ground Truth)	17.24	20.68	+3.44
Who&When (automated)	Step-level Accuracy (w/o Ground Truth)	29.52	37.30	+7.78

Evaluation on the internal TracerTraj test set across different domains (Code, Math, Agentic).

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ
TracerTraj-Agentic	Agent-level Accuracy (w/ Ground Truth)	37.16	53.28	+16.12
MaAS + MATH-500	Success Rate Improvement	Not reported in the paper	Not reported in the paper	+14.21

Experiment Figures

Bar charts showing performance improvement of downstream Multi-Agent Systems (MaAS, OWL, MetaGPT) when using AgenTracer feedback vs. baselines (Self-Refine, CRITIC).

A case study comparing failure attribution by Qwen3-8B, Claude-4-Sonnet, and AgenTracer-8B on a document analysis task.

Main Takeaways

Standard reasoning models (DeepSeek-R1, GPT-4) perform poorly at failure attribution (<10% step accuracy in many cases), often getting confused by long contexts.
Providing the ground-truth answer (w/ G) does not always help the baselines and sometimes lowers their accuracy, apparently by confusing the model.
AgenTracer-8B generalizes well even without access to the ground truth solution, making it viable for real-time debugging.
Feedback from AgenTracer enables self-correction in existing multi-agent systems (MetaGPT, OWL), yielding performance gains up to 14% where standard 'Self-Refine' fails.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM-based multi-agent systems (roles, tool use)
Reinforcement Learning basics (policy optimization, rewards)
Concept of counterfactual reasoning

Key Terms

Failure Attribution: The process of identifying which specific component (agent) and action (step) caused a system failure

Counterfactual Replay: A method of finding errors by replacing an action with a correct 'oracle' action and seeing if the system succeeds; if it does, the replaced action was the error

Decisive Error: The earliest action in a trajectory whose correction is sufficient to steer the system from failure to success

Programmatic Fault Injection: Creating synthetic training data by taking a successful trajectory and intentionally corrupting one step to cause a failure

GRPO: Group Relative Policy Optimization, an RL algorithm that scores each sampled output relative to the other outputs in its group rather than against a separately learned value baseline

Oracle Rectification Operator: An idealized process where an original action is replaced by a theoretically perfect action to test causal links to failure

Multi-granular Reward: A scoring function that evaluates performance at different levels of detail (e.g., broad agent identification vs. precise step localization)