Exploring LLM-Based Agents for Root Cause Analysis

📝 Paper Summary

Agentic RAG pipeline | Multi-call tool use with flexible plan

This paper evaluates a ReAct-based LLM agent for cloud incident root cause analysis, demonstrating that agents can perform competitively without fine-tuning by dynamically retrieving historical incidents and diagnostic data.

Core Problem

Automating Root Cause Analysis (RCA) is difficult because a static LLM cannot dynamically query the external diagnostic data (logs, metrics) or the specialized team knowledge needed to diagnose complex cloud incidents.

Why it matters:

Cloud incident management is labor-intensive; manual RCA requires deep domain expertise and significant time from On-Call Engineers (OCEs).
Prior automated methods (fine-tuned LLMs or classification-based Copilots) lack the agency to fetch new, real-time diagnostic information, limiting their ability to solve novel incidents.

Concrete Example: An incident involving a service failure might require checking specific logs to find a 'connection refused' error. A standard LLM sees only the vague incident report, whereas an agent can decide to query the log database, find the error, and deduce the root cause.

Key Novelty

ReAct Agent for Zero-Shot RCA

Adapts the ReAct (Reasoning + Acting) framework to the RCA domain, allowing an LLM to interleave reasoning thoughts with tool execution (e.g., retrieving historical incidents or querying logs).
Evaluates the agent in a 'restricted' setting (static dataset) to establish baselines and a 'case study' setting (real production environment) to demonstrate the value of dynamic tool usage.

Architecture

A sample trajectory of the ReAct agent performing RCA.

Evaluation Highlights

ReAct agent achieves 64.3% top-5 accuracy on the standard evaluation set, outperforming the RAG baseline (60.6%) while using significantly fewer retrieval tokens.
The agent also needs fewer retrieved documents: ReAct uses an average of 5.7 documents versus 10 for the baselines, yet achieves higher accuracy.
In a qualitative analysis, ReAct effectively utilized specific retrieval queries (e.g., searching for specific error codes) to narrow down root causes where standard similarity search failed.

Breakthrough Assessment

7/10

Solid empirical evaluation of agents in a high-value industrial domain. While not proposing a new architecture, it provides the first rigorous study of ReAct for RCA without fine-tuning, highlighting practical deployment challenges.

⚙️ Technical Details

Problem Definition

Setting: Given an incident report I (title, description, metadata), identify the specific root cause R from a set of possible causes or by generating a diagnosis.

Inputs: Incident Report containing title, description, and potentially initial logs/stacktraces.

Outputs: A generated root cause diagnosis or identification of the correct root cause.

Pipeline Flow

Incident Trigger -> LLM Planner (ReAct Loop) -> Tool Selection -> Observation -> Final Diagnosis
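
The flow above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the Step type, the planner signature, the tool name search_incidents, and the sample strings are all hypothetical, and a scripted stand-in replaces the real LLM call so the example runs on its own. The default cap of 20 mirrors the iteration limit reported in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    thought: str
    action: str            # tool name, or "finish"
    action_input: str
    observation: str = ""


# Hypothetical planner signature: (incident, history) -> (thought, action, action_input)
Planner = Callable[[str, List[Step]], Tuple[str, str, str]]


def react_loop(incident: str, planner: Planner,
               tools: Dict[str, Callable[[str], str]],
               max_iterations: int = 20) -> Tuple[Optional[str], List[Step]]:
    """Interleave planner thoughts with tool calls until 'finish' or the cap."""
    history: List[Step] = []
    for _ in range(max_iterations):
        thought, action, action_input = planner(incident, history)
        step = Step(thought, action, action_input)
        history.append(step)
        if action == "finish":
            return action_input, history        # final diagnosis
        tool = tools.get(action)
        if tool is None:
            # Feed the error back so the planner can recover on the next turn.
            step.observation = f"Unknown tool: {action}"
        else:
            step.observation = tool(action_input)
    return None, history                         # cap reached without a diagnosis


def scripted_planner(incident: str, history: List[Step]) -> Tuple[str, str, str]:
    """Stand-in for an LLM: search once, then answer from the observation."""
    if not history:
        return ("I should look for similar past incidents.",
                "search_incidents", "connection refused")
    return ("The past incident matches.", "finish", history[-1].observation)


tools = {
    "search_incidents": lambda q: f"Past incident for '{q}': backend port closed after deploy",
}
diagnosis, trajectory = react_loop("Service X failing health checks", scripted_planner, tools)
print(diagnosis)
```

Returning the full trajectory alongside the diagnosis keeps each thought, action, and observation available for inspection, which is how sample trajectories like the one in the Architecture section are typically presented.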

System Modules

LLM Planner

Orchestrates the diagnosis process by generating thoughts and selecting tools based on the incident context.

Model or implementation: GPT-4 (inferred from the 'strong baselines' context and the Microsoft affiliation; the source excerpt does not state a specific model version)

Historical Incident Retriever (ReAct BR) (Tools)

Retrieves similar past incidents based on a query generated by the planner.

Model or implementation: SentenceTransformer (for retrieval)
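
A retrieval tool of this kind ranks historical incidents by similarity to the planner's query. The sketch below is only an approximation of that idea: it swaps in a toy bag-of-words cosine similarity for the SentenceTransformer embeddings so that it runs without extra dependencies, and the incident records and function names are invented for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts (stand-in for a dense sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, incidents: Dict[str, str], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k historical incidents most similar to the planner's query."""
    q = embed(query)
    scored = [(incident_id, cosine(q, embed(text)))
              for incident_id, text in incidents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Invented records, for illustration only.
past_incidents = {
    "INC-1": "connection refused from storage backend after certificate rotation",
    "INC-2": "high latency in search frontend caused by cache eviction",
    "INC-3": "disk full on logging node",
}
top = retrieve("connection refused error storage", past_incidents, k=2)
print(top[0][0])  # INC-1
```

In an agent setting the query string comes from the planner rather than from the raw incident text, which is what lets the agent refine its search between iterations.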

Incident QA Tool (Tools)

Allows the agent to ask specific questions about the raw incident description (which might be long or messy).

Model or implementation: LLM-based QA

Novel Architectural Elements

Applies the ReAct loop specifically to the RCA domain, enabling iterative query refinement for historical incident retrieval, unlike standard one-shot RAG.

Modeling

Base Model: GPT-4 (inferred from the Microsoft/OpenAI context and performance levels; the source refers only to 'LLMs' generally)

Training Method: Zero-shot prompting with ReAct framework

Compute: The ReAct loop is capped at a maximum of 20 iterations due to time and resource constraints.

Comparison to Prior Work

vs. Ahmed et al.: ReAct uses zero-shot with tools instead of fine-tuning; can query dynamically.
vs. RCACopilot: ReAct is autonomous in planning diagnostic steps rather than following hard-coded workflows or predefined handlers.

Limitations

Evaluation is primarily on a static dataset where the agent cannot query live systems, so the reported results are a simulated lower bound on agent performance.
Real-world tool construction requires significant engineering effort to expose diverse diagnostic services safely.
Exposing production logs to LLMs raises privacy and security concerns.
Zero-shot reasoning can sometimes lead to loops or hallucinated tool usage without careful constraints.
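
The last limitation suggests simple guards around each proposed action. The following is a hedged sketch of two such checks, rejecting unregistered tools and flagging repeated identical calls. Neither check is described in the source; the function name, tool names, and the max_repeats threshold are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Action = Tuple[str, str]  # (tool name, tool input)


def check_action(action: Action, known_tools: Set[str],
                 past_actions: List[Action], max_repeats: int = 2) -> str:
    """Return 'ok', 'unknown_tool', or 'loop' for a proposed agent action."""
    tool_name, _ = action
    if tool_name not in known_tools:
        return "unknown_tool"   # the model named a tool that was never registered
    if past_actions.count(action) >= max_repeats:
        return "loop"           # same tool and input already tried max_repeats times
    return "ok"


known = {"search_incidents", "incident_qa"}
past = [("search_incidents", "timeout"), ("search_incidents", "timeout")]

print(check_action(("search_incidents", "timeout"), known, past))   # loop
print(check_action(("query_logs", "errors"), known, past))          # unknown_tool
print(check_action(("incident_qa", "which region?"), known, past))  # ok
```

A caller could turn a non-ok result into an observation that is fed back to the planner, rather than aborting the run.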

📊 Experiments & Results

Evaluation Setup

Evaluation on an out-of-distribution dataset of production incidents from a large IT corporation (Microsoft).

Benchmarks:

Internal Incident Dataset (Root Cause Analysis (Retrieval/Diagnosis)) [New]

Metrics:

Acc@k (Top-k Accuracy)
Retrieval Precision/Recall
Statistical methodology: Not explicitly reported in the paper
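
Acc@k counts an example as correct when the true root cause appears among the top k ranked candidates. A minimal sketch of the metric, assuming each example comes with a ranked list of candidate labels; the labels below are invented, and the paper's exact matching protocol may differ.

```python
from typing import Sequence


def acc_at_k(ranked_predictions: Sequence[Sequence[str]],
             gold: Sequence[str], k: int) -> float:
    """Fraction of examples whose gold label is within the first k predictions."""
    if len(ranked_predictions) != len(gold):
        raise ValueError("predictions and gold labels must align")
    if not gold:
        return 0.0
    hits = sum(1 for preds, g in zip(ranked_predictions, gold) if g in preds[:k])
    return hits / len(gold)


# Invented rankings for four incidents.
preds = [
    ["dns", "cert", "disk"],
    ["disk", "dns", "cert"],
    ["cert", "disk", "dns"],
    ["dns", "disk", "cert"],
]
gold = ["dns", "dns", "dns", "cert"]

print(acc_at_k(preds, gold, 1))  # 0.25
print(acc_at_k(preds, gold, 2))  # 0.5
print(acc_at_k(preds, gold, 3))  # 1.0
```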

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Internal Incident Dataset	Acc@5	60.6	64.3	+3.7
Internal Incident Dataset	Acc@1	28.5	32.1	+3.6
Internal Incident Dataset	Average Documents Retrieved	10	5.7	-4.3
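
The Δ column is the difference between the paper's result and the baseline. The short check below recomputes it from the table values and adds the relative change; the relative figures are derived here for convenience and are not reported in the source.

```python
# (baseline, this paper) pairs copied from the table above.
rows = {
    "Acc@5": (60.6, 64.3),
    "Acc@1": (28.5, 32.1),
    "Average Documents Retrieved": (10.0, 5.7),
}

# Absolute difference, matching the table's Δ column.
deltas = {name: round(ours - base, 1) for name, (base, ours) in rows.items()}

# Relative change in percent of the baseline value (derived, not from the source).
relative = {name: round(100 * (ours - base) / base, 1)
            for name, (base, ours) in rows.items()}

for name in rows:
    print(f"{name}: delta={deltas[name]:+.1f}, relative={relative[name]:+.1f}%")
```

Read this way, the agent retrieves 43% fewer documents on average while gaining 3.7 points of Acc@5.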

Main Takeaways

ReAct agents can outperform standard RAG and fine-tuned baselines in zero-shot settings by dynamically refining search queries based on intermediate reasoning.
Surprisingly, adding discussion comments from historical incidents did not yield significant performance improvements.
Agents are capable of utilizing specific identifiers (error codes, file paths) in queries, which dense retrievers in standard RAG often miss.
Case study confirms feasibility but highlights the 'cold start' problem: agents need access to team-specific tools which may not exist as APIs.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and prompting
Familiarity with Cloud Incident Management (RCA)
Knowledge of ReAct (Reasoning and Acting) framework

Key Terms

RCA: Root Cause Analysis—the process of identifying the fundamental reason for a system failure or incident.

ReAct: Reason+Act—a framework where LLMs generate reasoning traces ('Thoughts') and then execute external actions ('Actions') in an interleaved manner.

OCE: On-Call Engineer—the person responsible for responding to and mitigating production incidents.

RAG: Retrieval-Augmented Generation—providing an LLM with relevant external documents to improve its answers.

TSG: Troubleshooting Guide—structured documentation used by engineers to diagnose known issues.

Zero-shot: Evaluating a model without providing any specific training examples for the task in the prompt.

Chain of Thought: A prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer.