Evaluation Setup
The method is evaluated on an out-of-distribution dataset of production incidents from a large IT corporation (Microsoft).
Benchmarks:
- Internal Incident Dataset (Root Cause Analysis: Retrieval/Diagnosis) [New]
Metrics:
- Acc@k (Top-k Accuracy)
- Retrieval Precision/Recall
- Statistical methodology: Not explicitly reported in the paper
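As a rough illustration of the metrics listed above, the sketch below computes Acc@k and retrieval precision/recall. It assumes each incident has a single ground-truth root-cause label and that the system returns a ranked list of candidate labels; the paper's exact matching criterion is not stated here, so the function names (`acc_at_k`, `precision_recall`) and the toy labels are hypothetical.

```python
from typing import Sequence, Set, Tuple


def acc_at_k(ranked_predictions: Sequence[Sequence[str]],
             gold_labels: Sequence[str], k: int) -> float:
    """Fraction of incidents whose gold label appears in the top-k predictions.

    Assumption: exact string match against one gold label per incident.
    """
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must have equal length")
    if not gold_labels:
        raise ValueError("at least one incident is required")
    hits = sum(1 for preds, gold in zip(ranked_predictions, gold_labels)
               if gold in preds[:k])
    return hits / len(gold_labels)


def precision_recall(retrieved: Sequence[str],
                     relevant: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall for one query's retrieved documents."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data, invented purely for illustration.
predictions = [
    ["network", "disk", "auth"],
    ["disk", "config", "auth"],
    ["auth", "network", "disk"],
    ["config", "disk", "network"],
]
gold = ["network", "auth", "disk", "deploy"]

top1 = acc_at_k(predictions, gold, k=1)
top3 = acc_at_k(predictions, gold, k=3)
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], {"d2", "d4", "d9"})
```

Because the top-k prefix only grows with k, Acc@k is non-decreasing in k for any fixed set of predictions.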
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Internal Incident Dataset | Acc@5 | 60.6 | 64.3 | +3.7 |
| Internal Incident Dataset | Acc@1 | 28.5 | 32.1 | +3.6 |
| Internal Incident Dataset | Average Documents Retrieved | 10 | 5.7 | -4.3 |
Main Takeaways
- ReAct agents can outperform standard RAG and fine-tuned baselines in zero-shot settings by dynamically refining search queries based on intermediate reasoning.
- Surprisingly, adding discussion comments from historical incidents did not yield significant performance improvements.
- Agents can use specific identifiers (error codes, file paths) in queries, which dense retrievers in standard RAG often miss.
- The case study confirms feasibility but highlights a 'cold start' problem: agents need access to team-specific tools, which may not exist as APIs.
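The query-refinement behaviour described in the takeaways can be sketched as a small reason-act loop. This is not the paper's implementation: `stub_policy` is a hard-coded stand-in for the LLM, `keyword_search` is a toy exact-match retriever standing in for whatever search tools the agents actually used, and the incident corpus is invented for illustration. It only shows how an agent might fall back to a specific identifier (an error code) after a vague first query returns nothing.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical historical-incident corpus; all contents are invented.
CORPUS = {
    "INC-1": "Disk latency spike on storage nodes, mitigated by failover",
    "INC-2": "Auth service returned error 0x80070005 after certificate rotation",
    "INC-3": "Deployment rollback after config drift in /etc/app/routing.yaml",
}


def keyword_search(query: str, top_n: int = 2) -> List[str]:
    """Rank incidents by how many query tokens they contain (exact match)."""
    tokens = set(query.lower().split())
    scored = []
    for inc_id, text in CORPUS.items():
        score = len(tokens & set(text.lower().split()))
        if score:
            scored.append((score, inc_id))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [inc_id for _, inc_id in scored[:top_n]]


def stub_policy(incident: str, observations: List[str]) -> Tuple[str, str]:
    """Stand-in for the LLM: returns an (action, argument) pair."""
    if not observations:
        # First attempt: a vague natural-language query.
        return "search", "login failure"
    if observations[-1] == "no results":
        # Refinement: retry with an error code taken from the incident text.
        match = re.search(r"0x[0-9a-fA-F]+", incident)
        if match:
            return "search", match.group(0)
        return "finish", "no results"
    return "finish", observations[-1]


def react_loop(incident: str,
               policy: Callable[[str, List[str]], Tuple[str, str]],
               max_steps: int = 3) -> str:
    """Alternate between policy decisions and search until the policy finishes."""
    observations: List[str] = []
    for _ in range(max_steps):
        action, argument = policy(incident, observations)
        if action == "finish":
            return argument
        hits = keyword_search(argument)
        observations.append(", ".join(hits) or "no results")
    return observations[-1] if observations else "no results"


answer = react_loop("Users cannot log in; logs show error 0x80070005", stub_policy)
```

In this toy run the first query matches nothing, the second query uses the error code, and the loop ends on the matching incident. A real agent would replace `stub_policy` with model calls and `keyword_search` with team-specific tools, which is where the 'cold start' problem noted above arises.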