
Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems

M Chhabra, L Medrano, A Verma
Dell Technologies
arXiv, February 2026
RAG Factuality Benchmark

📝 Paper Summary

[Figure: Modularized RAG pipeline]
A case-aware evaluation framework for enterprise RAG systems that scores multi-turn support interactions using structured metadata and severity-weighted metrics rather than generic faithfulness signals.
Core Problem
Generic RAG metrics such as faithfulness and relevance treat queries as single-turn and independent, so they fail to capture enterprise-specific failures: misinterpreting error codes, violating support workflows, or delivering only partial resolutions across multi-turn contexts.
Why it matters:
  • Enterprise support cases are multi-turn and operationally constrained; a technically relevant answer might still violate protocol or fail to resolve the specific case ID
  • Existing metrics (e.g., RAGAS) conflate retrieval accuracy with resolution quality, providing ambiguous signals that are insufficient for production release gating
  • Proxy metrics often fail to distinguish between models that differ significantly in their ability to handle complex, long-context diagnostic narratives
Concrete Example: A model might give a factually correct answer about a server component (high faithfulness) yet fail to address the specific error code raised three turns earlier in the conversation (low issue identification), leaving the support case unresolved.
Key Novelty
Case-Aware LLM-as-a-Judge Framework
  • Condition the LLM judge on full multi-turn history, structured case metadata (subject, description), and retrieval evidence, unlike standard judges that view only the current query and answer
  • Introduce eight operationally grounded metrics (e.g., Identifier Integrity, Workflow Alignment) specifically designed to catch enterprise support failures rather than just hallucination
  • Implement a severity-aware scoring protocol where critical failures (e.g., corrupted commands) heavily penalize the aggregate score, mirroring enterprise risk tolerance (a minimal sketch of the protocol follows this list)
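
To make the judging protocol concrete, here is a minimal Python sketch of how such a case-aware judge could be assembled. It assumes the OpenAI Python client and a judge that replies with JSON scores; the metric names beyond the three this summary mentions (Identifier Integrity, Workflow Alignment, Hallucination), the severity weights, and the cap-on-critical-failure aggregation rule are all illustrative assumptions, not the paper's exact design.

```python
import json
from openai import OpenAI  # assumed judge client; the paper reports GPT-4 as judge

# Illustrative metric set and severity weights: the paper defines eight
# operational metrics, but only three are named in this summary. The
# remaining names and all weights below are placeholders.
SEVERITY_WEIGHTS = {
    "identifier_integrity": 3.0,    # critical: wrong case ID, error code, or command
    "hallucination": 3.0,           # critical: fabricated facts or corrupted commands
    "workflow_alignment": 2.0,
    "issue_identification": 2.0,
    "resolution_completeness": 1.0,
    "evidence_grounding": 1.0,
}
CRITICAL = {"identifier_integrity", "hallucination"}

def build_judge_prompt(case_meta, history, evidence, answer):
    """Condition the judge on the full case, not just the current Q/A pair."""
    turns = "\n".join(f"{t['role']}: {t['content']}" for t in history)
    docs = "\n---\n".join(evidence)
    return (
        f"Case subject: {case_meta['subject']}\n"
        f"Case description: {case_meta['description']}\n\n"
        f"Conversation so far:\n{turns}\n\n"
        f"Retrieved evidence:\n{docs}\n\n"
        f"Candidate answer:\n{answer}\n\n"
        "Score each metric from 0 to 1 and reply with a JSON object "
        f"whose keys are: {', '.join(SEVERITY_WEIGHTS)}"
    )

def severity_weighted_score(scores):
    """Aggregate per-metric scores so critical failures dominate."""
    total = sum(SEVERITY_WEIGHTS.values())
    agg = sum(SEVERITY_WEIGHTS[m] * scores[m] for m in SEVERITY_WEIGHTS) / total
    # One plausible reading of "severity-aware": any critical failure
    # (e.g., a corrupted command) caps the aggregate outright.
    if any(scores[m] < 0.5 for m in CRITICAL):
        agg = min(agg, 0.25)
    return agg

def judge_turn(client: OpenAI, case_meta, history, evidence, answer):
    """Score one assistant turn with the case-aware judge."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": build_judge_prompt(case_meta, history, evidence, answer)}],
    )
    return severity_weighted_score(json.loads(resp.choices[0].message.content))
```

The hard cap reflects the summary's claim that critical failures should heavily penalize the aggregate: a plain weighted average would let strong scores elsewhere wash out a corrupted command.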
Evaluation Highlights
  • Case-aware evaluation reveals statistically significant performance gaps (p=0.0011) between models on long-context queries that generic proxy metrics missed
  • Human validation shows strong alignment with the automated judge on high-risk failure modes: 91% agreement on Identifier Integrity and 88% on Hallucination
  • Evaluation cost scales linearly (approx. $0.014 per turn using GPT-4), enabling feasible batch regression testing for enterprise release cycles (see the cost sketch after this list)
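
A quick back-of-the-envelope sketch of the release-gating economics and the model comparison, assuming the reported ≈$0.014-per-turn judging cost and a paired t-test over per-query aggregate scores (the summary does not say which significance test produced p = 0.0011); the suite sizes are hypothetical.

```python
from scipy import stats

COST_PER_TURN_USD = 0.014  # reported approximate GPT-4 judging cost per turn

def regression_suite_cost(n_cases: int, avg_turns: float) -> float:
    """Linear cost model: every turn of every case is judged once per run."""
    return n_cases * avg_turns * COST_PER_TURN_USD

# Hypothetical suite: 2,000 support cases averaging 6 turns each
print(f"Cost per full regression run: ${regression_suite_cost(2000, 6):.2f}")  # $168.00

def compare_models(scores_a, scores_b):
    """Paired two-sided test on per-query severity-weighted scores.
    A paired t-test is one reasonable choice; the paper's actual test
    is not stated in this summary."""
    _, p_value = stats.ttest_rel(scores_a, scores_b)
    return p_value
```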
Breakthrough Assessment
7/10
Significant practical contribution for enterprise RAG deployment. While not an architectural breakthrough, it effectively solves the 'metric gap' where generic academic metrics fail to predict production success in complex workflows.