CCRS: A Zero-Shot LLM-as-a-Judge Framework for Comprehensive RAG Evaluation

A Muhamed
Carnegie Mellon University
arXiv, 6/2025 (2025)
RAG Factuality Benchmark

📝 Paper Summary

Modularized RAG pipeline, Evaluation methodology
CCRS evaluates RAG systems using a single Llama-70B model as a zero-shot judge to score five dimensions directly, replacing complex multi-step pipelines.
Core Problem
Existing RAG evaluation methods rely either on inadequate lexical overlap metrics (BLEU/ROUGE) or on complex, computationally expensive pipelines (RAGChecker, RAGAS) that require intermediate steps such as claim extraction or specialized fine-tuning.
Why it matters:
  • Standard metrics like BLEU fail to measure factual grounding or hallucination, both of which are critical in domains like medicine
  • Complex pipelines like RAGChecker are computationally heavy and brittle; errors in intermediate steps (like claim extraction) propagate to final scores
  • Evaluating faithfulness and relevance efficiently is necessary for rapid iteration during RAG system development
Concrete Example: Traditional metrics might give a high score to a plausible-sounding but hallucinated answer if it overlaps lexically with a reference. Conversely, RAGAS requires decomposing an answer into statements and running NLI on each, which is slow. CCRS aims to judge the answer in a single pass.
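The lexical-overlap failure described above can be illustrated with a toy metric. The sketch below is not BLEU or ROUGE; it is a simplified unigram F1 written for illustration, and the three sentences are invented examples, not data from the paper. It shows how an answer that flips the meaning of the reference can still outscore a faithful paraphrase.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy lexical-overlap score: F1 over whitespace-separated unigrams."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented sentences for illustration only.
reference = "aspirin reduces the risk of heart attack in adults"
hallucinated = "aspirin increases the risk of heart attack in adults"
paraphrase = "taking aspirin lowers the chance of a cardiac event"

score_hallucinated = unigram_f1(hallucinated, reference)
score_paraphrase = unigram_f1(paraphrase, reference)

print(f"hallucinated: {score_hallucinated:.3f}")
print(f"paraphrase:   {score_paraphrase:.3f}")
```

The hallucinated answer differs from the reference by a single word and therefore scores far higher than the correct paraphrase, which is the gap that a judge reading for meaning is intended to close.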
Key Novelty
CCRS (Contextual Coherence and Relevance Score)
  • Uses a single powerful LLM (Llama-70B) as a zero-shot judge to evaluate five specific quality dimensions simultaneously without intermediate processing steps
  • Defines a specific prompt-based framework for measuring Contextual Coherence, Question Relevance, Information Density, Answer Correctness, and Information Recall directly
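A minimal sketch of how such a zero-shot judge could be wired up is shown below. Only the five dimension names come from the summary above; the prompt wording, the 0 to 100 scale, the one-prompt-per-dimension layout, and the helper names (`build_prompt`, `parse_score`, `score_answer`) are assumptions made for illustration and are not taken from the paper. The `judge` argument is a hypothetical callable that sends a prompt to whatever LLM is in use and returns its text reply.

```python
import re
from typing import Callable, Dict, Optional

# Dimension names as listed in the summary.
DIMENSIONS = [
    "Contextual Coherence",
    "Question Relevance",
    "Information Density",
    "Answer Correctness",
    "Information Recall",
]


def build_prompt(dimension: str, question: str, context: str,
                 answer: str, reference: str) -> str:
    """Assemble a zero-shot judging prompt (wording is an assumption)."""
    return (
        "You are an impartial judge of a retrieval-augmented answer.\n"
        f"Rate the answer on '{dimension}' from 0 to 100.\n\n"
        f"Question: {question}\n"
        f"Retrieved context: {context}\n"
        f"Reference answer: {reference}\n"
        f"Answer to evaluate: {answer}\n\n"
        "Reply with the number only."
    )


def parse_score(reply: str) -> Optional[float]:
    """Take the first number in the reply, clamp to [0, 100], scale to [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None  # the judge did not return a usable score
    value = float(match.group())
    return min(max(value, 0.0), 100.0) / 100.0


def score_answer(judge: Callable[[str], str], question: str, context: str,
                 answer: str, reference: str) -> Dict[str, Optional[float]]:
    """Query the judge once per dimension, with no intermediate processing."""
    return {
        dim: parse_score(judge(build_prompt(dim, question, context,
                                            answer, reference)))
        for dim in DIMENSIONS
    }


# Stub judge standing in for a real LLM call; it always replies "85".
def stub_judge(prompt: str) -> str:
    return "85"


scores = score_answer(
    stub_judge,
    question="What does drug X treat?",
    context="Drug X is approved for condition Y.",
    answer="Drug X treats condition Y.",
    reference="Condition Y.",
)
print(scores)
```

Keeping the model behind a plain callable means the same scoring code can be exercised with a stub, as here, and later pointed at a hosted or local model without changes.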
Evaluation Highlights
  • CCRS metrics effectively discriminate between reader models, confirming Mistral-7B outperforms Llama-2 variants on BioASQ
  • Demonstrates that the E5 neural retriever specifically enhances Question Relevance and Information Recall compared to BM25
  • Achieves discriminative power comparable to the complex RAGChecker framework for recall and faithfulness while being significantly more computationally efficient
Breakthrough Assessment
7/10
Provides a practical, streamlined alternative to complex RAG evaluation pipelines. While not introducing a new fundamental algorithm, its efficiency and comprehensive validation on biomedical data make it a valuable tool.