
Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets

L Brehme, T Ströhle, R Breu
Department of Computer Science, University of Innsbruck
IEEE Swiss Conference on … (2025)
RAG Benchmark QA Factuality

📝 Paper Summary

Benchmark Metrics and evaluation
This systematic literature review synthesizes 63 papers to categorize state-of-the-art evaluation methodologies for Retrieval-Augmented Generation (RAG) systems across datasets, indexing, retrieval, and generation components.
Core Problem
RAG systems involve multiple complex components (indexing, retrieval, generation), making systematic evaluation difficult due to a lack of standardized metrics, domain-specific datasets, and clear guidance on configuration.
Why it matters:
  • Companies lack clear guidance on the do's and don'ts of implementing and evaluating RAG systems for domain-specific applications
  • Previous reviews focused narrowly on retriever/generator metrics, often overlooking the critical impact of indexing strategies and dataset creation
  • Reliance on public datasets (like HotpotQA) is often impractical for specialized domains where RAG adds the most value
Concrete Example: Public datasets like HotpotQA often contain knowledge the LLM has already memorized during training, which makes the retrieval component redundant. A valid evaluation requires domain-specific data (e.g., proprietary legal docs) where the model *must* use retrieval, but creating such datasets manually is cost-prohibitive.
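One way to make the memorization concern concrete is to check which questions a model already answers without any retrieved context. The sketch below is illustrative only and is not taken from the survey: `closed_book_answer` is a hypothetical callable standing in for an LLM queried with no context, the QA pairs are invented, and the substring match is a crude assumption in place of a proper answer-equivalence check.

```python
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a lenient string comparison."""
    return " ".join(text.lower().split())


def split_by_memorization(
    qa_pairs: List[Dict[str, str]],
    closed_book_answer: Callable[[str], str],
) -> Dict[str, List[Dict[str, str]]]:
    """Separate QA pairs the model answers correctly with no retrieved context."""
    memorized, needs_retrieval = [], []
    for pair in qa_pairs:
        prediction = normalize(closed_book_answer(pair["question"]))
        # Substring containment is a rough proxy for answer correctness (assumption).
        if normalize(pair["answer"]) in prediction:
            memorized.append(pair)
        else:
            needs_retrieval.append(pair)
    return {"memorized": memorized, "needs_retrieval": needs_retrieval}


def fake_model(question: str) -> str:
    """Toy stand-in for an LLM called without context (hypothetical)."""
    known = {"What is the capital of France?": "The capital is Paris."}
    return known.get(question, "I don't know.")


qa = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is the notice period in contract 17?", "answer": "90 days"},
]
result = split_by_memorization(qa, fake_model)
print(len(result["memorized"]), len(result["needs_retrieval"]))
```

Pairs that land in `needs_retrieval` are the ones where a RAG evaluation can actually distinguish a good retriever from a bad one.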
Key Novelty
Holistic RAG Evaluation Taxonomy
  • Categorizes evaluation strategies into four distinct pillars: Datasets (creation/enhancement), Indexing/Database, Retriever, and Generator
  • Identifies a shift toward automated evaluation pipelines where LLMs serve as both dataset generators (creating synthetic QA pairs) and judges (scoring relevance/faithfulness)
  • Highlights the often-neglected component of indexing evaluation (e.g., embedding similarity and retrieval speed) as critical to overall system performance
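The two indexing signals named above, embedding similarity and retrieval speed, can be measured together in a few lines. This is a minimal sketch under stated assumptions rather than the survey's procedure: the two-dimensional vectors are toy data, the brute-force scan stands in for a real vector index, and the helper names (`cosine`, `top_k`, `timed_search`) are invented for illustration.

```python
import math
import time
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float], index: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Brute-force search: score every vector, return the k most similar."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(index)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def timed_search(
    query: Sequence[float], index: List[Sequence[float]], k: int
) -> Tuple[List[Tuple[int, float]], float]:
    """Return the search hits together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    hits = top_k(query, index, k)
    return hits, time.perf_counter() - start


# Toy index of three document embeddings (invented values).
index = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
hits, seconds = timed_search([1.0, 0.1], index, k=2)
print([doc_id for doc_id, _ in hits], f"{seconds:.6f}s")
```

On a real system the latency would be averaged over many queries, since a single timing is noisy.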
Evaluation Highlights
  • Identified 87 distinct QA datasets used for benchmarking, ranging from short-answer to multi-hop reasoning tasks
  • Found 41 papers employing LLMs as judges, indicating that automated evaluation is a dominant trend, favored over human evaluation for its scalability
  • Cataloged 7 distinct methods for determining document relevance in retriever evaluation, including LLM-based binary classification and sentence proportion metrics
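Once documents or sentences carry relevance labels, turning them into retriever scores is simple arithmetic. The sketch below assumes the labels already exist (for example, produced by an LLM-based binary classifier) and uses invented values. Precision@k and reciprocal rank are standard information-retrieval metrics included for illustration, not necessarily among the seven methods the survey catalogs, and `sentence_proportion` is one plausible reading of a sentence proportion metric.

```python
from typing import List


def precision_at_k(relevance: List[int], k: int) -> float:
    """Fraction of the top-k retrieved documents labeled relevant (1)."""
    return sum(relevance[:k]) / k if k else 0.0


def reciprocal_rank(relevance: List[int]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is relevant."""
    for rank, label in enumerate(relevance, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def sentence_proportion(sentence_labels: List[int]) -> float:
    """Share of sentences in the retrieved context labeled relevant."""
    if not sentence_labels:
        return 0.0
    return sum(sentence_labels) / len(sentence_labels)


# Binary labels in ranked order: 1 = relevant, 0 = not relevant (invented).
doc_labels = [0, 1, 1, 0]
p_at_3 = precision_at_k(doc_labels, 3)
rr = reciprocal_rank(doc_labels)

# Per-sentence labels for one retrieved context (invented).
prop = sentence_proportion([1, 0, 0, 1, 1])
print(p_at_3, rr, prop)
```

The quality of these scores depends entirely on the quality of the labels, which is why the reliability of the LLM judge matters.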
Breakthrough Assessment
7/10
A comprehensive survey that fills a gap by including indexing and dataset generation in RAG evaluation. While it doesn't propose a new algorithm, it provides a crucial taxonomy for researchers.