
RAGProbe: An automated approach for evaluating RAG applications

S Sivasothy, S Barnett, S Kurniawan, Z Rasool…
Applied Artificial Intelligence Institute, Deakin University
arXiv, 9/2024
RAG Benchmark QA

📝 Paper Summary

Modularized RAG pipeline, RAG Evaluation
RAGProbe automates RAG pipeline evaluation by generating diverse, domain-specific question-answer pairs (evaluation scenarios) from a corpus to trigger and measure specific failure modes.
Core Problem
Evaluating RAG pipelines is currently a manual, trial-and-error process that lacks a systematic way to generate domain-specific test cases covering complex failure scenarios (e.g., questions spanning multiple documents).
Why it matters:
  • Existing tools (like RAGAS) lack schemas for capturing different question types and fail to generate templates for specific failure modes.
  • Manual evaluation is time-consuming and cannot scale to the effectively unlimited variety of questions users might ask against proprietary corpora.
  • Developers lack visibility into which specific RAG component (retrieval vs. generation) causes failures in complex scenarios.
Concrete Example: When a user asks a combined question requiring information from two different documents (e.g., 'What are the interest rates in Doc A and penalties in Doc B?'), standard RAG pipelines often fail to retrieve both chunks or to synthesize a single answer. In the authors' study, such questions failed 91% of the time.
Key Novelty
RAGProbe: Scenario-Based Automated Evaluation
  • Defines an 'Evaluation Scenario' schema that includes document sampling, chunking strategies, and specific prompt templates to target distinct RAG capabilities (e.g., multi-document reasoning, negative constraints).
  • Synthesizes domain-specific QA pairs based on these scenarios to act as 'test cases' for the pipeline.
  • Systematically triggers known failure points (like multi-hop questions or unanswerable questions) rather than just checking general relevance.
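The scenario idea above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name EvaluationScenario, its fields, the helper functions, the prompt wording, and the two-document toy corpus are all assumptions made for the example. The sketch only builds the prompt that would be sent to a question-generating LLM; the LLM call itself is left out.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class EvaluationScenario:
    """Hypothetical schema; field names are illustrative, not taken from the paper."""
    name: str
    num_documents: int    # how many documents to sample from the corpus
    chunk_size: int       # chunk length in characters (one possible chunking strategy)
    prompt_template: str  # template for the question-generating LLM


def chunk(text: str, size: int) -> list[str]:
    """Split text into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_prompt(scenario: EvaluationScenario, corpus: dict[str, str], seed: int = 0) -> str:
    """Sample documents, pick one chunk from each, and fill the scenario template."""
    rng = random.Random(seed)
    doc_ids = rng.sample(sorted(corpus), scenario.num_documents)
    passages = [rng.choice(chunk(corpus[d], scenario.chunk_size)) for d in doc_ids]
    context = "\n\n".join(f"[{d}] {p}" for d, p in zip(doc_ids, passages))
    return scenario.prompt_template.format(context=context)


# A scenario targeting questions whose answer spans several documents.
multi_doc = EvaluationScenario(
    name="combined-question-multiple-documents",
    num_documents=2,
    chunk_size=200,
    prompt_template=(
        "Write one question that can only be answered by combining "
        "all passages below, then give the answer.\n\n{context}"
    ),
)

# Toy corpus, invented for illustration only.
corpus = {
    "doc_a": "The standard interest rate is 4.5% per year.",
    "doc_b": "Late payments incur a penalty of 30 dollars.",
}
prompt = build_prompt(multi_doc, corpus)
print(prompt)
```

Keeping the sampling, chunking, and template settings together in one record means a new failure mode can be targeted by adding another scenario instance rather than writing new generation code.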
Evaluation Highlights
  • Identified a 91% failure rate in open-source RAG pipelines for questions requiring answers spanning multiple documents.
  • Outperformed state-of-the-art (RAGAS) by generating more valid QA pairs (98% vs 93% on Google NQ) and triggering 51% more failures on average.
  • Revealed a 78% failure rate for questions combining multiple sub-questions from a single document across 5 open-source pipelines.
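A failure rate of this kind is the fraction of generated test questions in a scenario that the pipeline answers incorrectly. The sketch below shows that arithmetic under stated assumptions: the function name failure_rates, the (scenario, passed) record shape, and the toy outcomes are hypothetical and are not the paper's data or code.

```python
from collections import defaultdict


def failure_rates(results):
    """Map each scenario name to its failure fraction.

    results: iterable of (scenario_name, passed) pairs, where passed is a bool.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for scenario, passed in results:
        totals[scenario] += 1
        if not passed:
            failures[scenario] += 1
    return {s: failures[s] / totals[s] for s in totals}


# Toy outcomes, invented for illustration only (not the paper's data).
toy = [
    ("multi-document", False),
    ("multi-document", False),
    ("multi-document", True),
    ("multi-document", False),
    ("single-document", True),
    ("single-document", False),
]
rates = failure_rates(toy)
print(rates)
```

Reporting the rate per scenario, rather than one aggregate score, is what lets a developer see which kind of question a pipeline handles worst.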
Breakthrough Assessment
7/10
Strong practical contribution for RAG engineering. It shifts evaluation from generic metrics to scenario-based testing, which is crucial for reliability, though the underlying technique is primarily prompt engineering and workflow automation.