RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems

📝 Paper Summary

RAG Evaluation Benchmark Datasets

RAGBench is a 100k-example, multi-domain benchmark that supports granular RAG evaluation through the TRACe framework, and it shows that a small fine-tuned language model can outperform LLM-based judges.

Core Problem

Comprehensive evaluation of RAG systems is hindered by the lack of unified benchmarks and reliance on disjoint, irreproducible evaluation criteria (like context relevance vs. answer faithfulness) that vary across studies.

Why it matters:

RAG systems in production are prone to hallucinations and retrieval failures, necessitating rigorous fine-tuning and evaluation.
Current benchmarks are small, domain-specific, or lack granular annotations, making it hard to compare different RAG configurations or evaluation approaches.
Zero-shot LLM-based evaluation (e.g., GPT-4 as a judge) is expensive and often inconsistent compared to ground-truth benchmarks.

Concrete Example: A RAG system might retrieve relevant documents but fail to use them (low Utilization), or use them but include hallucinations (low Adherence). Existing metrics like 'correctness' are too coarse to distinguish these failure modes, preventing developers from knowing whether to fix the retriever or the generator.

Key Novelty

TRACe Framework & RAGBench Dataset

Constructs a large (100k-example) dataset from 12 heterogeneous source datasets (e.g., finance, legal, biomedical), converted into a standardized RAG format with 'silver' labels generated by GPT-4 and validated against human annotations.
Introduces TRACe (uTilization, Relevance, Adherence, Completeness) to granularly measure which specific tokens in the context are relevant and which are actually used by the generator.
Demonstrates that a small, fine-tuned DeBERTa model can outperform few-shot GPT-4 in predicting these fine-grained evaluation metrics.

Architecture

Conceptual diagram of the RAG pipeline and the configuration parameters varied during RAGBench construction.

Evaluation Highlights

Fine-tuned DeBERTa-large (400M parameters) outperforms GPT-4-based judges on RAG evaluation tasks across multiple domains.
The proposed TRACe metrics achieve 93% example-level and 95% span-level agreement with human judgments on the DelucionQA test split.
RAGBench covers 5 distinct domains (biomedical, general knowledge, legal, customer support, finance) with context lengths ranging from 100 to 11k tokens.
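
The agreement figures above can be read as simple match rates between predicted and human labels. The sketch below is a minimal illustration under that assumption; the summary does not give the paper's exact agreement formula, and the function names and toy labels are hypothetical.

```python
from typing import Sequence


def example_level_agreement(predicted: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of examples where the predicted and human labels match."""
    if len(predicted) != len(human) or not predicted:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(p == h for p, h in zip(predicted, human)) / len(predicted)


def span_level_agreement(
    predicted: Sequence[Sequence[bool]], human: Sequence[Sequence[bool]]
) -> float:
    """Fraction of spans, pooled over all examples, where the labels match."""
    flat_pred = [label for spans in predicted for label in spans]
    flat_human = [label for spans in human for label in spans]
    return example_level_agreement(flat_pred, flat_human)


# Toy labels (hypothetical): one label per example, then one label per span.
pred_examples = [True, False, True, True]
human_examples = [True, False, False, True]
pred_spans = [[True, False], [True, True, False]]
human_spans = [[True, False], [True, False, False]]

example_agreement = example_level_agreement(pred_examples, human_examples)  # 3 of 4 match
span_agreement = span_level_agreement(pred_spans, human_spans)  # 4 of 5 match
```

Pooling spans across examples weights long responses more heavily than short ones; averaging per-example span agreement would be an equally plausible reading.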

Breakthrough Assessment

8/10

Significantly scales up RAG evaluation resources (100k examples vs typical <5k) and provides a compelling case for using small, specialized evaluation models over expensive LLM judges.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of Retrieval-Augmented Generation systems

Inputs: A tuple (Query, Context Documents, Generated Response)

Outputs: Scalar scores or boolean labels for TRACe metrics (Relevance, Utilization, Adherence, Completeness)
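
The input tuple and output scores can be pictured as two small record types. This is only a sketch: the class and field names are hypothetical, and treating Adherence as a boolean while the other three metrics are fractions in [0, 1] is an assumption based on the Key Terms definitions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RagExample:
    """One evaluation input: (Query, Context Documents, Generated Response)."""
    query: str
    context_documents: List[str]
    response: str


@dataclass(frozen=True)
class TraceScores:
    """Evaluator output: three fractions in [0, 1] plus a boolean adherence label."""
    relevance: float
    utilization: float
    completeness: float
    adherent: bool

    def __post_init__(self) -> None:
        # Reject scores outside the valid range for a fraction.
        for name in ("relevance", "utilization", "completeness"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


# Hypothetical example values, for illustration only.
example = RagExample(
    query="What is the warranty period?",
    context_documents=["The warranty lasts two years.", "Shipping takes five days."],
    response="The warranty period is two years.",
)
scores = TraceScores(relevance=0.5, utilization=0.5, completeness=1.0, adherent=True)
```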

Pipeline Flow

Data Collection (12 source datasets)
Standardization (formatting to RAG schema)
Response Generation (GPT-3.5, Claude 3 Haiku)
Annotation (GPT-4 with CoT prompting)
Metric Computation (TRACe)

System Modules

Data Aggregator (Data Construction)

Combine and standardize 12 distinct datasets (e.g., HotpotQA, PubMedQA, CUAD) into a unified format

Model or implementation: N/A

Response Generator (Data Construction)

Generate synthetic RAG responses to create the benchmark targets

Model or implementation: GPT-3.5-turbo-0125 and Claude 3 Haiku

Annotator

Generate silver-standard labels for Relevance, Utilization, and Adherence

Model or implementation: GPT-4-0125-preview

Evaluation Model

Predict TRACe metrics given an input tuple, replacing the expensive GPT-4 annotator

Model or implementation: DeBERTa-large (400M parameters)

Novel Architectural Elements

TRACe metric definitions: Specifically the formalization of 'Context Utilization' (fraction of context used) and 'Completeness' (fraction of relevant context used) as distinct from Relevance and Adherence.
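
The fraction-based definitions above can be made concrete with per-token masks over the retrieved context. The sketch below assumes two boolean masks (relevant to the query, used by the generator); the function name, the mask representation, and the handling of the no-relevant-tokens case are assumptions rather than details from the paper.

```python
from typing import Dict, Sequence


def trace_fractions(relevant: Sequence[bool], utilized: Sequence[bool]) -> Dict[str, float]:
    """Compute Relevance, Utilization and Completeness from per-token context masks.

    relevant[i]: context token i is relevant to the query
    utilized[i]: context token i was used by the generator
    """
    if len(relevant) != len(utilized):
        raise ValueError("masks must cover the same context tokens")
    if len(relevant) == 0:
        raise ValueError("context must contain at least one token")

    n_tokens = len(relevant)
    n_relevant = sum(relevant)
    n_utilized = sum(utilized)
    n_both = sum(r and u for r, u in zip(relevant, utilized))

    return {
        "relevance": n_relevant / n_tokens,      # fraction of context that is relevant
        "utilization": n_utilized / n_tokens,    # fraction of context that is used
        # Assumption: report 0.0 when no context token is relevant.
        "completeness": n_both / n_relevant if n_relevant else 0.0,
    }


# Toy context of 8 tokens: 4 are relevant, 3 are used, 2 are both.
relevant_mask = [True, True, True, True, False, False, False, False]
utilized_mask = [True, True, False, False, True, False, False, False]
fractions = trace_fractions(relevant_mask, utilized_mask)
# relevance = 4/8, utilization = 3/8, completeness = 2/4
```

Completeness uses the relevant tokens as its denominator, which is what separates it from Utilization: a response can use little of the total context and still be complete if it covers everything relevant.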

Modeling

Base Model: DeBERTa-large (400M parameters)

Training Method: Fine-tuning (Supervised Learning)

Objective Functions:

Purpose: Minimize classification error on RAG evaluation tasks.

Formally: Not explicitly detailed, but implied standard cross-entropy for classification/NLI tasks.
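
For reference, the standard multi-class cross-entropy that the line above refers to has the following form. This is the textbook loss, shown as an assumption and not as the paper's stated objective; the symbols N (number of training items), C (number of classes), y (one-hot label), and p_θ (model probability) are introduced here for illustration.

```latex
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \, \log p_{\theta}(c \mid x_i)
```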

Training Data:

100k examples split into train/validation/test
Source domains: Bio-medical, General Knowledge, Legal, Customer Support, Finance

Key Hyperparameters:

base_model_parameters: 400M

Compute: Not reported in the paper

Comparison to Prior Work

vs. RAGAS/TruLens: RAGBench provides a labeled dataset for training specialized evaluators, rather than relying on zero-shot LLM prompts.
vs. ARES: RAGBench introduces new metrics (Utilization, Completeness) and covers a much larger, multi-domain scale (100k vs smaller in-domain datasets).
vs. RAGTruth: RAGBench includes metrics beyond just hallucination (e.g., retrieval quality via Relevance/Utilization) and is significantly larger.

Limitations

Relies on GPT-4 for 'silver' ground truth labels (though validated on a human-annotated subset).
The claim that the fine-tuned DeBERTa model outperforms LLM judges is stated qualitatively (Introduction/Abstract); the main text provides no comparative table with exact numbers for this comparison.
Only supports English language tasks.

Reproducibility

Dataset: https://huggingface.co/datasets/rungalileo/ragbench

Dataset is publicly available on HuggingFace. The specific fine-tuning code for the DeBERTa model is not linked, but the dataset availability allows for replication of the benchmarking results. The prompt templates for generation and annotation are provided in the Appendix.

📊 Experiments & Results

Evaluation Setup

Benchmarking RAG evaluation models (Judges) against the RAGBench dataset

Benchmarks:

RAGBench (RAG evaluation, metric prediction) [New]
DelucionQA (Human-validated subset for metric validation)

Metrics:

Context Relevance
Context Utilization
Adherence (Faithfulness)
Completeness

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
DelucionQA	Example-level Agreement (%)	100	93	-7
DelucionQA	Span-level Agreement (%)	100	95	-5
RAGBench	Hallucination Rate	0	12	+12
RAGBench	Max Context Length (tokens)	n/a	11000	n/a

Experiment Figures

Impact of RAG configuration parameters (Retriever, Generator, Prompt) on TRACe metrics.

Main Takeaways

A 400M-parameter DeBERTa model fine-tuned on RAGBench outperforms few-shot LLM judges, suggesting specialized small models are viable for RAG evaluation.
The TRACe metrics (specifically combining Utilization and Relevance) provide granular insights: low Utilization + low Relevance = greedy retriever; low Utilization alone = weak generator.
Chain-of-thought prompting reduces hallucinations (improves Adherence) for stronger models like GPT-4o but can increase hallucinations for weaker models like GPT-3.5.
Choice of retriever heavily influences context relevance scores, while choice of generator heavily influences adherence and completeness.
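
The diagnostic rules in these takeaways can be sketched as a small decision function. The threshold `low=0.3`, the function name, and the message wording are hypothetical choices for illustration; the summary does not state what counts as "low".

```python
from typing import List


def diagnose(relevance: float, utilization: float, adherent: bool, low: float = 0.3) -> List[str]:
    """Map TRACe scores to likely failure points, following the rules in the takeaways.

    `low` is an assumed cutoff below which a fraction counts as "low".
    """
    findings: List[str] = []
    if utilization < low and relevance < low:
        # Low Utilization + low Relevance: the retriever returns mostly unneeded context.
        findings.append("greedy retriever: most retrieved context is irrelevant and unused")
    elif utilization < low:
        # Low Utilization alone: relevant context was available but the generator ignored it.
        findings.append("weak generator: relevant context was retrieved but not used")
    if not adherent:
        # Low Adherence: the response contains unsupported content.
        findings.append("hallucination: response is not grounded in the context")
    return findings or ["no issue flagged"]


retriever_case = diagnose(relevance=0.1, utilization=0.05, adherent=True)
generator_case = diagnose(relevance=0.8, utilization=0.1, adherent=False)
healthy_case = diagnose(relevance=0.8, utilization=0.7, adherent=True)
```

In practice the cutoff would need to be tuned per domain, since typical Relevance and Utilization values depend heavily on context length.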

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG architectures (retriever + generator)
Familiarity with LLM evaluation metrics (hallucination, faithfulness)
Knowledge of BERT/DeBERTa architectures for classification

Key Terms

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents

TRACe: A framework evaluating uTilization, Relevance, Adherence, and Completeness of RAG responses

Adherence: Whether the response is strictly grounded in the provided context (synonymous with faithfulness)

Utilization: The fraction of retrieved context tokens that are actually used by the generator to produce the response

Completeness: The fraction of relevant context information that is incorporated into the response

Relevance: The fraction of retrieved context tokens that are actually relevant to the input query

NLI: Natural Language Inference—determining if a hypothesis is true (entailment), false (contradiction), or unrelated (neutral) given a premise

DeBERTa: Decoding-enhanced BERT with disentangled attention—a transformer model optimized for natural language understanding tasks

Chain of Thought (CoT): Prompting strategy that asks the model to generate intermediate reasoning steps before the final answer