GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation

📝 Paper Summary

RAG Evaluation, Hallucination detection, Grounding attribution

GaRAGe is a RAG benchmark with human-annotated grounding for every retrieved passage, enabling precise evaluation of whether LLMs use only relevant information and deflect when grounding is insufficient.

Core Problem

Existing RAG benchmarks either evaluate final answers without checking whether the grounding was relevant and actually used, or rely on synthetic or unannotated contexts that conflate retrieval quality with generation quality.

Why it matters:

Current metrics often reward 'correct' answers that ignore the provided context and come from parametric memory instead, which risks hallucination in real-world private-data scenarios.
LLMs frequently summarize all retrieved documents regardless of relevance rather than filtering out noise.
Real-world RAG systems must 'deflect' (refuse to answer) when the retrieved information is insufficient, a capability rarely tested in current benchmarks.

Concrete Example: A user asks about a specific policy update. The retriever returns an outdated policy document (irrelevant) and a generic web snippet. A standard LLM might hallucinate an answer from its training data or summarize the outdated policy. GaRAGe penalizes both behaviors by checking whether the model used only passages annotated as 'relevant' or correctly deflected.

Key Novelty

GaRAGe (Grounding Annotations for RAG evaluation)

Provides snippet-level human annotations for 35k+ passages, labeling each as 'relevant', 'related', 'outdated', or 'unknown' relative to the question
Introduces Relevance-Aware Factuality (RAF), a metric that penalizes models for using information from retrieved passages that are actually irrelevant or outdated
Includes a specific evaluation subset for 'deflection', where the grounding is intentionally insufficient, testing the model's ability to say 'I don't know' instead of hallucinating

Architecture

The dataset construction pipeline for GaRAGe.

Evaluation Highlights

State-of-the-art models (including GPT-4o) reach at most 60% on Relevance-Aware Factuality (RAF), showing they struggle to filter irrelevant context
In deflection scenarios (insufficient grounding), the best model (GPT-4o) achieves only a 31.1% true positive rate, frequently hallucinating answers instead of refusing
Performance drops significantly (~10%) on time-sensitive 'Fast-Changing' questions, indicating LLMs struggle to reason about the temporal validity of grounding

Breakthrough Assessment

8/10

Significant contribution to RAG evaluation by addressing the 'black box' nature of context usage. The granular human annotation of grounding relevance allows for much stricter and more realistic assessment of hallucination and noise robustness than existing datasets.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-Augmented Generation where the system receives a query and a set of retrieved passages (mixed relevance), and must generate a grounded answer or deflect.

Inputs: Question Q, set of retrieved passages P (some relevant, some noise/outdated)

Outputs: Long-form answer A utilizing only relevant p ∈ P, or a deflection response if relevant information is missing
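
As a minimal sketch of this input/output contract (not the benchmark's actual loader), the setting can be modeled as a question plus labeled passages. The names `Label`, `Passage`, `RagExample`, and `expected_behavior` are hypothetical, and the label strings follow the four labels named in this summary; the released dataset may spell them differently.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    # Label names follow this summary; the released files may use other strings.
    RELEVANT = "relevant"
    RELATED = "related"
    OUTDATED = "outdated"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Passage:
    text: str
    label: Label


@dataclass(frozen=True)
class RagExample:
    question: str    # Q
    passages: tuple  # P, a tuple of Passage objects

    def relevant_passages(self):
        """Passages a grounded answer is allowed to draw on."""
        return [p for p in self.passages if p.label is Label.RELEVANT]

    def expected_behavior(self):
        """'answer' if any relevant passage exists, otherwise 'deflect'."""
        return "answer" if self.relevant_passages() else "deflect"


example = RagExample(
    question="What changed in the latest policy update?",
    passages=(
        Passage("Policy text from an older revision.", Label.OUTDATED),
        Passage("Generic snippet about company policies.", Label.RELATED),
    ),
)
print(example.expected_behavior())  # deflect
```

Because neither passage is labeled relevant, the expected output here is a deflection rather than an answer.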

Pipeline Flow

Question Generation (Multi-step LLM pipeline)
Grounding Collection (Web Search + Private KB Retrieval)
Human Annotation (Relevance labeling & Answer writing)
Evaluation (LLM-as-a-judge metrics)

System Modules

Question Generator (Data Construction)

Generate complex questions requiring multi-hop reasoning or temporal awareness

Model or implementation: LLM (specific model not named, likely proprietary)

Grounding Retriever (Data Construction)

Retrieve passages from Web and Private KBs (Enron, Arxiv, SEC filings)

Model or implementation: Proprietary search engine + Cross-encoder reranker
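
The retrieve-then-rerank step can be illustrated with a small sketch. The paper's search engine and cross-encoder are not available here, so `overlap_score` is a deliberately simple stand-in (token overlap), not a cross-encoder. What it shares with one is the interface: it scores each (query, passage) pair jointly. All function names are hypothetical.

```python
def overlap_score(query, passage):
    """Stand-in scorer: fraction of query tokens that appear in the passage."""
    q_tokens = set(query.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / len(q_tokens) if q_tokens else 0.0


def rerank(query, passages, score=overlap_score, top_k=None):
    """Order passages by descending pair score; optionally keep the top_k."""
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    return ranked if top_k is None else ranked[:top_k]


query = "remote work policy update"
passages = [
    "Quarterly earnings rose sharply",
    "The remote work policy update takes effect in March",
    "Office policy on parking",
]
print(rerank(query, passages, top_k=2))
# ['The remote work policy update takes effect in March', 'Office policy on parking']
```

A real cross-encoder could be swapped in through the `score` parameter without changing `rerank`.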

LLM Judge

Evaluate model outputs for eligibility, factuality, and deflection

Model or implementation: GPT-4o (temperature 0.2)

Novel Architectural Elements

Annotation schema combining passage-level relevance labels (Answer, referred to as 'relevant' elsewhere in this summary, plus Related, Outdated, and Unknown) with question-level attributes (time sensitivity, complexity)
Metric design (RAF) that strictly conditions factuality on *annotated* relevance, distinguishing it from standard RAG metrics that assume all retrieved context is ground truth
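
A minimal sketch of how RAF could be aggregated from per-answer judgments, assuming the judge has already produced two booleans per answer: whether the answer is eligible, and whether it is supported only by passages annotated as relevant. The `Judgment` type and both function names are hypothetical. This summary does not define uRAF precisely; the sketch reads "unadjusted" as dropping the eligibility condition, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    eligible: bool                # answer addresses the user's request
    supported_by_relevant: bool   # claims rest only on passages labeled relevant


def raf(judgments):
    """Percentage of answers that are eligible AND supported by relevant passages."""
    if not judgments:
        raise ValueError("need at least one judgment")
    hits = sum(j.eligible and j.supported_by_relevant for j in judgments)
    return 100.0 * hits / len(judgments)


def uraf(judgments):
    """Assumed reading of uRAF: same as RAF without the eligibility condition."""
    if not judgments:
        raise ValueError("need at least one judgment")
    hits = sum(j.supported_by_relevant for j in judgments)
    return 100.0 * hits / len(judgments)


# Toy judgments for illustration only; not data from the paper.
toy = [
    Judgment(True, True),
    Judgment(True, False),
    Judgment(False, True),
    Judgment(True, True),
]
print(raf(toy), uraf(toy))  # 50.0 75.0
```

Under this reading uRAF can never be lower than RAF, which matches the ordering of the reported scores (63.8 vs. 60.0).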

Modeling

Base Model: Various (GPT-4o, Gemini 1.5 Flash, Claude 3.5 Sonnet, Nova Pro, Mistral, Qwen2.5)

Training Method: Not applicable (Evaluation paper)

Adaptation: None (Inference-only evaluation)

Trainable Parameters: 0

Compute: Not reported in the paper

Comparison to Prior Work

vs. RGB [not cited in paper]: GaRAGe offers fine-grained human annotations for *every* passage (35k+), whereas RGB relies largely on synthetic setups or simpler labeling.
vs. CRUD-RAG: GaRAGe focuses on the *grounding* quality and the model's ability to distinguish relevant from irrelevant context, rather than database operations.
vs. RAGAS: RAGAS computes metrics on unannotated retrieved contexts (assuming retrieval is 'correct' or checking faithfulness to whatever is retrieved). GaRAGe's RAF metric checks faithfulness *only* to passages humans labeled as relevant, penalizing hallucination from noise.

Limitations

Dependency on GPT-4o as a judge for final scoring, which may introduce bias despite human-verified ground truth.
Private knowledge base sources (Enron, etc.) are simulated via public datasets, which may not fully capture the complexity of proprietary enterprise data.
Evaluation is limited to English.
The 'Fast-Changing' category relies on the relative age of documents, which can be difficult to determine precisely for all web content.

Reproducibility

Code: https://github.com/amazon-science/GaRAGe

The dataset and evaluation prompts are publicly released on GitHub. The prompt templates used to generate the dataset (Question Construction) are in the paper's Appendix. The proprietary search engine used for retrieval is neither named nor released.

📊 Experiments & Results

Evaluation Setup

Open-domain and Private-domain QA with retrieval. Models must generate long-form answers with citations based on provided context.

Benchmarks:

GaRAGe (Retrieval-Augmented Generation (QA)) [New]

Metrics:

Relevance-Aware Factuality (RAF)
Unadjusted Relevance-Aware Factuality (uRAF)
Eligibility (faithfulness to user intent)
Deflection True Positive Rate (TPR)
Citation F1 (Attribution)

Statistical methodology: Not explicitly reported in the paper
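
The deflection metric listed above can be sketched as follows. A true positive is a refusal when grounding is insufficient, and a false positive is a refusal when grounding is sufficient. The function name and the record layout are hypothetical, and the toy records are illustrative rather than taken from the paper.

```python
def deflection_rates(records):
    """records: iterable of (grounding_sufficient, model_deflected) booleans.

    Returns (TPR, FPR) as percentages.
    """
    tp = fn = fp = tn = 0
    for sufficient, deflected in records:
        if not sufficient:
            # Deflection was the correct behavior.
            if deflected:
                tp += 1
            else:
                fn += 1
        else:
            # An answer was the correct behavior.
            if deflected:
                fp += 1
            else:
                tn += 1
    tpr = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    fpr = 100.0 * fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


toy_records = [
    (False, True), (False, False), (False, False), (False, False),
    (True, False), (True, False), (True, True), (True, False),
]
print(deflection_rates(toy_records))  # (25.0, 25.0)
```

A useful system needs a high TPR and a low FPR at the same time; refusing everything would maximize TPR but also FPR.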

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main results comparing model performance on Eligibility (answering the prompt well) and Relevance-Aware Factuality (answering using ONLY relevant docs).
GaRAGe	Eligibility	90.2	92.6	+2.4
GaRAGe	RAF (Relevance-Aware Factuality)	57.7	60.0	+2.3
GaRAGe	uRAF (Unadjusted Relevance-Aware Factuality)	62.3	63.8	+1.5
Deflection experiments measure the ability to refuse to answer when grounding is insufficient (True Positive) vs. refusing when grounding is sufficient (False Positive).
GaRAGe (Deflection Subset)	True Positive Rate (TPR)	26.9	31.1	+4.2
GaRAGe (Sufficient Subset)	False Positive Rate (FPR)	0.3	1.4	+1.1
Attribution results measure how accurately models cite the relevant sources in their generated answers.
GaRAGe	Citation F1	57.3	58.9	+1.6
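
As a quick arithmetic check, the Δ column can be recomputed as the "This Paper" value minus the "Baseline" value. The numbers below are copied from the table; the variable and function names are illustrative.

```python
# (metric, baseline, this paper, reported delta), copied from the table above
rows = [
    ("Eligibility", 90.2, 92.6, 2.4),
    ("RAF", 57.7, 60.0, 2.3),
    ("uRAF", 62.3, 63.8, 1.5),
    ("Deflection TPR", 26.9, 31.1, 4.2),
    ("FPR", 0.3, 1.4, 1.1),
    ("Citation F1", 57.3, 58.9, 1.6),
]


def recompute_deltas(table_rows):
    """Return (metric, recomputed delta) pairs, rounded to one decimal place."""
    return [(name, round(best - base, 1)) for name, base, best, _ in table_rows]


mismatches = [
    name
    for (name, delta), (_, _, _, reported) in zip(recompute_deltas(rows), rows)
    if delta != reported
]
print(mismatches)  # []
```

All six deltas are consistent. Note that for FPR a positive Δ means more refusals on questions whose grounding was sufficient, so a higher value is worse there.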

Experiment Figures

RAF scores broken down by temporal dynamism (Fast-Changing, Slow-Changing, Static).

Model performance (RAF score) as a function of grounding quality (percentage of relevant passages).
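
A sketch of how such a breakdown could be computed, assuming grounding quality is the percentage of an example's passages labeled relevant and that each example already has a pass/fail RAF judgment. The bucket width, function names, and toy data are assumptions for illustration, not the paper's procedure or results.

```python
from collections import defaultdict


def grounding_quality(labels):
    """Percentage of passages labeled 'relevant' (0.0 for an empty list)."""
    if not labels:
        return 0.0
    return 100.0 * sum(1 for lab in labels if lab == "relevant") / len(labels)


def bucket(pct, width=25):
    """Lower edge of the bucket containing pct; 100% joins the top bucket."""
    return min(int(pct // width) * width, 100 - width)


def raf_by_quality(items, width=25):
    """items: iterable of (passage_labels, passed_raf). Returns {bucket: RAF %}."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [passing answers, all answers]
    for labels, passed in items:
        b = bucket(grounding_quality(labels), width)
        totals[b][0] += int(passed)
        totals[b][1] += 1
    return {b: 100.0 * ok / n for b, (ok, n) in sorted(totals.items())}


# Toy examples for illustration only.
items = [
    (["relevant", "outdated", "related", "unknown"], False),
    (["relevant", "relevant", "related", "unknown"], True),
    (["relevant", "relevant", "relevant", "relevant"], True),
    (["relevant", "relevant", "relevant", "outdated"], False),
]
print(raf_by_quality(items))  # {25: 0.0, 50: 100.0, 75: 50.0}
```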

Main Takeaways

Over-summarization issue: Models tend to incorporate information from all retrieved chunks, failing to distinguish between relevant and irrelevant/outdated snippets (evidenced by low RAF scores).
Deflection failure: Even strong models like GPT-4o correctly refuse to answer only ~30% of the time when grounding is insufficient, posing a risk to reliable RAG.
Temporal reasoning gap: Models perform ~10% worse on 'Fast-Changing' questions, suggesting they struggle to identify which document is the most current.
Domain sensitivity: Performance drops significantly (>10%) on private/specific domains (e.g., Enron emails) compared to general Web search topics.

📚 Prerequisite Knowledge

Prerequisites

RAG (Retrieval-Augmented Generation) pipeline components
Evaluation metrics for text generation (Factuality, Hallucination)
Concept of 'Gold' vs. 'Silver' annotations

Key Terms

Grounding: The specific retrieved text passages provided to the LLM as context to answer a query

Relevance-Aware Factuality (RAF): A metric measuring the percentage of answers that are both eligible (responsive to the user's intent) and supported strictly by passages annotated as relevant

Deflection: The capability of an LLM to refuse to answer a question when the provided grounding information is insufficient or irrelevant

Parametric Knowledge: Information stored in the model's pre-trained weights, as opposed to information provided in the context window (non-parametric)

Attribution: The practice of citing specific source documents to support claims made in the generated answer

Reranking: The process of re-ordering retrieved documents to prioritize the most relevant ones before passing them to the generator

Cross-encoder: A model architecture that processes query and document pairs together to output a relevance score, often used for reranking
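
For the Attribution term above, a Citation F1 score can be sketched as set-based precision and recall over passage identifiers. This assumes citations and relevance annotations are both available as passage IDs; the paper's exact matching rules are not given in this summary, and `citation_f1` is a hypothetical helper.

```python
def citation_f1(cited_ids, relevant_ids):
    """F1 between the passages an answer cites and those annotated as relevant."""
    cited, relevant = set(cited_ids), set(relevant_ids)
    overlap = len(cited & relevant)
    if overlap == 0:
        # Covers empty inputs and the case of no correct citation.
        return 0.0
    precision = overlap / len(cited)     # share of citations that are relevant
    recall = overlap / len(relevant)     # share of relevant passages that are cited
    return 2 * precision * recall / (precision + recall)


# The answer cites passages 1, 2, 5; annotators marked 1, 2, 3, 4 as relevant.
score = citation_f1([1, 2, 5], [1, 2, 3, 4])
print(round(score, 3))  # 0.571
```

Here precision is 2/3 and recall is 1/2, so F1 is 4/7. Citing the irrelevant passage 5 lowers precision, and missing passages 3 and 4 lowers recall.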