CRAG--Comprehensive RAG Benchmark

📝 Paper Summary

Benchmark datasets, Modularized RAG pipeline

CRAG is a diverse factual QA benchmark for RAG systems featuring 4,409 questions with mock APIs and full HTML pages, revealing that even state-of-the-art RAG solutions struggle with dynamic, long-tail, and complex queries.

Core Problem

Existing RAG benchmarks (like NQ or MS MARCO) rely on static, text-only snippets and fail to represent the diverse, dynamic nature of real-world QA (e.g., varying popularity, temporal changes, and complex reasoning).

Why it matters:

Current LLMs achieve <35% accuracy on less popular (torso-to-tail) facts, and hallucinations remain a critical barrier to trustworthy systems
Standard metrics like ROUGE/F1 are insufficient for free-form generation; a reliable, fine-grained evaluation of truthfulness is missing
Real-world RAG systems must handle structured data (KGs), full HTML parsing, and time-sensitive information, which traditional static benchmarks ignore

Concrete Example: A question asking for 'most popular action movies in 2023' requires aggregating recent data. An LLM might hallucinate based on outdated training data, while a naive RAG system might fail to parse the structured list from a retrieved HTML page or mock API.

Key Novelty

Comprehensive RAG Benchmark (CRAG) with Mock Environments

Introduces a dataset of 4,409 QA pairs covering 5 domains (Finance, Sports, Music, Movie, and Open domain) and 8 question types (including complex ones like aggregation and false-premise)
Provides a realistic retrieval environment including mock Knowledge Graph (KG) APIs and up to 50 full HTML pages per question, simulating real-world search noise
Implements a scoring system that penalizes hallucinations more severely than 'I don't know' answers to prioritize trustworthiness

Architecture

Illustration of a RAG system workflow interacting with the CRAG benchmark

Evaluation Highlights

State-of-the-art industry RAG solutions answer only 63% of questions without hallucination, highlighting significant reliability gaps
Basic RAG improves LLM accuracy from ≤34% to 44%, but often introduces more hallucinations due to distraction by retrieval noise
Web search recall drops significantly for Knowledge Graph questions (74%) compared to Web questions (93%), validating the need for hybrid Web+KG retrieval

Breakthrough Assessment

9/10

A major step forward for RAG evaluation. It moves beyond simple Wikipedia retrieval to include mock APIs, full HTML, and temporal dynamics, setting a new standard for realism in QA benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering where a system uses external tools (Web search, KG APIs) to answer questions

Inputs: Natural language question Q

Outputs: Answer A, which the evaluator grades as Perfect, Acceptable, Missing, or Incorrect

Pipeline Flow

Input Question
Retrieval (Web Search / Mock APIs)
Ranking/Filtering
Generation (LLM)
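The four stages above can be sketched as a small Python pipeline. This is a minimal illustration, not the benchmark's reference code: the names `Passage`, `retrieve`, `rank`, `generate`, and `answer` are hypothetical, the word-overlap ranker is a toy stand-in for a real ranking model, and `kg_lookup` and `llm` are caller-supplied callables standing in for a mock API and a language model.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # "web" or "kg" (hypothetical labels for this sketch)
    text: str


def retrieve(question, web_pages, kg_lookup):
    """Collect candidate passages from cached web pages and a mock KG lookup."""
    passages = [Passage("web", page) for page in web_pages]
    passages += [Passage("kg", fact) for fact in kg_lookup(question)]
    return passages


def rank(question, passages, top_k=2):
    """Keep the top_k passages by word overlap with the question (toy ranker)."""
    q_words = set(question.lower().split())

    def overlap(passage):
        return len(q_words & set(passage.text.lower().split()))

    return sorted(passages, key=overlap, reverse=True)[:top_k]


def generate(question, context, llm):
    """Build a prompt from the ranked context and pass it to the LLM callable."""
    joined = "\n".join(f"[{p.source}] {p.text}" for p in context)
    prompt = f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)


def answer(question, web_pages, kg_lookup, llm):
    """Run retrieval, ranking/filtering, and generation in order."""
    passages = retrieve(question, web_pages, kg_lookup)
    context = rank(question, passages)
    return generate(question, context, llm)
```

Passing the retrieval sources and the model in as arguments keeps each stage swappable, which matches the modular framing of the pipeline; a real system would replace the overlap ranker and the `llm` callable with actual components.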

System Modules

Benchmark Dataset

Provide QA pairs and ground truths

Model or implementation: N/A

Mock Environment

Simulate external knowledge sources

Model or implementation: Mock APIs + Brave Search API

Evaluator

Grade system answers

Model or implementation: Model-based judge (ChatGPT + Llama 3)

Novel Architectural Elements

Integration of Mock APIs alongside web search results to test structured data querying capabilities
Inclusion of full HTML pages (not just snippets) to test parsing and extraction robustness
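Working from full HTML pages rather than snippets means a system has to strip markup before it can rank or summarize anything. The sketch below shows one simple way to do that with Python's standard `html.parser` module. It is an assumption about how a baseline might preprocess pages, not the benchmark's own parser, and the names `VisibleTextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML document as one space-joined string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Flattening a page this way discards list and table structure, which is one reason a naive system can lose the structured answer that the page actually contains.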

Modeling

Base Model: Evaluated multiple LLMs: GPT-4 Turbo, Llama 3 (8B/70B), Llama 2, Mixtral, Falcon, FLAN-T5

Comparison to Prior Work

vs. NQ/MS MARCO: CRAG includes mock APIs and full HTML pages, not just text snippets or Wikipedia [not cited in paper]
vs. TriviaQA: CRAG covers dynamic (time-sensitive) facts and varied entity popularity (Head/Torso/Tail)
vs. FreshQA [not cited in paper]: CRAG includes structured KG retrieval in addition to web search for handling dynamic facts

Limitations

Evaluation relies on LLM judges (ChatGPT/Llama 3), which may have biases despite high correlation with human judgment
Mock APIs are simulated and may not capture the full complexity of real-world API authentication or latency
Search retrieval component is partially fixed (cached pages provided) rather than fully open web access
Truthfulness metric heavily penalizes incorrect answers (-1), which might discourage risk-taking in generation

Reproducibility

Code: https://github.com/facebookresearch/CRAG/

The code is publicly available and includes the dataset, mock APIs, and evaluation scripts. Validation (30%) and Public Test (30%) sets are released; Private Test (40%) is held out for KDD Cup.

📊 Experiments & Results

Evaluation Setup

Factual QA with 3 tasks: Retrieval Summarization, KG/Web Retrieval Augmentation, End-to-end RAG

Benchmarks:

CRAG (Comprehensive RAG Benchmark) (Factual QA with Retrieval) [New]

Metrics:

Accuracy (Perfect + Acceptable)
Truthfulness (Score accounting for hallucinations)
Hallucination Rate
Statistical methodology: Not explicitly reported in the paper
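To make the three metrics concrete, the sketch below computes them from a list of per-answer grades. It uses the score mapping given in this summary (Perfect=1, Acceptable=0.5, Missing=0, Incorrect=-1) and assumes that the hallucination rate is the fraction of answers graded Incorrect; the function name `summarize` and the lowercase label strings are hypothetical.

```python
from collections import Counter

# Per-answer scores as described in this summary.
SCORES = {"perfect": 1.0, "acceptable": 0.5, "missing": 0.0, "incorrect": -1.0}


def summarize(labels):
    """Compute accuracy, hallucination rate, missing rate, and truthfulness."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    unknown = set(counts) - set(SCORES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    n = len(labels)
    return {
        "accuracy": (counts["perfect"] + counts["acceptable"]) / n,
        # Assumption: a hallucination is an answer graded Incorrect.
        "hallucination_rate": counts["incorrect"] / n,
        "missing_rate": counts["missing"] / n,
        "truthfulness": sum(SCORES[label] for label in labels) / n,
    }
```

Because Missing scores 0 and Incorrect scores -1, a system that abstains on a question it would otherwise get wrong raises its truthfulness without changing its accuracy.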

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Baseline performance of advanced LLMs without RAG shows significant limitations.
CRAG	Accuracy	Not applicable	34	Not applicable
CRAG	Truthfulness	Not applicable	20	Not applicable
Impact of adding RAG to LLMs.
CRAG	Accuracy	34	44	+10
CRAG	Truthfulness	20	20	0
Performance of State-of-the-Art Industry RAG solutions.
CRAG	Accuracy (Non-hallucinating)	44	63	+19

Experiment Figures

Web search recall curve relative to the number of retrieved pages

Breakdown of Truthfulness scores across dimensions: Domain, Dynamism, Popularity, and Question Type

Main Takeaways

Naive RAG improves accuracy (34% to 44%) but does not improve truthfulness (about 20% in both cases), because irrelevant retrieved content introduces new hallucinations
Questions about 'Head' entities are handled much better than 'Torso' or 'Tail' entities; GPT-4 truthfulness drops from 21% (Head) to 8% (Tail)
Knowledge Graph access (Task 2) improves truthfulness over Web-only (Task 1) because structured data is more precise and less noisy
Dynamic and real-time facts (Finance/Sports) are significantly harder than static facts, with much lower truthfulness scores
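Takeaways like these come from slicing the truthfulness score by a question attribute such as popularity or domain. The sketch below shows one way to compute such a breakdown. The record layout (a dict with a `label` key plus attribute keys) and the function name `truthfulness_by` are assumptions made for illustration, and the sample records in the usage note are invented, not results from the paper.

```python
from collections import defaultdict

# Per-answer scores as described in this summary.
SCORES = {"perfect": 1.0, "acceptable": 0.5, "missing": 0.0, "incorrect": -1.0}


def truthfulness_by(records, dimension):
    """Average the per-answer score within each value of the given dimension."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        key = record[dimension]
        totals[key] += SCORES[record["label"]]
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}
```

For example, calling `truthfulness_by(records, "popularity")` on records tagged `head`, `torso`, or `tail` returns one average score per popularity bucket, which is the shape of comparison behind the Head versus Tail takeaway.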

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with Knowledge Graphs (KG) and APIs
Basic knowledge of LLM hallucination and evaluation metrics

Key Terms

RAG: Retrieval-Augmented Generation, in which a system first retrieves relevant documents and then conditions the LLM's answer on them

Mock APIs: Simulated software interfaces that return structured data (e.g., stock prices) to mimic real-world tool usage in QA

KG: Knowledge Graph, a structured representation of facts as entities and the relations between them

Hallucination: Generated content that is factually incorrect or ungrounded

Head/Torso/Tail: Categories of entity popularity; 'Head' are very popular, 'Tail' are obscure/rare

Truthfulness: A metric defined in this paper as the average per-answer score, where Perfect=1, Acceptable=0.5, Missing=0, Incorrect=-1

Accuracy: Percentage of answers rated as Perfect or Acceptable (ignoring the penalty for Incorrect)