Corrective Retrieval Augmented Generation

📝 Paper Summary

Modularized RAG pipeline · Factuality

CRAG improves RAG robustness by using a lightweight evaluator to assess retrieved document quality and triggering different actions (Correct, Incorrect, Ambiguous) to refine, discard, or supplement information via web search.

Core Problem

Standard RAG systems indiscriminately incorporate retrieved documents regardless of relevance, causing hallucinations when retrieval fails or returns inaccurate results.

Why it matters:

Low-quality retrievers introduce irrelevant information that impedes models from acquiring accurate knowledge
Models can be misled by inaccurate retrieved documents, resulting in factual errors and hallucinations
A considerable portion of text within retrieved documents is often non-essential for generation and should not be equally referred to

Concrete Example: If a user asks 'Who is the CEO of Company X?' and the retriever returns an outdated document about a former CEO, a standard RAG model will likely answer with the former CEO's name because it trusts that stale context. CRAG's evaluator would assign the document a low confidence score and trigger a web search for current information.

Key Novelty

Corrective Retrieval-Augmented Generation (CRAG)

Retrieval Evaluator: A lightweight model (T5-large) that assesses the quality of retrieved documents and assigns each a confidence score
Action Trigger: Determines whether to trust the documents (Correct), discard them and use web search (Incorrect), or combine both sources (Ambiguous) based on confidence thresholds
Knowledge Refinement: A decompose-then-recompose algorithm that filters out irrelevant strips of information within valid documents to focus only on key insights

Architecture

Overview of the CRAG inference process.

Evaluation Highlights

Outperforms standard RAG by 15.4 accuracy points on Arc-Challenge and 36.6 accuracy points on PubHealth when using the SelfRAG-LLaMA2-7b generator
Improves on state-of-the-art Self-RAG by 20.0 accuracy points on PopQA and 36.9 FactScore points on Biography when using the LLaMA2-hf-7b generator
Demonstrates generalization across short-form (PopQA) and long-form (Biography) generation tasks with consistent gains

Breakthrough Assessment

8/10

Significant improvement in RAG robustness by explicitly addressing retrieval failures. The plug-and-play nature and the introduction of 'corrective' actions (like web search fallback) are highly practical contributions.

⚙️ Technical Details

Problem Definition

Setting: Given input X and a corpus C, retrieve documents D and generate output Y, ensuring robustness against inaccurate retrieval results.

Inputs: Input query X and retrieved documents D = {d_r1, ..., d_rk}

Outputs: Generated response Y

Pipeline Flow

Retriever (fetches initial documents)
Retrieval Evaluator (assigns confidence score)
Action Trigger (selects Correct, Incorrect, or Ambiguous path)
Knowledge Refinement / Web Search (optimizes context)
Generator (produces final output)
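
The flow above can be read as a small orchestration function. The following is a minimal sketch, not the authors' implementation: every component is passed in as a callable, and the names `crag_pipeline`, `retrieve`, `score`, `choose_action`, `refine`, `web_search`, and `generate` are hypothetical placeholders chosen for illustration.

```python
from typing import Callable, List


def crag_pipeline(
    query: str,
    retrieve: Callable[[str], List[str]],         # Retriever
    score: Callable[[str, str], float],           # Retrieval Evaluator
    choose_action: Callable[[List[float]], str],  # Action Trigger
    refine: Callable[[str, List[str]], str],      # Knowledge Refinement
    web_search: Callable[[str], str],             # Web Search
    generate: Callable[[str, str], str],          # Generator
) -> str:
    """Wire the five pipeline stages together (illustrative sketch only)."""
    docs = retrieve(query)
    scores = [score(query, doc) for doc in docs]
    action = choose_action(scores)

    if action == "correct":
        # Trust the retrieved documents, but keep only the refined knowledge.
        context = refine(query, docs)
    elif action == "incorrect":
        # Discard the retrieved documents and rely on web search instead.
        context = web_search(query)
    else:
        # Ambiguous: combine refined internal knowledge with web results.
        context = refine(query, docs) + "\n" + web_search(query)

    return generate(query, context)


if __name__ == "__main__":
    # Toy stand-ins for every component, for demonstration only.
    answer = crag_pipeline(
        "Who wrote Hamlet?",
        retrieve=lambda q: ["Hamlet is a tragedy by William Shakespeare."],
        score=lambda q, d: 0.9,
        choose_action=lambda s: "correct",
        refine=lambda q, docs: " ".join(docs),
        web_search=lambda q: "(web results)",
        generate=lambda q, ctx: f"Context used: {ctx}",
    )
    print(answer)
```

Passing the stages as callables mirrors the plug-and-play claim: the generator and retriever can be swapped without touching the corrective logic.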

System Modules

Retriever (Retrieval & Selection)

Retrieve top-K documents relevant to the input

Model or implementation: Contriever (used in experiments to match baselines)

Retrieval Evaluator (Retrieval & Selection)

Estimate relevance score of retrieved documents to input query

Model or implementation: T5-large (fine-tuned)

Action Trigger (Retrieval & Selection)

Determine processing path based on confidence scores

Model or implementation: Rule-based thresholds
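
A rule-based trigger of this kind could look like the sketch below. It assumes the rule is "Correct if at least one document scores above an upper threshold, Incorrect if every document scores below a lower threshold, Ambiguous otherwise"; the function name `choose_action` and the threshold values in the usage lines are illustrative assumptions, not values taken from the paper.

```python
from typing import List


def choose_action(scores: List[float], upper: float, lower: float) -> str:
    """Map per-document confidence scores to one of three actions.

    Assumed rule (illustrative):
      - "correct":   at least one score is above `upper`
      - "incorrect": all scores are below `lower` (or nothing was retrieved)
      - "ambiguous": everything in between
    """
    if upper < lower:
        raise ValueError("upper threshold must not be below lower threshold")
    if not scores:
        return "incorrect"
    best = max(scores)
    if best > upper:
        return "correct"
    if best < lower:  # the best score is below `lower`, so all scores are
        return "incorrect"
    return "ambiguous"


if __name__ == "__main__":
    # Thresholds here are made up for the demo.
    print(choose_action([0.9, -0.8], upper=0.5, lower=-0.5))   # correct
    print(choose_action([-0.9, -0.8], upper=0.5, lower=-0.5))  # incorrect
    print(choose_action([0.1, -0.8], upper=0.5, lower=-0.5))   # ambiguous
```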

Knowledge Refinement (Correct Action)

Extract critical knowledge strips from retrieved documents

Model or implementation: Decompose-then-recompose algorithm using Evaluator
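
The decompose-then-recompose idea can be sketched as: split each document into short strips, score every strip against the query, and concatenate the strips that pass a threshold. In the sketch below the strip size, the threshold, and the word-overlap scorer `overlap_score` are assumptions standing in for the fine-tuned evaluator; none of them come from the paper.

```python
import re
from typing import Callable, List


def split_into_strips(document: str, sentences_per_strip: int = 2) -> List[str]:
    """Decompose: cut a document into strips of a few sentences each."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_strip])
        for i in range(0, len(sentences), sentences_per_strip)
    ]


def refine_knowledge(
    query: str,
    documents: List[str],
    score: Callable[[str, str], float],
    threshold: float = 0.0,
    sentences_per_strip: int = 2,
) -> str:
    """Recompose: keep strips scoring above `threshold`, in original order."""
    kept = []
    for doc in documents:
        for strip in split_into_strips(doc, sentences_per_strip):
            if score(query, strip) > threshold:
                kept.append(strip)
    return "\n".join(kept)


def overlap_score(query: str, strip: str) -> float:
    """Toy scorer (assumption): fraction of query words found in the strip."""
    query_words = set(re.findall(r"\w+", query.lower()))
    strip_words = set(re.findall(r"\w+", strip.lower()))
    return len(query_words & strip_words) / len(query_words) if query_words else 0.0


if __name__ == "__main__":
    doc = (
        "Paris is the capital of France. It lies on the Seine. "
        "Bananas are rich in potassium. They grow in tropical climates."
    )
    print(refine_knowledge("capital of France", [doc], overlap_score, threshold=0.5))
```

In a real system the scorer would be the same evaluator used for whole documents, which is why the module lists "using Evaluator" as its implementation.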

Web Search (Incorrect Action) (Retrieval & Selection)

Search internet for complementary knowledge when internal retrieval fails

Model or implementation: Google Search API + ChatGPT (rewriter)
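
One way to picture the rewriter-plus-search combination is as two composed steps: turn the question into a keyword query, then fetch results. The sketch below is an assumption-laden illustration: `make_web_search`, `keyword_rewriter`, and the in-memory `fake_index` are hypothetical stand-ins, and no real search or LLM API is called.

```python
import re
from typing import Callable, Dict, List


def make_web_search(
    rewrite_query: Callable[[str], str],   # stand-in for the LLM-based rewriter
    search: Callable[[str], List[str]],    # stand-in for a search backend
    max_results: int = 5,
) -> Callable[[str], str]:
    """Build a web_search(question) function from a rewriter and a backend."""
    def web_search(question: str) -> str:
        keywords = rewrite_query(question)
        results = search(keywords)[:max_results]
        return "\n".join(results)
    return web_search


def keyword_rewriter(question: str) -> str:
    """Toy rewriter (assumption): lowercase and drop a few stopwords."""
    stopwords = {"who", "what", "is", "the", "of", "a", "an"}
    words = re.findall(r"\w+", question.lower())
    return " ".join(w for w in words if w not in stopwords)


if __name__ == "__main__":
    # Hypothetical in-memory "search engine" used only for the demo.
    fake_index: Dict[str, List[str]] = {
        "ceo company x": ["Company X names a new CEO."],
    }
    web_search = make_web_search(keyword_rewriter, lambda k: fake_index.get(k, []))
    print(web_search("Who is the CEO of Company X?"))
```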

Generator

Generate final response using refined context

Model or implementation: Various LLMs (LLaMA2-7B, SelfRAG-LLaMA2-7B)

Novel Architectural Elements

Tri-state action trigger system (Correct/Incorrect/Ambiguous) driven by a lightweight evaluator
Integration of on-demand web search specifically when internal retrieval is deemed 'Incorrect' or 'Ambiguous'
Decompose-then-recompose algorithm for fine-grained knowledge filtering

Modeling

Base Model: Evaluator: T5-large; Generator: LLaMA2-7B / SelfRAG-LLaMA2-7B

Training Method: Fine-tuning (Evaluator)

Objective Functions:

Purpose: Train evaluator to predict relevance.

Formally: Standard supervised fine-tuning on relevance labels (positive from dataset gold links, negative from random retrieval samples).

Adaptation: Fine-tuning

Training Data:

Relevance signals from PopQA (golden subject wiki titles)
Negative samples randomly sampled from retrieval results
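
The data construction described above can be sketched as follows. This is an illustration under stated assumptions: the record layout (`title`, `text`), the function name `build_training_pairs`, the label values 1.0 and -1.0, and the number of negatives per question are all hypothetical choices, not details confirmed by this summary.

```python
import random
from typing import Dict, List, Tuple


def build_training_pairs(
    question: str,
    retrieved: List[Dict[str, str]],  # each item: {"title": ..., "text": ...}
    gold_title: str,
    negatives_per_question: int = 1,
    seed: int = 0,
) -> List[Tuple[str, str, float]]:
    """Return (question, passage, label) triples for evaluator fine-tuning."""
    positives = [d for d in retrieved if d["title"] == gold_title]
    others = [d for d in retrieved if d["title"] != gold_title]

    # Negatives are drawn at random from the remaining retrieval results.
    rng = random.Random(seed)
    negatives = rng.sample(others, k=min(negatives_per_question, len(others)))

    pairs = [(question, d["text"], 1.0) for d in positives]    # assumed label
    pairs += [(question, d["text"], -1.0) for d in negatives]  # assumed label
    return pairs


if __name__ == "__main__":
    retrieved = [
        {"title": "Hamlet", "text": "Hamlet is a tragedy by Shakespeare."},
        {"title": "Macbeth", "text": "Macbeth is set in Scotland."},
        {"title": "Othello", "text": "Othello is set in Venice."},
    ]
    for pair in build_training_pairs("Who wrote Hamlet?", retrieved, "Hamlet"):
        print(pair)
```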

Key Hyperparameters:

model_size: 0.77B (Evaluator)
retrieval_count: 10 documents

Compute: Evaluator is lightweight (0.77B params) compared to LLMs

Comparison to Prior Work

vs. Self-RAG: CRAG focuses explicitly on correcting retrieval results via an external evaluator and web search, whereas Self-RAG relies on internal critique tokens and re-ranking
vs. Standard RAG: CRAG adds a corrective layer to filter irrelevant docs and seek external info, while standard RAG uses retrieved docs indiscriminately
vs. Toolformer: CRAG is a plug-and-play module for existing RAG systems rather than a model pre-trained specifically for tool use
vs. ITER-RETGEN [not cited in paper]: Iterative retrieval-generation, whereas CRAG is a single-pass corrective step

Limitations

Reliant on the quality of the web search API and the availability of internet access
Latency may increase due to the extra evaluation step and potential web search
Evaluator accuracy is critical; if the evaluator fails, the system may discard good documents or keep bad ones

Reproducibility

Code: https://github.com/HuskyInSalt/CRAG

Uses public datasets (PopQA, Biography, PubHealth, Arc-Challenge) and the Google Search API for web search.

📊 Experiments & Results

Evaluation Setup

Open-domain QA and generation tasks

Benchmarks:

PopQA (Short-form entity generation)
Biography (Long-form generation)
PubHealth (True-or-false question)
Arc-Challenge (Multiple-choice question)

Metrics:

Accuracy (PopQA, PubHealth, Arc-Challenge)
FactScore (Biography)
Statistical methodology: Not explicitly reported in the paper
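
Both metrics reduce to simple ratios once the hard part (matching answers, or extracting and verifying atomic facts) is done. The sketch below is a simplification for intuition only: `accuracy` uses exact string match, and `fact_score` takes a hypothetical `is_supported` callable in place of the knowledge-source verification used by the real FactScore metric.

```python
from typing import Callable, List


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    correct = sum(1 for p, r in zip(predictions, references) if p == r)
    return correct / len(predictions)


def fact_score(atomic_facts: List[str], is_supported: Callable[[str], bool]) -> float:
    """Fraction of atomic facts judged supported by a knowledge source."""
    if not atomic_facts:
        return 0.0
    supported = sum(1 for fact in atomic_facts if is_supported(fact))
    return supported / len(atomic_facts)


if __name__ == "__main__":
    print(accuracy(["A", "B"], ["A", "C"]))
    known = {"born in 1564", "wrote Hamlet", "was English"}
    facts = ["born in 1564", "wrote Hamlet", "was English", "lived in Paris"]
    print(fact_score(facts, lambda f: f in known))
```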

Key Results

CRAG consistently improves performance when applied to Self-RAG base models across all datasets.

Benchmark	Metric	Baseline	This Paper	Δ
PopQA	Accuracy	51.6	58.6	+7.0
Biography	FactScore	63.5	78.4	+14.9
PubHealth	Accuracy	39.0	75.6	+36.6
Arc-Challenge	Accuracy	33.7	49.1	+15.4

Comparison against State-of-the-Art Self-RAG (which uses critic tokens) using LLaMA2-hf-7b as the generator base.

Benchmark	Metric	Baseline	This Paper	Δ
PopQA	Accuracy	29.2	49.2	+20.0
PopQA	Accuracy	59.3	56.7	-2.6
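
The Δ column is an absolute difference in score points (This Paper minus Baseline), not a relative percentage. The short check below recomputes it from the rows as listed here and also shows the relative gain for comparison; the helper names are illustrative.

```python
# (benchmark, baseline, this_paper) exactly as listed in the results rows.
rows = [
    ("PopQA", 51.6, 58.6),
    ("Biography", 63.5, 78.4),
    ("PubHealth", 39.0, 75.6),
    ("Arc-Challenge", 33.7, 49.1),
    ("PopQA", 29.2, 49.2),
    ("PopQA", 59.3, 56.7),
]


def absolute_delta(baseline: float, ours: float) -> float:
    """Difference in score points, rounded to one decimal place."""
    return round(ours - baseline, 1)


def relative_gain_percent(baseline: float, ours: float) -> float:
    """Change relative to the baseline, as a percentage."""
    return round(100.0 * (ours - baseline) / baseline, 1)


if __name__ == "__main__":
    for name, baseline, ours in rows:
        print(
            f"{name}: {absolute_delta(baseline, ours):+.1f} points, "
            f"{relative_gain_percent(baseline, ours):+.1f}% relative"
        )
```

For example, the PubHealth row is a gain of 36.6 points, which corresponds to a relative improvement of roughly 94% over the baseline of 39.0.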

Main Takeaways

CRAG significantly improves robustness of RAG systems by actively correcting retrieval failures.
The method is plug-and-play and works with both standard LLaMA models and specialized models like Self-RAG.
The 'Ambiguous' action is crucial; forcing a binary Correct/Incorrect decision degrades performance, showing the value of a soft fallback.
Knowledge refinement (decompose-then-recompose) and web search integration are essential components, as removing either leads to performance degradation.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) framework
Basic understanding of language models and fine-tuning
Information Retrieval concepts (relevance scoring)

Key Terms

RAG: Retrieval-Augmented Generation, an approach in which a model answers questions by first retrieving relevant documents and then conditioning its generation on them

Self-RAG: A framework where a model learns to retrieve, generate, and critique its own output using special reflection tokens

CRAG: Corrective Retrieval Augmented Generation—the proposed method that evaluates and corrects retrieval results

FactScore: An evaluation metric that breaks down a generation into atomic facts and verifies how many are supported by a knowledge source (e.g., Wikipedia)

T5-large: A text-to-text transfer transformer model (~770M parameters) used here as the lightweight retrieval evaluator

Contriever: A dense information retrieval model used to fetch relevant documents from a corpus

PopQA: A dataset for short-form entity generation tasks

Arc-Challenge: A multiple-choice question dataset requiring reasoning

PubHealth: A true-or-false question dataset focused on public health claims

Decompose-then-recompose: A method to split documents into smaller strips, evaluate each strip's relevance, and reconstruct a clean context by concatenating only the relevant parts