
Retrieval Augmentation Reduces Hallucination in Conversation

Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, Jason Weston
Facebook AI Research
EMNLP (2021)
Tags: RAG · Factuality · QA

📝 Paper Summary

Topics: Modularized RAG Pipeline · Factuality in Dialogue
The authors adapt neural retrieval-augmented generation (RAG) architectures for open-domain dialogue, demonstrating that conditioning generation on retrieved knowledge significantly reduces hallucination compared to standard large language models.
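The retrieve-then-generate idea can be illustrated with a minimal sketch. The retriever and generator below are toy stand-ins (bag-of-words overlap and a template reply, not the paper's DPR retriever or BART generator); the point is only the structure: score documents against the dialogue context, then condition the response on the top document rather than on model-internal "memory".

```python
# Toy sketch of retrieval-augmented generation for dialogue.
# Assumption: 'score', 'retrieve', and 'generate' are illustrative names,
# not the paper's actual components (DPR retriever, BART generator).
from collections import Counter

def score(query: str, doc: str) -> float:
    """Bag-of-words overlap between query and document (toy retriever)."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values())

def retrieve(context: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the top-k documents for the dialogue context."""
    return sorted(docs, key=lambda d: score(context, d), reverse=True)[:k]

def generate(context: str, docs: list[str]) -> str:
    """Ground the reply in retrieved text instead of implicit parameters."""
    evidence = retrieve(context, docs, k=1)[0]
    return f"Based on what I found: {evidence}"

docs = [
    "Kyunghyun Cho is a professor of computer science at New York University.",
    "The Eiffel Tower is located in Paris, France.",
]
reply = generate("Tell me about Kyunghyun Cho", docs)
```

A non-augmented model in the same situation can only sample from whatever its weights happen to encode about the entity, which is where hallucinations enter.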
Core Problem
State-of-the-art dialogue models, despite their fluency, frequently 'hallucinate' knowledge—generating plausible but factually incorrect statements—because they rely solely on knowledge stored implicitly in their parameters rather than on external grounding.
Why it matters:
  • Large language models (like GPT-3) mix up facts between similar entities or make subtle token errors that render statements false
  • Existing RAG methods work for QA but struggle with complex multi-turn dialogue contexts, which require maintaining conversational flow alongside factuality
Concrete Example: When asked about 'Kyunghyun Cho', GPT-3 hallucinates that he is the 'most intelligent person on Earth', an 'ex-Go champion', and won awards he never won (e.g., NIPS 2013 Best Paper), whereas a retrieval-augmented model would ground the response in retrieved Wikipedia facts.
Key Novelty
Neural-Retriever-in-the-Loop for Dialogue
  • Adapts RAG and Fusion-in-Decoder (FiD) architectures specifically for multi-turn dialogue rather than just QA
  • Introduces 'RAG-Turn' to retrieve documents per dialogue turn, balancing local relevance with global context
  • Enhances retrieval via Poly-encoder re-ranking to allow finer-grained interaction between dialogue context and candidate documents
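The per-turn retrieval idea behind 'RAG-Turn' can be sketched as follows: query the retriever once per dialogue turn and pool the results, so the generator sees documents relevant to each turn rather than to a single fused context. The overlap scorer is a toy stand-in for the paper's neural retriever, and `retrieve_per_turn` is an illustrative name, not the paper's API.

```python
# Hedged sketch of per-turn retrieval ('RAG-Turn'): retrieve for each
# dialogue turn separately, then merge the pooled documents.
from collections import Counter

def overlap(a: str, b: str) -> int:
    """Toy lexical-overlap scorer standing in for a neural (DPR) retriever."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    return sum((ca & cb).values())

def retrieve_per_turn(turns: list[str], docs: list[str], k: int = 1) -> list[str]:
    """Retrieve top-k docs for each turn, deduplicating while keeping order."""
    pooled: list[str] = []
    for turn in turns:
        top = sorted(docs, key=lambda d: overlap(turn, d), reverse=True)[:k]
        for doc in top:
            if doc not in pooled:
                pooled.append(doc)
    return pooled

turns = ["Who directed Alien?", "And what is the movie about?"]
docs = [
    "Alien was directed by Ridley Scott.",
    "Alien follows the crew of the spaceship Nostromo.",
    "Bananas are rich in potassium.",
]
pooled = retrieve_per_turn(turns, docs)
```

Retrieving over the concatenated context alone would let the most recent turn dominate; pooling per-turn results keeps documents relevant to earlier turns in play, which is the balance between local relevance and global context the summary describes.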
Evaluation Highlights
  • Reduces hallucinated responses by over 60% compared to standard large language models according to human evaluations on Wizard of Wikipedia
  • Achieves state-of-the-art F1 scores on Wizard of Wikipedia (Test Unseen) with the RAG DPR-Poly model (+3.1 F1 over non-augmented BART)
  • Demonstrates superior generalization: relative Knowledge F1 gains over non-augmented baselines are larger on out-of-distribution topics (≈85%) than on in-distribution data (≈70%)
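The F1 numbers above are token-level unigram F1 between a generated response and a reference (for Knowledge F1, the reference is the gold knowledge sentence). A minimal sketch of the metric, assuming simple whitespace tokenization rather than the evaluation code's exact normalization:

```python
# Unigram F1: harmonic mean of token precision and recall between a
# prediction and a reference. Tokenization here is a simplification.
from collections import Counter

def unigram_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = Counter(pred) & Counter(ref)
    hits = sum(common.values())
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)

score = unigram_f1(
    "ridley scott directed alien",
    "alien was directed by ridley scott",
)
```

Here all four predicted tokens appear in the six-token reference, giving precision 1.0 and recall 2/3, hence F1 = 0.8.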
Breakthrough Assessment
8/10
Significant for establishing neural retrieval as a standard for reducing hallucination in dialogue. Successfully adapts QA-centric RAG/FiD to conversational settings with novel turn-based retrieval strategies.