Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity

📝 Paper Summary

Factuality Hallucination

This comprehensive survey defines the factuality issue in LLMs, distinguishes it from hallucination, analyzes error causes across model/retrieval/inference levels, and categorizes enhancement strategies for standalone and retrieval-augmented systems.

Core Problem

LLMs frequently generate content inconsistent with established facts, posing risks in high-stakes domains like medicine and law where reliability is critical.

Why it matters:

Incorrect information in search engines and chatbots can spread false beliefs or cause harm to users relying on them for decision-making
High-stakes domains (legal, medical, financial) require high factual accuracy; errors here can lead to lawsuits or jeopardize patient health
Existing surveys focus heavily on general hallucination but often overlook domain-specific factuality and the specific problem of outdated information

Concrete Example: A lawyer used ChatGPT to find legal precedents and submitted fabricated case law to the court, which led to sanctions. In another case, ChatGPT falsely claimed that a specific law professor had sexually harassed a student during a class trip, citing a nonexistent Washington Post article.

Key Novelty

Unified Taxonomy of LLM Factuality

Distinguishes 'Factuality' (consistency with established facts) from 'Hallucination' (nonsensical or untruthful generation), noting that hallucinations can sometimes be factual but unfaithful to the prompt
Categorizes error causes into three distinct levels: Model-Level (e.g., outdated info), Retrieval-Level (e.g., distracted by irrelevant context), and Inference-Level (e.g., snowballing)
Structures enhancement strategies into two streams: Standalone LLM improvements (pre-training, SFT, editing) and Retrieval-Augmented LLM improvements (interactive retrieval, retrieval adaptation)

Evaluation Highlights

No new quantitative results are reported; as a survey, the paper summarizes findings from other works
Provides a qualitative taxonomy of evaluation metrics: Rule-based, Neural, Human, and LLM-based
Catalogs major benchmarks including MMLU, TruthfulQA, and domain-specific sets like CMB (Medicine) and FLARE (Finance)

Breakthrough Assessment

8/10

A highly structured and comprehensive survey that clarifies the muddy distinction between factuality and hallucination, offering a valuable roadmap for researchers.

⚙️ Technical Details

Problem Definition

Setting: Evaluation and enhancement of Large Language Models regarding their adherence to factual information

Inputs: Natural language prompts requiring factual knowledge (commonsense, world knowledge, domain facts)

Outputs: Generated text responses

Pipeline Flow

Taxonomy Definition (Factuality vs. Hallucination)
Error Analysis (Model, Retrieval, Inference levels)
Evaluation Framework (Metrics, Benchmarks)
Enhancement Strategies (Standalone, RAG)

System Modules

Factuality Definition

Define the scope of factuality and distinguish it from hallucination, outdated info, and domain specificity

Model or implementation: Conceptual Framework

Error Analysis

Identify root causes of errors

Model or implementation: Conceptual Framework

Enhancement (Standalone)

Summarize methods to improve internal knowledge

Model or implementation: Various (Pre-training, SFT, RLHF, Decoding)

Enhancement (RAG)

Summarize methods to improve retrieval-augmented systems

Model or implementation: Various (Interactive Retrieval, Retrieval Adaptation)
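
To make the retrieval-augmented setting concrete, here is a minimal sketch of the basic retrieve-then-prompt loop that these methods build on. It is not taken from the survey: the token-overlap scorer, the function names (tokenize, overlap_score, retrieve, build_prompt), and the prompt wording are all assumptions chosen for illustration, and neither interactive retrieval nor retrieval adaptation is implemented here.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query: str, doc: str) -> int:
    """Count query tokens that also occur in the document (with multiplicity).

    A toy stand-in for a real retriever score (assumption, not from the survey).
    """
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    return sum((q & d).values())


def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k documents with the highest overlap score."""
    ranked = sorted(corpus, key=lambda doc: overlap_score(query, doc), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Place the retrieved passages in front of the question (hypothetical template)."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )


if __name__ == "__main__":
    toy_corpus = [
        "The Eiffel Tower is in Paris.",
        "Mount Everest is the highest mountain on Earth.",
    ]
    question = "Where is the Eiffel Tower?"
    print(build_prompt(question, retrieve(question, toy_corpus)))
```

The resulting prompt would then be passed to an LLM. A weak scorer like this one can easily surface an irrelevant passage, which is the kind of retrieval-level distraction the survey's error analysis describes.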

Novel Architectural Elements

Structured taxonomy separating error causes into Model-Level, Retrieval-Level, and Inference-Level
Separation of enhancement strategies into Standalone vs. Retrieval-Augmented tracks
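
For readers who want to tag errors or methods against this taxonomy, the two separations above can be written down as a small data structure. This is only a sketch: the class and variable names are hypothetical, and the example entries are limited to the causes and methods named in this summary rather than the survey's full lists.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorLevel(Enum):
    """The three cause levels in the survey's taxonomy."""
    MODEL = "model-level"
    RETRIEVAL = "retrieval-level"
    INFERENCE = "inference-level"


class EnhancementTrack(Enum):
    """The two enhancement tracks in the survey."""
    STANDALONE = "standalone"
    RETRIEVAL_AUGMENTED = "retrieval-augmented"


@dataclass(frozen=True)
class ErrorCause:
    name: str
    level: ErrorLevel


# Example causes taken from this summary; illustrative, not exhaustive.
CAUSES = [
    ErrorCause("outdated information", ErrorLevel.MODEL),
    ErrorCause("forgetting", ErrorLevel.MODEL),
    ErrorCause("distraction by irrelevant context", ErrorLevel.RETRIEVAL),
    ErrorCause("snowballing", ErrorLevel.INFERENCE),
]

# Example methods per track, again limited to those named in this summary.
METHODS = {
    EnhancementTrack.STANDALONE: ["pre-training", "SFT", "model editing"],
    EnhancementTrack.RETRIEVAL_AUGMENTED: [
        "interactive retrieval",
        "retrieval adaptation",
    ],
}


def causes_at(level: ErrorLevel) -> list[str]:
    """Return the names of the example causes filed under one level."""
    return [c.name for c in CAUSES if c.level == level]


if __name__ == "__main__":
    for level in ErrorLevel:
        print(level.value, "->", causes_at(level))
```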

Modeling

Base Model: Covers multiple models including GPT-4, ChatGPT, LLaMA, PaLM, BloombergGPT, etc.

Training Method: Survey of existing methods (SFT, RLHF, Model Editing)

Adaptation: Survey covers LoRA, Model Editing, etc.

Compute: Not applicable (Survey paper)

Comparison to Prior Work

vs. Chang et al. (2023) and Wang et al. (2023i): This survey focuses specifically on the *factuality* aspect rather than general evaluation, covering domain-specific issues and outdated information more deeply
vs. Ji et al. (2023a) and Rawte et al. (2023): This survey differentiates factuality from hallucination (hallucination can be factual but unfaithful; factuality is strictly about truthfulness), whereas prior surveys often conflate them

Limitations

As a survey, it does not propose a new model or report new experimental results
The definition of factuality relies on 'established facts' which can be ambiguous or subject to change
The fast pace of LLM development means some cited benchmarks or methods may quickly become dated

Reproducibility

Code: https://github.com/wangcunxiang/LLM-Factuality-Survey

The authors maintain this GitHub repository with related open-source materials, paper lists, and updates.

📊 Experiments & Results

Evaluation Setup

Survey of evaluation methodologies used in the field

Benchmarks:

MMLU (General Knowledge, 57 subjects)
TruthfulQA (Truthfulness; whether models repeat common human falsehoods)
C-Eval (Chinese Language Evaluation)
FreshQA (Dynamic/Changing Facts QA)
CMB (Medical Domain)
FLARE (Financial Domain)

Metrics:

Rule-based metrics (Exact Match)
Neural metrics
Human evaluation
LLM-based evaluation (LLM-as-a-judge)
Statistical methodology: Not explicitly reported in the paper
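
The rule-based Exact Match metric listed above can be sketched in a few lines. The normalization used here (lowercasing, stripping punctuation and English articles, collapsing whitespace) follows a common question-answering convention and is an assumption; the survey does not prescribe a specific normalization, and the function names are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: list[str]) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)


def exact_match_rate(predictions: list[str], references: list[list[str]]) -> float:
    """Fraction of predictions that exactly match one of their references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return hits / len(predictions)


if __name__ == "__main__":
    preds = ["The Eiffel Tower.", "Lyon"]
    refs = [["Eiffel Tower"], ["Marseille"]]
    print(exact_match_rate(preds, refs))
```

Exact Match is cheap and reproducible, but it marks a correct answer as wrong whenever the wording differs from every reference ("Paris, France" against "Paris"), which is one motivation for the neural, human, and LLM-based metrics in the same list.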

Main Takeaways

Factuality issues differ from hallucinations; hallucinations can be factually correct but unfaithful to the prompt, while factuality errors are strictly about incorrect information.
Error causes are multi-faceted: Model-level (forgetting, outdated info), Retrieval-level (distracting info), and Inference-level (snowballing).
Enhancement strategies are bifurcated: Standalone models need continual pre-training or editing, while RAG systems need better retrieval adaptation and interactive mechanisms.
Domain specificity is a major sub-field, with specialized benchmarks and models emerging for Law, Medicine, and Finance to address high-stakes factuality.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and their training pipeline (Pre-training, SFT, RLHF)
Familiarity with Retrieval-Augmented Generation (RAG)
Basic knowledge of evaluation metrics in NLP

Key Terms

Factuality: The probability of LLMs producing content consistent with established facts (grounded in reliable sources)

Hallucination: The tendency of models to produce content that is nonsensical or untruthful in relation to sources, or unfaithful to the prompt (even if factually correct)

Snowballing: An inference-level error where an initial incorrect generation leads the model to produce further incorrect content that stays consistent with that first mistake

Exposure Bias: A discrepancy between training (ground truth available) and inference (model relies on own predictions), potentially leading to error propagation

RAG: Retrieval-Augmented Generation—systems that retrieve external documents to ground generation

SFT: Supervised Fine-Tuning—training on labeled examples to align the model

CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps

Model Editing: Techniques to directly update specific facts within the model's parameters without full re-training

Standalone LLMs: LLMs that rely solely on their internal parametric knowledge without external retrieval

MMLU: Massive Multitask Language Understanding—a benchmark covering 57 subjects to test world knowledge

TruthfulQA: A benchmark specifically designed to test whether models mimic human falsehoods or generate truthful answers