
Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity

Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Wenyang Gao, Xuming Hu, Zehan Qi, Yidong Wang, Linyi Yang, Jindong Wang, Xing Xie, Zheng Zhang, Yue Zhang
Westlake University, Zhejiang University, Microsoft Research
arXiv (2023)
Factuality RAG Benchmark

📝 Paper Summary

Factuality Hallucination
This comprehensive survey defines the factuality issue in LLMs, distinguishes it from hallucination, analyzes error causes across model/retrieval/inference levels, and categorizes enhancement strategies for standalone and retrieval-augmented systems.
Core Problem
LLMs frequently generate content inconsistent with established facts, posing risks in high-stakes domains like medicine and law where reliability is critical.
Why it matters:
  • Incorrect information in search engines and chatbots can spread false beliefs or cause harm to users relying on them for decision-making
  • High-stakes domains (legal, medical, financial) require high factual accuracy; errors here can lead to lawsuits or jeopardize patient health
  • Existing surveys focus heavily on general hallucination but often overlook domain-specific factuality and the specific problem of outdated information
Concrete Example: A lawyer used ChatGPT to find legal precedents and submitted hallucinated case law to court, leading to sanctions. In another case, ChatGPT falsely claimed a specific law professor had sexually harassed a student during a class trip, citing a non-existent Washington Post article.
Key Novelty
Unified Taxonomy of LLM Factuality
  • Distinguishes 'Factuality' (consistency with established facts) from 'Hallucination' (nonsensical or untruthful generation), noting that hallucinations can sometimes be factual but unfaithful to the prompt
  • Categorizes error causes into three distinct levels: Model-Level (e.g., outdated info), Retrieval-Level (e.g., distracted by irrelevant context), and Inference-Level (e.g., snowballing)
  • Structures enhancement strategies into two streams: Standalone LLM improvements (pre-training, SFT, editing) and Retrieval-Augmented LLM improvements (interactive retrieval, retrieval adaptation)
Evaluation Highlights
  • No new experimental results are reported; this is a survey paper that summarizes other works
  • Provides a qualitative taxonomy of evaluation metrics: Rule-based, Neural, Human, and LLM-based
  • Catalogs major benchmarks including MMLU, TruthfulQA, and domain-specific sets like CMB (Medicine) and FLARE (Finance)
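As a rough illustration of the rule-based category in the metric taxonomy above, the following sketch scores a model answer against a reference with exact match and token-level F1. The function names and the normalization steps (lowercasing, stripping punctuation and English articles) are assumptions chosen for illustration; they are not taken from the survey, and individual benchmarks may define their own scoring rules.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, split into tokens.

    This normalization is an illustrative assumption, not a benchmark standard.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction, reference):
    """True when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower.", "eiffel tower"))
    print(round(token_f1("Paris, France", "Paris"), 3))
```

Metrics of this kind are cheap and reproducible, but they only measure surface overlap, which is one reason the taxonomy also lists neural, human, and LLM-based evaluation.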
Breakthrough Assessment
8/10
A highly structured and comprehensive survey that clarifies the muddy distinction between factuality and hallucination, offering a valuable roadmap for researchers.