Survey on Factuality in Large Language Models

Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Qipeng Guo, Xiangkun Hu, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Xuming Hu, Zehan Qi, Wenyang Gao, Yidong Wang, Linyi Yang, Jindong Wang, Xing Xie, Zheng Zhang, Yue Zhang
Westlake University, Zhejiang University, Microsoft Research
ACM Computing Surveys (2025)
Factuality, RAG, Benchmark, Pretraining

📝 Paper Summary

Factuality Evaluation, Factuality Enhancement, Hallucination Analysis
This survey provides a comprehensive taxonomy of factuality in LLMs, distinguishing it from hallucination, analyzing error causes across model/retrieval/inference levels, and categorizing enhancement strategies for standalone and retrieval-augmented systems.
Core Problem
LLMs frequently generate content inconsistent with established facts (factuality issues), ranging from domain knowledge deficits to outdated information and reasoning failures.
Why it matters:
  • Factual errors in high-stakes domains like medicine and law can cause irreversible harm (e.g., wrong medical advice, fabricated legal precedents).
  • Reliability is essential for integration into search engines and agents; a single factual mistake by an autonomous agent could cause physical damage.
  • Existing surveys often conflate factuality with hallucination or overlook specific sub-problems like outdated information and domain specificity.
Concrete Example: When asked 'When was Kyiv attacked by Russia?', ChatGPT (with a September 2021 knowledge cutoff) claims that Russia has not attacked, missing the February 2022 event (an outdated-information error). Similarly, a lawyer faced sanctions after using ChatGPT to generate a legal brief that cited non-existent cases (a fabrication error).
Key Novelty
Unified Factuality Taxonomy
  • Distinguishes 'Factuality' from 'Hallucination': factuality concerns consistency with established facts (e.g., answering 'I don't know' is not a factual error, though it is not useful), whereas hallucination encompasses any unfaithful generation.
  • Categorizes error causes into three distinct levels: Model-Level (storage/reasoning), Retrieval-Level (misinformation/distraction), and Inference-Level (snowballing/decoding errors).
  • Structurally divides enhancement strategies into Standalone LLM approaches (pretraining, SFT, editing) and Retrieval-Augmented approaches (interactive, adaptive, structured data retrieval).
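The taxonomy above can be written down as a small lookup structure. This is only an illustrative sketch: the names `ErrorLevel`, `ERROR_CAUSES`, `ENHANCEMENT_STRATEGIES`, and `level_of` are hypothetical and do not come from the survey; the category labels are copied from the bullets above.

```python
from enum import Enum


class ErrorLevel(Enum):
    """The three error-cause levels named in the summary."""
    MODEL = "model"
    RETRIEVAL = "retrieval"
    INFERENCE = "inference"


# Causes listed in the summary, grouped by level (labels only, no definitions).
ERROR_CAUSES = {
    ErrorLevel.MODEL: ("storage", "reasoning"),
    ErrorLevel.RETRIEVAL: ("misinformation", "distraction"),
    ErrorLevel.INFERENCE: ("snowballing", "decoding errors"),
}

# Enhancement strategies, split by system type as in the summary.
ENHANCEMENT_STRATEGIES = {
    "standalone": ("pretraining", "SFT", "editing"),
    "retrieval_augmented": ("interactive", "adaptive", "structured data retrieval"),
}


def level_of(cause: str) -> ErrorLevel:
    """Return the level under which a given cause label is filed."""
    for level, causes in ERROR_CAUSES.items():
        if cause in causes:
            return level
    raise KeyError(f"unknown cause: {cause}")
```

For example, `level_of("snowballing")` returns `ErrorLevel.INFERENCE`, and an unlisted label raises `KeyError`.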
Evaluation Highlights
  • Surveys benchmarks like MMLU (Measuring Massive Multitask Language Understanding) and TruthfulQA, highlighting that models like GPT-4 outperform predecessors by >20 points on medical exams (USMLE) without fine-tuning.
  • Identifies that retrieval augmentation is critical for outdated information: standalone models consistently fail on events that occurred after their training cutoff (e.g., Kyiv attack date).
  • Highlights domain-specific gaps: general-purpose LLMs often fail on specialized queries (e.g., in finance, general models hallucinate CEO names, which motivates domain-specific models such as BloombergGPT).
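The outdated-information point can be illustrated with a toy model of a knowledge cutoff. This is a sketch under loud assumptions, not the survey's method: `Fact`, `WORLD_FACTS`, `CUTOFF`, `standalone_answer`, and `retrieval_augmented_answer` are all hypothetical names, the "model" is a dictionary-style lookup rather than an LLM, and the toy standalone system abstains where a real model might instead assert something false.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    question: str
    answer: str
    event_date: date


# Assumed cutoff for illustration (September 2021, as in the example above).
CUTOFF = date(2021, 9, 30)

# Hypothetical external store that is kept up to date.
WORLD_FACTS = [
    Fact("When was Kyiv attacked by Russia?", "February 2022", date(2022, 2, 24)),
]


def standalone_answer(question: str, cutoff: date = CUTOFF) -> str:
    """Toy standalone system: it only 'knows' facts dated on or before the cutoff."""
    for fact in WORLD_FACTS:
        if fact.question == question and fact.event_date <= cutoff:
            return fact.answer
    return "I don't know"


def retrieval_augmented_answer(question: str) -> str:
    """Toy retrieval-augmented system: consult the external store first."""
    for fact in WORLD_FACTS:
        if fact.question == question:
            return fact.answer
    # Nothing retrieved: fall back to what the standalone system knows.
    return standalone_answer(question)
```

Here the standalone function cannot answer the Kyiv question because the event postdates the assumed cutoff, while the retrieval-augmented function finds it in the external store.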
Breakthrough Assessment
9/10
A highly comprehensive survey that creates a necessary distinction between factuality and hallucination, offering a structured roadmap for an increasingly critical field.