
Head-to-Tail: How Knowledgeable are Large Language Models (LLMs)? A.K.A. Will LLMs Replace Knowledge Graphs?

Kai Sun, Yifan Ethan Xu, Hanwen Zha, Yue Liu, Xin Luna Dong
Meta Reality Labs
arXiv (2023)
Factuality Benchmark KG QA

📝 Paper Summary

LLM Factuality Evaluation Knowledge Internalization
The paper introduces a benchmark to quantify LLM factuality, revealing that model accuracy significantly degrades from popular (head) to niche (tail) entities, with even advanced models struggling on the tail.
Core Problem
It is hard to measure how much factual knowledge LLMs actually hold and how often they hallucinate instead, especially across the full range of entity popularity (from famous to obscure).
Why it matters:
  • LLMs are increasingly replacing Knowledge Graphs (KGs) for information seeking, but their reliability on long-tail facts is unknown
  • Existing benchmarks neither sample world knowledge evenly nor separate results by popularity tier
  • Hallucinations may stem from either missing parametric knowledge or faulty generation, so an evaluation needs a way to distinguish 'unsure' from 'wrong'
Concrete Example: When asked about a 'torso' or 'tail' entity (e.g., a less famous academic or movie), an LLM might confidently hallucinate details rather than admit ignorance, whereas it answers correctly for 'head' entities like Michael Jordan.
Key Novelty
Head-to-Tail Benchmark & Evaluation Protocol
  • Constructs a dataset of 18K QA pairs bucketed into Head, Torso, and Tail based on entity popularity (traffic/density), covering domains like Movies, Books, and Academics
  • Proposes metrics that separate Accuracy, Hallucination, and Missing rates, and encourages models to answer 'unsure' when they do not know, so that knowledge gaps are counted as missing rather than as hallucinations
  • Uses an automated LLM-based evaluation (ChatGPT) that correlates highly with human judgment to scale the assessment of factual correctness
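The Head/Torso/Tail bucketing described above can be sketched as follows. This summary states only that buckets are based on entity popularity (traffic/density). The rule used here, sorting entities by descending popularity and cutting at 1/3 and 2/3 of the cumulative popularity mass, is an assumption for illustration, and `bucket_by_popularity` and its input format are hypothetical names.

```python
from typing import Dict


def bucket_by_popularity(popularity: Dict[str, float]) -> Dict[str, str]:
    """Assign each entity to 'head', 'torso', or 'tail'.

    Assumption (not confirmed by this summary): entities are ranked by
    descending popularity and split at 1/3 and 2/3 of the cumulative
    popularity mass.
    """
    total = sum(popularity.values())
    if total <= 0:
        raise ValueError("popularity scores must sum to a positive value")

    # Most popular entities first.
    ranked = sorted(popularity.items(), key=lambda kv: kv[1], reverse=True)

    buckets: Dict[str, str] = {}
    cumulative = 0.0
    for entity, score in ranked:
        # Decide the bucket from the mass accumulated *before* this entity,
        # so the most popular entity always lands in 'head'.
        share_before = cumulative / total
        if share_before < 1 / 3:
            buckets[entity] = "head"
        elif share_before < 2 / 3:
            buckets[entity] = "torso"
        else:
            buckets[entity] = "tail"
        cumulative += score
    return buckets


if __name__ == "__main__":
    # Made-up popularity scores, for illustration only.
    toy = {"a": 50, "b": 20, "c": 10, "d": 10, "e": 5, "f": 5}
    print(bucket_by_popularity(toy))
```

Under a rule like this, a handful of very popular entities fill the head bucket while the tail holds many entities with little popularity each, which is the skew the benchmark is designed to expose.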
Evaluation Highlights
  • GPT-4 achieves only 48% accuracy on Head entities in the open domain, dropping significantly for Torso and Tail entities
  • Llama-2-70B shows a stark degradation in accuracy from Head to Tail, accompanied by an increasing hallucination rate
  • Instruction tuning (e.g., Vicuna vs. LLaMA) increases the 'Missing' rate (more 'unsure' answers) but does not necessarily improve factual accuracy
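The Accuracy, Hallucination, and Missing rates referred to in these highlights can be computed along the following lines. This is a minimal sketch, assuming each graded answer carries exactly one of three labels (correct, hallucinated, or missing for an 'unsure' response). The label strings and the function name `factuality_rates` are hypothetical, and the grading step itself (done by ChatGPT in the paper) is not shown.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical label names; every graded answer gets exactly one of them.
LABELS = ("correct", "hallucinated", "missing")


def factuality_rates(labels: Iterable[str]) -> Dict[str, float]:
    """Return the accuracy, hallucination, and missing rates as fractions.

    Because every answer has exactly one label, the three rates sum to 1.
    """
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no graded answers provided")
    return {
        "accuracy": counts["correct"] / total,
        "hallucination": counts["hallucinated"] / total,
        "missing": counts["missing"] / total,
    }


if __name__ == "__main__":
    # Made-up grades for four answers, for illustration only.
    graded = ["correct", "correct", "hallucinated", "missing"]
    print(factuality_rates(graded))
```

Reporting the three rates together shows why a higher Missing rate is not automatically an improvement: 'unsure' answers can replace correct answers as easily as they replace hallucinated ones.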
Breakthrough Assessment
8/10
Provides the first systematic quantification of the 'knowledge gap' in LLMs across popularity tiers, challenging the assumption that scaling alone solves factual recall for long-tail knowledge.