Head-to-Tail: How Knowledgeable are Large Language Models (LLMs)? A.K.A. Will LLMs Replace Knowledge Graphs?

📝 Paper Summary

LLM Factuality Evaluation, Knowledge Internalization

The paper introduces a benchmark to quantify LLM factuality, revealing that model accuracy significantly degrades from popular (head) to niche (tail) entities, with even advanced models struggling on the tail.

Core Problem

It is difficult to assess how much factual knowledge LLMs truly possess versus hallucinate, especially across the full distribution of entity popularity (from famous to obscure).

Why it matters:

LLMs are increasingly replacing Knowledge Graphs (KGs) for information seeking, but their reliability on long-tail facts is unknown
Existing benchmarks do not represent the uniform distribution of world knowledge or distinguish between popularity tiers
Hallucinations may stem from either a lack of parameterized knowledge or generative dysfunction, requiring a way to distinguish 'unsure' from 'wrong'

Concrete Example: When asked about a 'torso' or 'tail' entity (e.g., a less famous academic or movie), an LLM might confidently hallucinate details rather than admit ignorance, whereas it answers correctly for 'head' entities like Michael Jordan.

Key Novelty

Head-to-Tail Benchmark & Evaluation Protocol

Constructs a dataset of 18K QA pairs bucketed into Head, Torso, and Tail based on entity popularity (traffic/density), covering domains like Movies, Books, and Academics
Proposes metrics distinguishing Accuracy, Hallucination, and Missing rates, incentivizing models to output 'unsure' to measure true knowledge gaps
Uses an automated LLM-based evaluation (ChatGPT) that correlates highly with human judgment to scale the assessment of factual correctness

Evaluation Highlights

GPT-4 achieves only 48% accuracy on Head entities in the open domain, dropping significantly for Torso and Tail entities
Llama-2-70B shows a stark degradation in accuracy from Head to Tail, accompanied by an increasing hallucination rate
Instruction tuning (e.g., Vicuna vs. LLaMA) increases the 'Missing' rate (more 'unsure' answers) but does not necessarily improve factual accuracy

Breakthrough Assessment

8/10

Provides the first systematic quantification of the 'knowledge gap' in LLMs across popularity tiers, challenging the assumption that scaling alone solves factual recall for long-tail knowledge.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering (QA) assessing factual knowledge across popularity distributions

Inputs: Simple factual questions (e.g., 'Where was X born?')

Outputs: Concise textual answer or 'unsure'

Pipeline Flow

Entity Selection & Bucketing (Head/Torso/Tail)
Question Generation (Templates)
Model Inference (QA)
Automated Evaluation (ChatGPT Judge)

System Modules

Benchmark Construction

Select entities from sources (DBpedia, IMDb, etc.), calculate popularity, and generate template-based questions

Model or implementation: Script-based + ChatGPT for template drafting
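
Template-based question generation can be sketched as follows. This is a minimal illustration only: the template wording, the attribute names, and the `make_question` helper are hypothetical and are not taken from the paper's released templates.

```python
from string import Template

# Hypothetical templates keyed by (domain, attribute); the wording is an
# assumption for illustration, not the paper's actual templates.
TEMPLATES = {
    ("movie", "director"): Template("Who directed the movie $entity?"),
    ("book", "author"): Template("Who is the author of the book $entity?"),
    ("academics", "affiliation"): Template("Which institution is $entity affiliated with?"),
}


def make_question(domain, attribute, entity, answer, bucket):
    """Fill the matching template and return one QA record."""
    question = TEMPLATES[(domain, attribute)].substitute(entity=entity)
    return {
        "domain": domain,
        "bucket": bucket,      # "head", "torso", or "tail"
        "question": question,
        "answer": answer,      # ground truth taken from the source attribute
    }


record = make_question("movie", "director", "Inception", "Christopher Nolan", "head")
print(record["question"])
```

Keeping the bucket and domain on each record makes it straightforward to break results down by popularity tier later.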

Inference Engine (Evaluation)

Generate answers to benchmark questions with instructions to be concise and admit ignorance

Model or implementation: Various LLMs (e.g., GPT-4, Llama-2-70B)
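
A rough sketch of the inference-side prompting described above, assuming a plain-text prompt and a simple string check for abstentions. The prompt wording and both helper names are hypothetical; the paper's exact prompts are in its appendix.

```python
def build_prompt(question):
    """Build a QA prompt that asks for a concise answer and permits abstaining.

    The wording here is an assumption for illustration only.
    """
    return (
        "Answer the following question in as few words as possible. "
        'Say "unsure" if you don\'t know.\n'
        f"Question: {question}\n"
        "Answer:"
    )


def is_missing(response):
    """Treat an empty reply or a bare 'unsure' as a missing answer."""
    text = response.strip().lower().rstrip(".")
    return text == "" or text == "unsure"


prompt = build_prompt("Where was Michael Jordan born?")
print(prompt)
print(is_missing(" Unsure. "))
```

Separating abstention detection from correctness judging keeps the Missing rate independent of whatever judge is used for the remaining answers.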

Evaluator (Evaluation)

Judge correctness of answers against ground truth

Model or implementation: ChatGPT (gpt-3.5-turbo-0301)
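
The judging step can be approximated with a prompt builder and a strict verdict parser. This is a sketch under stated assumptions: the prompt text and the one-word reply format are hypothetical, and the call to the judge model itself is deliberately left out.

```python
def build_judge_prompt(question, ground_truth, prediction):
    """Compose a grading prompt for an LLM judge (hypothetical wording)."""
    return (
        "You are grading an answer to a factual question.\n"
        f"Question: {question}\n"
        f"Ground truth: {ground_truth}\n"
        f"Prediction: {prediction}\n"
        "Reply with exactly one word: correct or incorrect."
    )


def parse_verdict(judge_reply):
    """Map the judge's reply to 'correct' or 'incorrect'; fail loudly otherwise."""
    text = judge_reply.strip().lower()
    # Check 'incorrect' first for clarity, although neither prefix contains the other.
    if text.startswith("incorrect"):
        return "incorrect"
    if text.startswith("correct"):
        return "correct"
    raise ValueError(f"Unrecognized judge reply: {judge_reply!r}")


print(parse_verdict("Correct."))
```

Raising on an unrecognized reply, rather than defaulting to one label, avoids silently biasing the accuracy or hallucination rate.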

Novel Architectural Elements

Head-to-Tail bucketing logic: Partitioning entities based on cumulative popularity (traffic or density) to create distinct evaluation tiers
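
One way to read the bucketing logic is sketched below, assuming entities are ranked by popularity and cut where the cumulative popularity share crosses one third and two thirds. The thresholds follow the "top 33% cumulative popularity" definition in Key Terms; the boundary handling and the `bucket_entities` name are assumptions.

```python
def bucket_entities(popularity, head_cut=1 / 3, torso_cut=2 / 3):
    """Assign each entity to head/torso/tail by cumulative popularity share.

    `popularity` maps entity name -> non-negative score (e.g., traffic).
    An entity's bucket is decided by the share of total popularity held by
    the entities ranked above it, so the top entity is always 'head'.
    """
    total = sum(popularity.values())
    if total <= 0:
        raise ValueError("Total popularity must be positive.")

    ranked = sorted(popularity.items(), key=lambda kv: kv[1], reverse=True)
    buckets = {}
    cumulative = 0
    for name, score in ranked:
        share_before = cumulative / total
        if share_before < head_cut:
            buckets[name] = "head"
        elif share_before < torso_cut:
            buckets[name] = "torso"
        else:
            buckets[name] = "tail"
        cumulative += score
    return buckets


# Toy scores, purely illustrative.
buckets = bucket_entities({"a": 30, "b": 20, "c": 15, "d": 15, "e": 10, "f": 10})
print(buckets)
```

Because the cut is on cumulative popularity rather than entity count, real long-tailed data would typically place few entities in the head bucket and many in the tail.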

Modeling

Base Model: Evaluation covers 16 LLMs including GPT-4, ChatGPT, Llama 2 (70B), LLaMA (7B-65B), Vicuna, Flan-T5, RWKV, Falcon

Training Method: Not applicable (Evaluation paper)

Adaptation: None (Inference only)

Trainable Parameters: None (Frozen models)

Compute: Inference performed on A100 (80GB) GPUs using float16/bfloat16 formats

Comparison to Prior Work

vs. PopQA [not cited in paper]: Head-to-Tail uses multi-domain sources (IMDb, Goodreads) and traffic-based popularity, not just Wikipedia page views
vs. TruthfulQA [not cited in paper]: Focuses on factual recall across popularity tiers rather than mimicked falsehoods or misconceptions
Novelty: First benchmark designed explicitly to measure knowledge retention across Head, Torso, and Tail entities

Limitations

Evaluation proxy relies on simple questions, which may not capture robust understanding or reasoning
Does not assess taxonomy or type hierarchy knowledge
Traffic/density metrics are approximations of popularity and may vary by source
Limited to entities that existed before 2020-2022 (to avoid recency bias), which excludes the newest knowledge

Reproducibility

Code: https://github.com/facebookresearch/head-to-tail

Benchmark dataset publicly available at https://github.com/facebookresearch/head-to-tail. Code for evaluation metrics provided. Specific prompt templates included in Appendix. Model weights for open models (Llama, Vicuna, etc.) are standard public releases.

📊 Experiments & Results

Evaluation Setup

Zero-shot and Few-shot QA on 18K questions across 4 domains (Movie, Book, Academics, Open/DBpedia)

Benchmarks:

Head-to-Tail (Factual Question Answering) [New]

Metrics:

Accuracy (A_LM): Correctness judged by ChatGPT
Hallucination Rate (H_LM): Incorrect answers excluding 'unsure'
Missing Rate (M): 'Unsure' or empty answers
Exact Match (EM)
Token F1
ROUGE-L
Statistical methodology: Not explicitly reported in the paper
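
The metrics above can be computed along the following lines. This is a sketch, assuming each answer has already been labeled 'correct', 'incorrect', or 'missing', and using a simple lowercase, punctuation-stripping normalization for EM and token F1. The normalization rules and function names are assumptions, not the paper's released code.

```python
import re
from collections import Counter

VALID_LABELS = {"correct", "incorrect", "missing"}


def rates(labels):
    """Return accuracy (A), hallucination (H), and missing (M) rates in percent."""
    counts = Counter(labels)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"Unknown labels: {sorted(unknown)}")
    n = sum(counts.values())
    if n == 0:
        raise ValueError("No labels given.")
    return {
        "A": 100 * counts["correct"] / n,
        "H": 100 * counts["incorrect"] / n,
        "M": 100 * counts["missing"] / n,
    }


def tokens(text):
    """Lowercase, drop punctuation, and split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def exact_match(prediction, gold):
    return tokens(prediction) == tokens(gold)


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = tokens(prediction), tokens(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rates(["correct", "incorrect", "missing", "correct"]))
print(token_f1("Christopher Nolan", "Nolan"))
```

Since every answer falls into exactly one of the three labels, A, H, and M sum to 100, so a lower hallucination rate can come from either more correct answers or more abstentions.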

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Head-to-Tail (Overall)	Accuracy (A_LM)	16.8	31.0	+14.2
Head-to-Tail (Overall)	Accuracy (A_LM)	26.3	31.0	+4.7
Performance breakdown by popularity bucket shows consistent degradation from Head to Tail for GPT-4.
Head-to-Tail (Open Domain)	Accuracy (A_LM)	17.4	47.6	+30.2
Head-to-Tail (Open Domain)	Accuracy (A_LM)	34.6	47.6	+13.0
Comparison of Super-Head (Top 10%) vs General Head entities.
Head-to-Tail (Top 10% Head)	Accuracy (A_LM)	41.6	46.2	+4.6
Impact of model size on factuality (LLaMA comparison).
Head-to-Tail	Accuracy (A_LM)	5.3	14.7	+9.4

Main Takeaways

Factuality tracks popularity: Accuracy drops consistently from Head to Torso to Tail across all models tested.
Even 'Head' knowledge is shaky: GPT-4 only answers ~48% of questions correctly about popular entities in the open domain.
Instruction Tuning effect: Instruction-tuned models (Vicuna) tend to be more conservative (higher Missing Rate) than base models (LLaMA), but do not necessarily have higher factual accuracy.
Domain sensitivity: Performance is highest in popular domains like Movies and lowest in niche domains like Academics.
Prompt robustness: Allowing models to answer 'unsure' significantly reduces hallucination rates compared to forcing an answer.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Knowledge Graphs (entities, predicates, triples)
Familiarity with LLM hallucination vs. uncertainty
Basic concepts of entity popularity (Head/Torso/Tail distributions)

Key Terms

Head/Torso/Tail: Buckets of entity popularity; 'Head' entities are the most popular (top 33% of cumulative popularity), 'Tail' entities are the least popular (bottom 33%), and 'Torso' entities fall in between

Hallucination Rate (H): Percentage of questions where the model provides a confident but incorrect answer

Missing Rate (M): Percentage of questions where the model answers 'unsure' or provides no answer

Accuracy (A): Percentage of questions where the model provides the correct answer

LLM-as-a-judge: Using a strong LLM (like ChatGPT) to evaluate the correctness of another model's response

Knowledge Graph (KG): Structured representation of knowledge as subject-predicate-object triples

Instruction Tuning: Training LLMs on a dataset of instructions to improve their ability to follow user commands

Zero-shot/Few-shot: Prompting strategies where the model is given zero or a few examples of the task before being asked to solve it