Systematic Assessment of Factual Knowledge in Large Language Models

📝 Paper Summary

Factual Knowledge Evaluation, Hallucination Detection

The paper proposes a framework to evaluate factual knowledge in LLMs by automatically generating diverse questions from knowledge graphs and measuring performance using an F1 metric that accounts for model abstention.

Core Problem

Existing benchmarks for evaluating LLM factual knowledge focus on generic domains that likely overlap with pretraining data, and constructing domain-specific benchmarks manually is costly and lacks systematic coverage.

Why it matters:

Extrinsic hallucinations (generating unverifiable statements) severely impair LLM trustworthiness in critical decision-making applications
Benchmarks constructed from public datasets pose information leakage problems due to overlap with pretraining corpora
Current evaluations often fail to distinguish between a model not knowing an answer (abstention) and hallucinating a wrong answer

Concrete Example: When asked 'Where was Barack Obama born?' an LLM might answer correctly. However, if the prompt includes a false context like 'Barack Obama was born in Miami,' the model might be misled into repeating the misinformation instead of relying on its parametric knowledge.

Key Novelty

Knowledge Graph-Driven Assessment Framework

Systematically converts knowledge graph triplets (Subject, Relation, Object) into diverse question formats (True/False, Multiple Choice, Short Answer) to ensure complete coverage of facts
Introduces a modified F1 metric that scores 'abstention' (refusal to answer) separately from incorrect answers, rewarding models for knowing what they don't know

Architecture

The systematic assessment framework workflow

Evaluation Highlights

ChatGPT achieves the highest average F1 score (74.00) on the T-REx general domain dataset, consistently outperforming LLaMA and T5 families
LLMs are highly sensitive to adversarial context: relevant context improves performance, but anti-factual context significantly misleads models (e.g., ChatGPT precision drops when false context is provided)
Instruction-tuned models (Alpaca, Flan-T5) consistently outperform their base models (LLaMA-7B, T5-XL) on factual questions, suggesting instruction tuning unlocks knowledge access

Breakthrough Assessment

7/10

Provides a solid, systematic framework for evaluating factuality using KGs and highlights critical robustness issues. The distinction between abstention and error via F1 is a valuable methodological contribution.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot Question Answering based on factual triplets

Inputs: A generated question q derived from a knowledge graph triplet (s, r, o) and optional context c

Outputs: A text response answering the question or an abstention message

Pipeline Flow

Knowledge Graph Selection (Google-RE, T-REx, WikiBio, UMLS)
Question Generation (Template-based & LLM-based via ChatGPT)
Prompt Construction (Adding instructions & optional context)
LLM Inference (Generating answers)
Answer Evaluation (Fuzzy matching & Abstention detection)
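The Prompt Construction step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the instruction wording, the abstention hint, and the names `INSTRUCTIONS`, `ABSTAIN_HINT`, and `build_prompt` are all assumptions made for the example.

```python
# Hypothetical per-question-type instructions; the paper's exact wording is not reproduced here.
INSTRUCTIONS = {
    "tfq": "Answer with 'true' or 'false'.",
    "mcq": "Answer with the letter of the correct option.",
    "saq": "Answer with a short phrase.",
}

# Assumed hint that gives the model an explicit way to abstain.
ABSTAIN_HINT = "If you do not know the answer, reply \"I don't know\"."


def build_prompt(question, qtype, context=None):
    """Assemble a zero-shot prompt from a question q and optional context c."""
    parts = []
    if context:
        # Context may be relevant, irrelevant, or anti-factual.
        parts.append(f"Context: {context}")
    parts.append(f"{INSTRUCTIONS[qtype]} {ABSTAIN_HINT}")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        "Where was Barack Obama born?",
        "saq",
        context="Barack Obama was born in Miami.",  # anti-factual context
    ))
```

Keeping the context as a separate optional argument makes it straightforward to run the same question with no context, relevant context, and anti-factual context, and then compare the answers.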

System Modules

Question Generator

Converts KG triplets into three question types: True-False (TFQ), Multiple Choice (MCQ), and Short Answer (SAQ)

Model or implementation: GPT-3.5-turbo or Template-based
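A template-based generator for the three question types might look like the sketch below. The `Triplet` class, the `TEMPLATES` dictionary, and the function names are hypothetical, and the single `born_in` template is only an illustration; the paper's actual templates and its ChatGPT-based generation path are not shown.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    subject: str
    relation: str
    obj: str


# Hypothetical templates keyed by relation name.
TEMPLATES = {
    "born_in": {
        "statement": "{s} was born in {o}.",
        "question": "Where was {s} born?",
    }
}


def true_false_question(t, distractor=None):
    """Return (question, gold). Passing a distractor yields a false statement."""
    obj = t.obj if distractor is None else distractor
    statement = TEMPLATES[t.relation]["statement"].format(s=t.subject, o=obj)
    gold = "true" if distractor is None else "false"
    return f"True or false: {statement}", gold


def multiple_choice_question(t, distractors, rng):
    """Return (question, gold letter) with the options shuffled by rng."""
    options = [t.obj, *distractors]
    rng.shuffle(options)
    stem = TEMPLATES[t.relation]["question"].format(s=t.subject)
    lines = [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
    gold = chr(ord("A") + options.index(t.obj))
    return stem + "\n" + "\n".join(lines), gold


def short_answer_question(t):
    """Return (question, gold) where gold is the object of the triplet."""
    return TEMPLATES[t.relation]["question"].format(s=t.subject), t.obj


if __name__ == "__main__":
    t = Triplet("Barack Obama", "born_in", "Hawaii")
    print(true_false_question(t, distractor="Miami"))
    print(multiple_choice_question(t, ["Miami", "Chicago"], random.Random(0)))
    print(short_answer_question(t))
```

Every triplet produces one gold answer per question type, which is what lets the same fact be probed in several formats.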

LLM Evaluator (Evaluation)

Target model being assessed for factual knowledge

Model or implementation: Various (ChatGPT, LLaMA, T5 family)

Metric Calculator (Evaluation)

Determines correctness and handles abstentions

Model or implementation: Rule-based scripts

Novel Architectural Elements

Systematic pipeline generating questions from ALL explicit facts in a KG to ensure coverage, rather than sampling
Integration of abstention detection into the F1 metric calculation, so that confidently wrong answers and refusals to answer are penalized differently

Modeling

Base Model: Evaluated multiple families: LLaMA-7B, T5-XL, ChatGPT (GPT-3.5-turbo)

Comparison to Prior Work

vs. LAMA: Generates diverse natural language questions (Wh-, T/F, MCQ) via LLMs/templates rather than just cloze fill-in-the-blank tasks
vs. HELM: Focuses specifically on systematic factual coverage using exhaustive KG triplets rather than sampling existing QA datasets
vs. Atlas [not cited in paper]: Atlas focuses on few-shot retrieval-augmented performance, whereas this paper assesses parametric knowledge in zero-shot settings without external retrieval

Limitations

Assumes the knowledge graph is complete, ignoring implicit facts
Focuses only on single-triplet facts, ignoring complex reasoning across multiple triplets
Evaluation of questions with multiple correct answers (N-M relations) is challenging if the model doesn't return all of them
Dependency on GPT-3.5 for question generation might introduce bias or errors in the evaluation set itself

Reproducibility

Code: https://github.com/RManLuo/llm-facteval

📊 Experiments & Results

Evaluation Setup

Zero-shot question answering across generic and specific domains

Benchmarks:

Google-RE: General Domain QA (place of birth, date of birth, place of death)
T-REx: General Domain QA (Wikipedia subsets)
WikiBio: Biology Domain QA
UMLS: Medical Domain QA

Metrics:

Precision (accuracy of non-abstained answers)
Recall (accuracy over all questions)
F1 score (harmonic mean of Precision and Recall, accounting for abstention)
Statistical methodology: Not explicitly reported in the paper
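The three metrics as defined above can be computed from per-question labels as in the sketch below. The label strings `correct`, `incorrect`, and `abstain` and the function name are hypothetical. Scores are returned on a 0 to 1 scale, whereas the results in this summary appear to be on a 0 to 100 scale.

```python
from collections import Counter


def abstention_aware_scores(judgements):
    """Compute precision, recall, and F1 from 'correct'/'incorrect'/'abstain' labels."""
    counts = Counter(judgements)
    correct = counts["correct"]
    answered = correct + counts["incorrect"]   # non-abstained questions
    total = answered + counts["abstain"]       # all questions

    precision = correct / answered if answered else 0.0
    recall = correct / total if total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    labels = ["correct"] * 6 + ["incorrect"] * 2 + ["abstain"] * 2
    # precision = 6/8 = 0.75, recall = 6/10 = 0.60
    print(abstention_aware_scores(labels))
```

Under these definitions, replacing a wrong answer with an abstention raises precision and leaves recall unchanged, which is how the metric rewards a model for declining to answer when it does not know.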

Key Results

Performance comparison across different Knowledge Graphs shows ChatGPT consistently leading, with general domains scoring higher than specialized ones.

Benchmark	Metric	Baseline	This Paper	Δ
T-REx	F1	61.00	74.00	+13.00
WikiBio	F1	36.60	62.74	+26.14
Google-RE	F1	35.77	35.77	0.00
Google-RE	Precision (TFQ-GPT3.5)	1.46	77.43	+75.97

Ablation on robustness to adversarial context reveals high sensitivity to anti-factual information.

Benchmark	Metric	Baseline	This Paper	Δ
Google-RE	Precision (approximate, read from figure)	Not reported in the paper	Not reported in the paper	Not reported in the paper

Experiment Figures

Impact of different context types (none, relevant, irrelevant, antifactual) on F1 scores for Google-RE

Main Takeaways

ChatGPT consistently outperforms other models (LLaMA family, T5 family) across all domains (General, Biology, Medical)
Instruction tuning is critical for extracting factual knowledge; base models (LLaMA-7B, T5-XL) perform near random guessing on formatted questions
Performance drops significantly in specialized domains (UMLS, WikiBio) compared to general domains (T-REx), indicating limits of pretraining coverage
Models are highly sensitive to prompt context: relevant context boosts performance, but anti-factual context causes models to hallucinate, overriding their internal knowledge
Flan-T5-XL tends to abstain frequently (high precision, low recall), prioritizing safety over answering, whereas ChatGPT attempts more answers

📚 Prerequisite Knowledge

Prerequisites

Understanding of Knowledge Graphs (entities, relations, triplets)
Familiarity with Large Language Models and instruction tuning
Basic knowledge of evaluation metrics (Precision, Recall, F1)

Key Terms

Knowledge Graph (KG): A structured representation of facts, often stored as triplets (subject, relation, object)

Abstention: When an LLM refuses to answer a question (e.g., 'I don't know') rather than hallucinating an incorrect answer

Extrinsic Hallucinations: Generation of statements that cannot be verified from the source or contradict factual knowledge

Triplet: The fundamental unit of data in a knowledge graph, consisting of two entities and the relationship between them (e.g., Obama, born_in, Hawaii)

Parametric Knowledge: Knowledge stored within the model's weights during pre-training, as opposed to knowledge provided in the prompt context

T-REx: A large-scale alignment of natural language with knowledge base triplets, used as a general-domain dataset

Instruction Finetuning: Training a base model on datasets of instructions and responses to improve its ability to follow user commands