Evaluating the Factuality of Large Language Models using Large-Scale Knowledge Graphs

📝 Paper Summary

Factuality Evaluation, Knowledge Graph (KG) Integration

GraphEval efficiently evaluates LLM factuality at scale by using a lightweight judge model to assess truthfulness on millions of declarative statements derived from Knowledge Graphs.

Core Problem

Existing factuality benchmarks are limited in scope, often domain-specific, and computationally expensive, because they require generating full text and then validating it through careful human review or a large verifier model.

Why it matters:

Limited evaluation data restricts breadth, failing to cover the wide range of topics LLMs handle
High costs of generating and validating full-text responses make frequent large-scale evaluations infeasible
Small benchmarks carry risks of bias and data leakage, compromising evaluation validity

Concrete Example: For the triple (Barack Obama, birthPlace, Hawaii), a multiple-choice question might confuse the model with distractors, and asking for free-form generation requires expensive parsing of the output. GraphEval instead converts the triple into the statement 'Obama was born in Hawaii' and uses a lightweight judge to classify the LLM's response as True, False, or IDK directly from its hidden states.

Key Novelty

KG-driven retrieval with lightweight judge model (GraphEval)

Utilizes entire Knowledge Graphs (like DBpedia) to generate millions of factual True/False prompts automatically, avoiding manual labeling
Replaces expensive text generation with a lightweight judge model that predicts 'True', 'False', or 'IDK' directly from the LLM's hidden states
Uses a prompt encoder to compress instruction prefixes, further reducing computational overhead for the judge

Architecture

The GraphEval framework workflow: from KG sampling to Judge Model training and final Evaluation.

Evaluation Highlights

Evaluates on 10 million facts from DBpedia, significantly larger than existing benchmarks like FELM or TruthfulQA
The judge model is described as closely aligned with the LLM's own answers, which implies high accuracy, while substantially reducing evaluation cost compared with generating full text
Demonstrates that judge model performance is robust across different LLM sizes (7B to 70B), allowing the use of smaller substitute models for efficient hidden state computation

Breakthrough Assessment

7/10

Significant scale-up for factuality evaluation (10M facts vs. thousands) with a practical efficiency solution (the judge model). However, it relies on simple triple-based facts rather than complex reasoning.

⚙️ Technical Details

Problem Definition

Setting: Triple classification for factuality assessment using LLM hidden states

Inputs: A declarative statement derived from a KG triple (e.g., 'Obama was born in Hawaii') and the target LLM's hidden states

Outputs: Classification label: True, False, or I don't know (IDK)

Pipeline Flow

Data Generation: KG Triples → Declarative Statements (via GPT-4 templates) + Negative Sampling
Data Labeling: Small subset labeled by target LLM (True/False/IDK)
Judge Training: Train classifier on LLM hidden states + labels
Large-Scale Evaluation: Apply Judge to millions of KG statements

System Modules

Statement Generator

Converts KG triples into natural language sentences using relation-specific templates

Model or implementation: GPT-4 (for template creation), Rule-based filling
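
The rule-based filling step can be illustrated with a minimal sketch. The template strings and the function name `triple_to_statement` below are hypothetical; the paper's actual GPT-4-generated templates are not reproduced here.

```python
# Hypothetical relation templates, written by hand for illustration only.
TEMPLATES = {
    "birthPlace": "{head} was born in {tail}.",
    "capital": "The capital of {head} is {tail}.",
}


def triple_to_statement(head, relation, tail, templates=TEMPLATES):
    """Fill the relation's template with the head and tail entities."""
    template = templates.get(relation)
    if template is None:
        # Relations without a template are rejected rather than guessed.
        raise KeyError(f"no template for relation {relation!r}")
    return template.format(head=head, tail=tail)


statement = triple_to_statement("Barack Obama", "birthPlace", "Hawaii")
print(statement)  # Barack Obama was born in Hawaii.
```

Keeping one template per relation means a relation only has to be phrased once, after which any number of triples can be converted without further model calls.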

Prompt Encoder (Judge Model)

Compresses the instruction prefix into a continuous embedding to reduce input size for the judge

Model or implementation: P-tuning based encoder

Judge Classifier (Judge Model)

Predicts the correctness of the statement based on LLM internals

Model or implementation: 2-layer Feed-Forward Network (FFN) with LayerNorm and ReLU
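
A forward pass through such a classifier might look like the sketch below. It is written in dependency-free Python for readability and rests on several assumptions that the summary does not confirm: the layer order (LayerNorm, linear, ReLU, linear), the toy dimensions, the random initialization, and the omission of LayerNorm's learned scale and shift.

```python
import math
import random

LABELS = ["True", "False", "IDK"]


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_linear(n_out, n_in, rng):
    """Random weights and zero biases; a stand-in for trained parameters."""
    scale = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def judge_forward(hidden_state, params):
    """Assumed order: LayerNorm -> Linear -> ReLU -> Linear -> softmax over 3 labels."""
    w1, b1, w2, b2 = params
    h = layer_norm(hidden_state)
    h = relu(linear(h, w1, b1))
    return softmax(linear(h, w2, b2))


rng = random.Random(0)
hidden_dim, inner_dim = 8, 4  # toy sizes; real LLM hidden states are far larger
params = (*init_linear(inner_dim, hidden_dim, rng),
          *init_linear(len(LABELS), inner_dim, rng))
hidden_state = [rng.gauss(0.0, 1.0) for _ in range(hidden_dim)]  # stand-in vector
probs = judge_forward(hidden_state, params)
prediction = LABELS[max(range(len(LABELS)), key=probs.__getitem__)]
```

Because the classifier only consumes a hidden-state vector, the LLM needs a single forward pass per statement and no token-by-token decoding.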

Novel Architectural Elements

Use of a lightweight probe (Judge Model) on LLM hidden states specifically for factuality verification of KG triples, bypassing text generation
Integration of a P-tuning prompt encoder solely to shrink the judge model's input by compressing the instruction prefix

Modeling

Base Model: Evaluated on LLaMA-2 family (7B, 13B, 70B), Baichuan2-13B, ChatGLM3-6B

Training Method: Supervised training of the Judge Model (classifier)

Objective Functions:

Purpose: Minimize classification error of the judge.

Formally: The judge hypothesis h is trained by minimizing a convex loss L_D(h), which serves as a surrogate for the misclassification rate between the predicted and true labels.
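
One conventional way to write an objective of this kind is shown below. This is a sketch under assumptions, not the paper's exact formulation: D is taken to be the distribution over pairs of hidden-state input x and label y, and cross-entropy is assumed as the convex per-example loss.

```latex
L_D(h) = \mathbb{E}_{(x, y) \sim D}\big[\ell(h(x), y)\big],
\qquad
\ell_{\mathrm{CE}}(h(x), y) = -\log p_h(y \mid x),
\quad y \in \{\text{True}, \text{False}, \text{IDK}\}
```

Here p_h(y | x) denotes the probability the judge assigns to label y. Cross-entropy is convex in the judge's output logits, which is why it is a common surrogate for the non-convex 0-1 misclassification loss.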

Adaptation: P-tuning for prompt encoding

Trainable Parameters: Judge model weights (FFN) and Prompt Encoder embeddings

Training Data:

Subset of KG triples (DBpedia) converted to statements
Labels collected by querying the LLM on this subset to get ground truth behavior (True/False/IDK)

Compute: Significantly reduced compared to full generation (forward pass only + lightweight classifier). Exact training time not reported.

Comparison to Prior Work

vs. MMLU/TruthfulQA: GraphEval scales to 10M+ facts using KGs instead of limited hand-curated sets
vs. FELM: GraphEval uses automated KG extraction rather than diverse domain sampling
vs. External Tools: GraphEval uses internal hidden states and a judge model rather than external search engines or separate large verifier models
vs. ITI (Inference-Time Intervention) [not cited in paper]: ITI also uses probes on hidden states but for steering; GraphEval uses them for evaluation/judgment
vs. RAGAS [not cited in paper]: RAGAS evaluates RAG pipelines using LLM-as-a-judge on generated text; GraphEval evaluates the model's internal knowledge directly via hidden states

Limitations

Relies on simple atomic facts (triples); may not evaluate complex reasoning or multi-hop factuality
Judge model requires training on a subset of data labeled by the specific target LLM, which might induce circularity or overfitting to that LLM's idiosyncrasies
Templates might restrict the diversity of how facts are presented compared to natural language variation
Negative sampling (random entity replacement) might generate easy-to-detect falsehoods compared to subtle hallucinations

Reproducibility

Code: https://github.com/xz-liu/GraphEval

The code is publicly available at the repository listed above. The paper uses DBpedia, a public knowledge graph. Relation templates are generated by GPT-4 and manually refined.

📊 Experiments & Results

Evaluation Setup

Factuality evaluation on DBpedia Knowledge Graph

Benchmarks:

DBpedia (Knowledge Graph Fact Verification)

Metrics:

Correctness (Accuracy of True/False prediction)
Truthfulness (Likelihood of providing honest response: True or IDK)
Informativeness (Likelihood of offering substantive info: not IDK)
Statistical methodology: Theoretical analysis of generalization bound provided; empirical statistical significance tests not explicitly reported
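
The three metrics can be computed from gold KG labels and judge predictions roughly as follows. This sketch assumes one particular reading of the definitions above: a response counts as truthful when it is either correct or IDK, and as informative when it is anything other than IDK. The function name and the toy labels are illustrative, not taken from the paper.

```python
def factuality_metrics(gold, predicted):
    """Compute correctness, truthfulness, and informativeness rates.

    gold:      list of "True"/"False" labels derived from the KG
    predicted: list of "True"/"False"/"IDK" labels from the judge
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    n = len(gold)
    correct = sum(p == g for g, p in zip(gold, predicted))
    abstained = sum(p == "IDK" for p in predicted)
    return {
        "correctness": correct / n,
        # Assumed reading: an abstention is honest, so it counts as truthful.
        "truthfulness": (correct + abstained) / n,
        "informativeness": (n - abstained) / n,
    }


# Toy labels for illustration only; these are not results from the paper.
gold = ["True", "False", "True", "False"]
pred = ["True", "True", "IDK", "False"]
metrics = factuality_metrics(gold, pred)
print(metrics)
# {'correctness': 0.5, 'truthfulness': 0.75, 'informativeness': 0.75}
```

Under this reading, a model can raise truthfulness by abstaining more often, but only at the cost of informativeness, which is why the two are reported together.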

Key Results

No results table is reproduced here. The paper focuses on methodology and framework design: although it evaluates on 10 million facts, comparative tables with baseline numbers for standard benchmarks (such as MMLU) are not its primary focus. Instead, it analyzes LLM performance across KG characteristics. The judge model's accuracy is implied to be high, but exact validation numbers against a gold standard are not explicitly tabulated in the main text available for this summary.

Experiment Figures

Comparison of GraphEval vs. conventional evaluation methods.

Main Takeaways

GraphEval scales factuality evaluation to 10 million facts, offering orders of magnitude more coverage than manually curated benchmarks
The Judge Model acts as an effective surrogate for the LLM's text generation, enabling cost-efficient large-scale evaluation
LLM hidden states are robust enough that a smaller 'substitute' model (e.g., LLaMA-7B) can be used to train the judge for a larger model (e.g., LLaMA-70B), further reducing compute
Analysis reveals LLM performance varies by relation type and entity popularity (degree/pageviews), with popular entities generally yielding higher factuality
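
A breakdown by relation type or entity popularity could be computed along the lines of the sketch below. The record fields, the degree threshold of 100, and the toy records are all assumptions made for illustration; they are not data or settings from the paper.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Group records with `key` and return the fraction judged correctly per group."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        group = key(record)
        totals[group] += 1
        hits[group] += record["predicted"] == record["gold"]
    return {group: hits[group] / totals[group] for group in totals}


def degree_bucket(record, threshold=100):
    """Hypothetical popularity split based on the head entity's KG degree."""
    return "popular" if record["head_degree"] >= threshold else "long-tail"


# Made-up records for illustration only.
records = [
    {"relation": "birthPlace", "head_degree": 5000, "gold": "True", "predicted": "True"},
    {"relation": "birthPlace", "head_degree": 12, "gold": "True", "predicted": "False"},
    {"relation": "capital", "head_degree": 900, "gold": "False", "predicted": "False"},
    {"relation": "capital", "head_degree": 3, "gold": "True", "predicted": "True"},
]

by_relation = accuracy_by_group(records, key=lambda r: r["relation"])
by_popularity = accuracy_by_group(records, key=degree_bucket)
print(by_relation)    # {'birthPlace': 0.5, 'capital': 1.0}
print(by_popularity)  # {'popular': 1.0, 'long-tail': 0.5}
```

Passing the grouping function as a parameter lets the same aggregation serve for relation type, degree, pageviews, or any other KG attribute.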

📚 Prerequisite Knowledge

Prerequisites

Knowledge Graphs (structure of triples: head, relation, tail)
Large Language Models (hidden states, embeddings)
Prompt Tuning (P-tuning)

Key Terms

Knowledge Graph (KG): A structured representation of facts, typically as triples (entity, relation, entity), used here as a source of ground truth

Judge Model: A lightweight classifier (neural network) trained to predict whether an LLM knows a fact based on the LLM's internal hidden states

Negative Sampling: The process of creating false statements by corrupting a true triple (e.g., swapping the tail entity) to test if the model can identify falsehoods
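
Tail corruption can be sketched as follows. The function name `corrupt_tail` and the filtering against known true triples are illustrative assumptions; the summary only states that entities are replaced at random.

```python
import random


def corrupt_tail(triple, candidate_tails, known_triples, rng):
    """Replace the tail with a random entity, avoiding accidental true triples.

    Returns None when no valid replacement exists.
    """
    head, relation, tail = triple
    options = [
        t for t in candidate_tails
        if t != tail and (head, relation, t) not in known_triples
    ]
    if not options:
        return None
    return (head, relation, rng.choice(options))


# Toy KG for illustration only.
known = {
    ("Barack Obama", "birthPlace", "Hawaii"),
    ("Angela Merkel", "birthPlace", "Hamburg"),
}
tails = sorted({t for _, _, t in known})
negative = corrupt_tail(("Barack Obama", "birthPlace", "Hawaii"), tails, known, random.Random(0))
print(negative)  # ('Barack Obama', 'birthPlace', 'Hamburg')
```

Filtering against the known triples guards against labeling a true fact as false, although it cannot catch facts that are true but missing from the KG.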

P-tuning: A method to optimize continuous prompt embeddings instead of discrete text tokens, used here to compress instructions

Declarative Statement: A simple sentence stating a fact (e.g., 'The sky is blue') rather than a question (e.g., 'What color is the sky?'), used to simplify evaluation

Hidden States: The internal numerical representations (vectors) within an LLM that capture the model's processing of the input before generating output tokens

Substitute Model: A smaller version of an LLM (e.g., 7B) used to compute hidden states for the judge, on the assumption that its representations align with those of larger versions (e.g., 70B)