Factual Knowledge Assessment of Language Models Using Distractors

📝 Paper Summary

Factual Knowledge Assessment, Hallucination detection

The authors propose assessing an LM's factual knowledge by checking if it assigns higher probability to the correct entity than to a set of carefully selected incorrect 'distractor' entities.

Core Problem

Existing methods for assessing factual knowledge (like cloze sentences) suffer from 'out-of-subject' continuations where the model generates grammatically correct but irrelevant text, or struggle with multiple correct verbalizations.

Why it matters:

Verifying that a model generates 'Poland' after 'Germany shares a border with' is insufficient if the model merely guessed or if valid alternatives exist
Current metrics like Precision@k fail for multi-token answers, while probability-based metrics are hard to interpret or compare across models
Verbalization artifacts (e.g., missing determiners in templates) often break existing assessment metrics, leading to false negatives in knowledge assessment

Concrete Example: For the fact (France, capital, Paris), a cloze sentence 'The capital of France is...' might be continued with 'a city of contrasts' (Out-Of-Subject) instead of 'Paris'. While linguistically valid, this continuation fails to reveal if the model actually knows the capital is Paris.

Key Novelty

Distractor-Based Knowledge Assessment (Min@n / Avg@n)

Instead of generating text, the method compares the probability of the correct object against a set of 'distractors' (incorrect but plausible alternatives of the same type)
A fact is considered 'known' only if the model assigns higher plausibility to the correct object than to the distractors (specifically using a Minimum aggregation function for strictness)
Introduces multiple strategies for retrieving distractors, finding that 'Approximate Optimal' (ApprOpt) distractors—those the model itself considers highly probable—are the hardest and most effective

Architecture

Comparison of distractor strategies (Random, Temp+Sem, ApprOpt) by measuring how 'hard' they are for the model (lower Min@n score = harder).

Evaluation Highlights

The proposed measure (using ApprOpt distractors) achieves a Kendall's τ correlation of 0.282 with human judgment, outperforming BERT-score (0.159) and Precision@n (0.185)
The method is highly robust to verbalization artifacts (e.g., template errors), achieving a consistency score (Kendall's τ) of 0.92 between correct and flawed prompts, compared to 0.47 for ROUGE-L
Larger models are slightly more robust to distractors, but even a 12B model only reaches ~0.80 Avg@20 score, suggesting LMs remain vulnerable to plausible incorrect alternatives

Breakthrough Assessment

7/10

Offers a robust, interpretable alternative to generation-based evaluation. While not a fundamental architecture change, it significantly improves the reliability of knowledge assessment methodologies.

⚙️ Technical Details

Problem Definition

Setting: Given a fact triple f=(s, r, o) and a cloze sentence c derived from (s, r), determine if the language model 'knows' f.

Inputs: Fact triple (Subject, Relation, Object) and a corresponding cloze sentence template.

Outputs: A scalar score K(f) indicating knowledge presence (typically binary 0/1 or continuous 0-1 depending on aggregation).

Pipeline Flow

Fact Verbalization: Convert (s, r) to cloze sentence c
Distractor Retrieval: Fetch n incorrect entities o* (via ApprOpt, Temp+Sem, etc.)
Plausibility Calculation: Compute probabilities for correct object o and distractors o*
Comparison & Aggregation: Compare Pl(o|c) vs Pl(o*|c) to compute K(f)
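
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`verbalize`, `knowledge_scores`, `toy_plausibility`), the template string, and every number in `TOY_PL` are hypothetical, and the per-distractor comparison `Pl(o|c) > Pl(o*|c)` followed by a minimum or mean is an assumed reading of Min@n and Avg@n based on the summary.

```python
from typing import Callable, Sequence, Tuple


def verbalize(subject: str, template: str) -> str:
    """Step 1: fill a relation template with the subject to get a cloze sentence."""
    return template.format(s=subject)


def knowledge_scores(
    cloze: str,
    correct: str,
    distractors: Sequence[str],
    plausibility: Callable[[str, str], float],
) -> Tuple[int, float]:
    """Steps 3-4: compare the correct object against each distractor.

    Returns (Min@n, Avg@n), where each distractor contributes 1 if the
    correct object is strictly more plausible and 0 otherwise (assumed definition).
    """
    pl_correct = plausibility(cloze, correct)
    wins = [1 if pl_correct > plausibility(cloze, d) else 0 for d in distractors]
    return min(wins), sum(wins) / len(wins)


# Hypothetical plausibility table standing in for a real language model.
TOY_PL = {"Paris": 0.30, "Lyon": 0.10, "Marseille": 0.05, "Versailles": 0.35}


def toy_plausibility(cloze: str, entity: str) -> float:
    """Toy scorer: looks the entity up in TOY_PL and ignores the cloze sentence."""
    return TOY_PL.get(entity, 0.0)


cloze = verbalize("France", "The capital of {s} is")

# Step 2 is replaced here by hand-picked distractor lists.
easy = knowledge_scores(cloze, "Paris", ["Lyon", "Marseille"], toy_plausibility)
hard = knowledge_scores(
    cloze, "Paris", ["Lyon", "Marseille", "Versailles"], toy_plausibility
)
print(easy)  # Min@2 = 1, Avg@2 = 1.0
print(hard)  # Min@3 = 0, Avg@3 = 2/3
```

In this toy setup a single harder distractor flips Min@n from 1 to 0 while Avg@n only drops to 2/3, which is why the minimum is described as the stricter aggregation.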

System Modules

Verbalizer

Maps fact triples to natural language templates

Model or implementation: Template-based system (augmented via GPT-3.5)

Distractor Retriever

Identifies n incorrect but plausible entities o*

Model or implementation: Various strategies (ApprOpt, Semantic, Random)

Scorer

Calculates the knowledge score K(f)

Model or implementation: Algorithmic comparison

Novel Architectural Elements

Integration of an 'Approximate Optimal' distractor search that uses the LM's own generation probabilities (constrained decoding) to find the 'hardest' incorrect answers
Evaluation framework comparing Plausibility of correct vs. incorrect entities rather than raw generation text matching
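
One way to picture the constrained search is a beam search that only follows token prefixes belonging to some incorrect entity label. The sketch below is an assumption-laden toy, not the paper's algorithm: `constrained_beam_search`, `toy_next_logprobs`, the token strings, and the probabilities are all hypothetical, labels are pre-tokenized by hand, and no end-of-label probability is modeled.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Tokens = Tuple[str, ...]


def constrained_beam_search(
    next_logprobs: Callable[[Tokens], Dict[str, float]],
    candidates: Sequence[Tokens],
    n: int,
    beam_width: int = 4,
) -> List[Tuple[Tokens, float]]:
    """Return up to n candidate labels with the highest log-probability found."""
    # Every prefix of every candidate label is a legal partial hypothesis.
    allowed = {c[:i] for c in candidates for i in range(len(c) + 1)}
    complete = set(candidates)
    finished: Dict[Tokens, float] = {}
    beam: List[Tuple[Tokens, float]] = [((), 0.0)]
    for _ in range(max(len(c) for c in candidates)):
        expanded = []
        for prefix, score in beam:
            for token, logp in next_logprobs(prefix).items():
                new = prefix + (token,)
                if new in allowed:  # the constraint: stay inside candidate labels
                    expanded.append((new, score + logp))
        for seq, score in expanded:
            if seq in complete:
                finished[seq] = score
        expanded.sort(key=lambda item: item[1], reverse=True)
        beam = expanded[:beam_width]
        if not beam:
            break
    return sorted(finished.items(), key=lambda item: item[1], reverse=True)[:n]


def toy_next_logprobs(prefix: Tokens) -> Dict[str, float]:
    """Hypothetical next-token distribution standing in for a real LM."""
    table = {
        (): {"Paris": 0.5, "Lyon": 0.3, "Saint": 0.2},
        ("Saint",): {"-Etienne": 0.9, "-Malo": 0.1},
    }
    return {tok: math.log(p) for tok, p in table.get(prefix, {}).items()}


# The correct object ("Paris") is left out, so only incorrect entities remain.
incorrect = [("Lyon",), ("Saint", "-Etienne"), ("Saint", "-Malo")]
hardest = constrained_beam_search(toy_next_logprobs, incorrect, n=2)
print([" ".join(seq) for seq, _ in hardest])
```

Because the candidate set excludes the correct object, the highest-scoring survivors are, by construction, the incorrect entities this toy model finds most probable.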

Modeling

Base Model: Pythia-6.9B (primary model for experiments), also Pythia-70M through Pythia-12B

Training Method: Not applicable — paper evaluates pre-trained models

Trainable Parameters: None (evaluation only)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Precision@n: Precision@n fails for multi-token answers; Distractors handle multi-token entities via summing label probabilities
vs. LLM-as-a-judge: Distractors are computationally cheaper (no secondary inference call for judging) and more interpretable
vs. LAMA: Uses comparisons against specific negative examples (distractors) rather than rank within the entire vocabulary
vs. Kassner et al. (2021) [not cited in paper]: Kassner uses a fixed small set of distractors; this work dynamically retrieves distractors from millions of Wikidata entities
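
The multi-token handling mentioned in the Precision@n comparison can be illustrated with a short sketch. It assumes that a label's probability is the chain-rule product of its token probabilities and that an entity's plausibility is the sum over its labels (aliases); `token_logprob` is a hypothetical callback standing in for a real language model, and `coin_flip_lm` is a deliberately trivial stand-in.

```python
import math
from typing import Callable, List, Sequence

TokenLogProb = Callable[[List[str], str], float]


def label_logprob(
    context: Sequence[str], label_tokens: Sequence[str], token_logprob: TokenLogProb
) -> float:
    """Log-probability of a (possibly multi-token) label via the chain rule."""
    prefix = list(context)
    total = 0.0
    for token in label_tokens:
        total += token_logprob(prefix, token)
        prefix.append(token)  # condition the next token on everything so far
    return total


def plausibility(
    context: Sequence[str],
    labels: Sequence[Sequence[str]],
    token_logprob: TokenLogProb,
) -> float:
    """Sum of label probabilities over all aliases of one entity."""
    return sum(
        math.exp(label_logprob(context, label, token_logprob)) for label in labels
    )


def coin_flip_lm(prefix: List[str], token: str) -> float:
    """Toy model: every token has probability 0.5 regardless of context."""
    return math.log(0.5)


context = ["The", "capital", "of", "the", "USA", "is"]
aliases = [["Washington"], ["Washington", "D.C."]]  # hand-tokenized labels
print(plausibility(context, aliases, coin_flip_lm))  # 0.5 + 0.25 = 0.75
```

Scoring whole labels this way lets a two-token alias compete on equal terms with a single-token one, which a top-n check over single next tokens cannot do.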

Limitations

Requires a comprehensive Knowledge Base (Wikidata) to identify valid distractors
Human validation was performed on a limited set of annotations (subset of 210 facts)
Does not account for the potential informativeness of Out-Of-Subject continuations (treats them as zero info)
The KaRR metric baseline could not be reproduced to match reported literature performance

Reproducibility

Code: https://github.com/Orange-OpenSource/DistFactAssessLM

Code and data are available in the repository above. The experiments use the Pythia model suite. The dataset is derived from the Wikidata dump of 2021-01-04, and the templates are augmented via GPT-3.5.

📊 Experiments & Results

Evaluation Setup

Knowledge assessment on Wikidata facts using cloze sentences

Benchmarks:

Wikidata Facts (Subset) (Fact Verification / Cloze Completion) [New]

Metrics:

Kendall's τ (correlation with human judgment)
Min@n (Robustness score against n distractors)
Robustness to Verbalization Artifacts (correlation between correct/flawed prompts)
Statistical methodology: Cross-validation with K=3 folds. Wilson's 95% confidence intervals reported for error analysis.
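
For readers unfamiliar with the two statistics named above, the following standard-library sketch computes Kendall's τ (the tie-aware τ-b variant) and a Wilson 95% interval. Which τ variant the paper uses is not stated in this summary, so τ-b is an assumption, and the function names are illustrative.

```python
import math
from itertools import combinations
from statistics import NormalDist
from typing import Sequence, Tuple


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b: rank correlation with a correction for tied pairs."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denom


def wilson_interval(
    successes: int, trials: int, confidence: float = 0.95
) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / trials
    scale = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / scale
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials**2)) / scale
    return center - half, center + half


# Illustrative inputs only; these are not values from the paper.
print(kendall_tau_b([0, 0, 1, 1], [1, 2, 3, 4]))
print(wilson_interval(5, 10))
```

The Wilson interval stays inside [0, 1] even for small samples or extreme proportions, which makes it a common choice for error-rate analysis.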

Key Results

Correlation with human judgment (Kendall's τ). Higher is better. 'Dist. ApprOpt' is the proposed method.
Benchmark	Metric	Baseline	This Paper	Δ
Wikidata Subset	Kendall's τ	0.185 (Precision@n)	0.282	+0.097
Wikidata Subset	Kendall's τ	0.159 (BERT-score)	0.282	+0.123
Wikidata Subset	Kendall's τ	0.293	0.282	-0.011

Robustness to Verbalization Artifacts. Measures consistency (Kendall's τ) when assessing the same fact with a correct vs. slightly flawed template. Higher is better.
Benchmark	Metric	Baseline	This Paper	Δ
Verbalization Errors Dataset	Kendall's τ (Consistency)	0.47 (ROUGE-L)	0.92	+0.45
Verbalization Errors Dataset	Kendall's τ (Consistency)	0.63	0.92	+0.29

Experiment Figures

Robustness of different Pythia model sizes (70M to 12B) against ApprOpt distractors

Main Takeaways

ApprOpt (Approximate Optimal) is the most effective distractor strategy because it selects incorrect answers the model *already* thinks are likely, providing a stricter test
Temporal distractors (e.g., past presidents) are harder than random semantic distractors, but only ~1% of Wikidata facts have them, limiting their general utility
Robustness to distractors grows only logarithmically with model size; even Pythia-12B reaches an Avg@20 of just ~0.80, and extrapolation suggests an unrealistic size (2e19 parameters) would be needed for 90% robustness
The method eliminates the 'Out-Of-Subject' problem inherent in generation metrics by forcing a choice between controlled entities
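
As a rough sanity check on the scaling takeaway, the slope implied by the two quoted points can be worked out by hand. The calculation assumes a fit that is linear in log10 of the parameter count, treats Pythia-12B as 1.2e10 parameters, and takes the approximate ~0.80 figure as exact; the paper's actual fit may differ.

```latex
\text{Avg@20}(N) \approx a + b \log_{10} N

b \approx \frac{0.90 - 0.80}{\log_{10}(2 \times 10^{19}) - \log_{10}(1.2 \times 10^{10})}
  = \frac{0.10}{19.30 - 10.08}
  \approx 0.011
```

Under these assumptions, each tenfold increase in parameters adds only about 0.011 to Avg@20, which is why the extrapolated size for 90% robustness is so large.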

📚 Prerequisite Knowledge

Prerequisites

Knowledge Graph triples (Subject, Relation, Object)
Language Model probability generation
Information Retrieval (TF-IDF)
Beam Search decoding

Key Terms

cloze sentence: A sentence with a blank space that the model is asked to fill (e.g., 'The capital of France is ____')

distractor: An incorrect but plausible alternative answer used to test if the model can distinguish truth from likely falsehoods

OOS (Out-Of-Subject continuation): text generated by an LM that is grammatically correct but irrelevant to the factual query (e.g., 'The capital of France is... a city of contrasts')

Kendall's τ: A correlation coefficient used to measure the ordinal association between two measured quantities (here, metric scores vs. human ratings)

ApprOpt: Approximation of Optimal Distractors—a strategy to find distractors by beam-searching the LM's own high-probability generations constrained to incorrect entities

Plausibility: The sum of probabilities of all valid labels (aliases) for a specific entity given a context

KaRR (Knowledge Assessment Risk Ratio): a statistical metric estimating the ratio between the probability of generating the correct answer given the LM's distribution versus by pure chance

TF-IDF: Term Frequency-Inverse Document Frequency—a statistical measure used to evaluate how relevant a word is to a document in a collection

Precision@n: A metric checking if the correct answer appears in the top-n most probable generations

Wikidata: A free, collaborative, multilingual, secondary database, collecting structured data to provide support for Wikipedia