The Law of Knowledge Overshadowing: Towards Understanding, Predicting, and Preventing LLM Hallucination

📝 Paper Summary

Factuality and Hallucination · Mechanism Interpretability · Decoding Strategies

The paper identifies 'knowledge overshadowing'—where dominant knowledge suppresses less frequent facts—as a primary cause of hallucinations and proposes a log-linear law to predict it and a decoding strategy to mitigate it.

Core Problem

LLMs hallucinate even when trained on strictly factual data because dominant (popular) knowledge patterns suppress less prominent correct information during generation.

Why it matters:

A common view attributes hallucinations mainly to low-quality or incorrect training data, but this paper shows that hallucinations persist even when the training corpus is 100% factual
Existing methods detect hallucinations only after generation; there is no principled way to predict hallucination rates based on training data characteristics before training
Reliability in high-stakes domains requires understanding why models prioritize wrong associations (e.g., associating a country with its leader instead of a requested singer)

Concrete Example: When queried for 'famous singer in North Korea', the model incorrectly generates 'Kim Jong Un'. The strong association between 'North Korea' and 'Kim Jong Un' overshadows the specific constraint 'singer', causing the model to misassemble facts.

Key Novelty

The Law of Knowledge Overshadowing and CoDa Decoding

Introduces a log-linear law stating hallucination rates scale linearly with the logarithm of knowledge popularity, knowledge length, and model size
Proposes CoDa (Contrastive Decoding to Amplify Overshadowed Knowledge), which detects overshadowed concepts by masking dominant tokens and contrasting distributions to amplify the suppressed correct information

Architecture

Conceptual illustration of knowledge overshadowing. It shows two knowledge pieces: 'Kim Jong Un is a politician in North Korea' (Dominant) and 'Ri Sol-ju is a singer in North Korea' (Overshadowed).

Evaluation Highlights

+27.9% improvement in factuality on the newly constructed Overshadow dataset using the proposed CoDa decoding strategy
+13.1% improvement on the MemoTrap benchmark compared to standard decoding
+18.3% improvement on the NQ-Swap benchmark, demonstrating generalization to diverse factual tasks

Breakthrough Assessment

8/10

Significant theoretical contribution by formulating a scaling law for hallucinations, coupled with a practical, training-free decoding solution that yields substantial improvements.

⚙️ Technical Details

Problem Definition

Setting: Auto-regressive language modeling where the goal is to predict factual continuations without being misled by dominant but irrelevant associations

Inputs: A prompt containing a shared context and a specific entity/constraint (e.g., 'A famous singer in North Korea is')

Outputs: The correct factual completion (e.g., a singer's name) rather than a popular but incorrect association (e.g., a politician's name)

Pipeline Flow

Overshadowed Knowledge Detection (via Mutual Information)
Contrastive Decoding (Amplification)

System Modules

Overshadowed Knowledge Detection

Identify which tokens in the prompt are being ignored by the model

Model or implementation: The LLM itself (inference only)
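A minimal sketch of the detection step, under stated assumptions. `toy_next_token_dist` is a hypothetical stand-in for an LLM's next-token distribution. Each prompt token is replaced in turn by a `[MASK]` placeholder, and the token's influence is the KL divergence between the full-prompt and masked-prompt distributions. This equals the expected pointwise mutual information between the next token and the masked token, with masking standing in for marginalization. A constraint whose removal barely changes the prediction is flagged as overshadowed. The masking scheme, threshold, and candidate set are illustrative, not the paper's exact procedure.

```python
import math
from typing import Callable, Dict, Iterable, List, Tuple

Dist = Dict[str, float]
MASK = "[MASK]"


def toy_next_token_dist(tokens: List[str]) -> Dist:
    """Hypothetical stand-in for an LLM's next-token distribution.

    'North Korea' strongly pulls toward 'Kim Jong Un' (dominant knowledge);
    'singer' only weakly pulls toward 'Ri Sol-ju' (overshadowed knowledge).
    """
    scores = {"Kim Jong Un": 0.1, "Ri Sol-ju": 0.1, "someone": 1.0}
    if "North Korea" in tokens:
        scores["Kim Jong Un"] += 3.0
        scores["Ri Sol-ju"] += 1.0
    if "singer" in tokens:
        scores["Ri Sol-ju"] += 0.3
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}


def kl_divergence(p: Dist, q: Dist) -> float:
    """KL(p || q); assumes q[y] > 0 wherever p[y] > 0."""
    return sum(p[y] * math.log(p[y] / q[y]) for y in p if p[y] > 0)


def influence_scores(tokens: List[str],
                     dist_fn: Callable[[List[str]], Dist]) -> List[Tuple[str, float]]:
    """Score each prompt token by how far masking it shifts the next-token distribution."""
    full = dist_fn(tokens)
    scores = []
    for i, tok in enumerate(tokens):
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        scores.append((tok, kl_divergence(full, dist_fn(masked))))
    return scores


def detect_overshadowed(tokens: List[str], dist_fn: Callable[[List[str]], Dist],
                        candidates: Iterable[str], threshold: float) -> List[str]:
    """Flag candidate constraint tokens whose influence falls below `threshold`."""
    wanted = set(candidates)
    return [tok for tok, s in influence_scores(tokens, dist_fn)
            if tok in wanted and s < threshold]


prompt = ["A", "famous", "singer", "in", "North Korea", "is"]
print(influence_scores(prompt, toy_next_token_dist))
print(detect_overshadowed(prompt, toy_next_token_dist, {"singer", "North Korea"}, 0.05))  # ['singer']
```

Function words such as "A" or "is" also receive near-zero influence, so the sketch only tests a candidate set of content constraints. A real system would need some rule for choosing those candidates.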

Contrastive Decoding

Generate text by amplifying signals from the overshadowed tokens

Model or implementation: The LLM itself (inference only)
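The amplification step can be sketched as contrasting the full-prompt distribution with the distribution obtained when the overshadowed constraint is masked. The scoring rule (1 + alpha)·log p(y | x) − alpha·log p(y | x_masked) and the `beta` plausibility cutoff come from the general contrastive-decoding literature (the cutoff follows Li et al., 2023). Whether CoDa uses exactly this weighting is an assumption, and the two distributions below are hypothetical numbers, not results from the paper.

```python
import math
from typing import Dict

Dist = Dict[str, float]


def contrastive_scores(p_full: Dist, p_masked: Dist, alpha: float = 1.0,
                       beta: float = 0.1, eps: float = 1e-12) -> Dict[str, float]:
    """Score tokens by (1 + alpha) * log p_full - alpha * log p_masked.

    Tokens that the full prompt supports more than the masked prompt gain score.
    The beta cutoff keeps only tokens that are plausible under the full prompt.
    """
    cutoff = beta * max(p_full.values())
    scores = {}
    for y, p in p_full.items():
        if p < cutoff:
            continue  # adaptive plausibility constraint
        q = max(p_masked.get(y, 0.0), eps)  # avoid log(0) for unseen tokens
        scores[y] = (1 + alpha) * math.log(p) - alpha * math.log(q)
    return scores


def contrastive_pick(p_full: Dist, p_masked: Dist,
                     alpha: float = 1.0, beta: float = 0.1) -> str:
    """Return the highest-scoring token under the contrastive objective."""
    scores = contrastive_scores(p_full, p_masked, alpha, beta)
    return max(scores, key=scores.get)


# Hypothetical next-token distributions (illustrative only).
p_full = {"Kim Jong Un": 0.55, "Ri Sol-ju": 0.30, "Pyongyang": 0.15}    # "A famous singer in North Korea is"
p_masked = {"Kim Jong Un": 0.70, "Ri Sol-ju": 0.10, "Pyongyang": 0.20}  # same prompt with "singer" masked

print(max(p_full, key=p_full.get))         # greedy decoding: Kim Jong Un
print(contrastive_pick(p_full, p_masked))  # contrastive amplification: Ri Sol-ju
```

Setting `alpha=0` reduces the score to log p(y | x), which recovers greedy decoding. Larger `alpha` pushes harder toward tokens that depend on the masked constraint.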

Novel Architectural Elements

Integration of Mutual Information-based token masking into the decoding process to dynamically identify and upweight suppressed information constraints

Modeling

Base Model: Llama-2-7B, Llama-2-13B, Llama-2-70B, GPT-2 (various sizes), OPT (various sizes)

Training Method: Pre-training from scratch (for scaling law experiments) and Inference-only (for CoDa)

Trainable Parameters: None (for CoDa method)

Training Data:

Synthetic dataset constructed with strictly controlled knowledge pairs (Knowledge A vs Knowledge B)
Controlled for relative popularity ratios (m:n) and length ratios
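A minimal sketch of how a corpus with a controlled m:n popularity ratio could be assembled. The layout (distinguishing span + shared context + answer, built from random token IDs), the vocabulary size, and the function names are hypothetical choices for illustration. They are not the paper's generator.

```python
import random
from typing import List, Tuple

Seq = List[int]


def make_knowledge_pair(rng: random.Random, vocab_size: int,
                        shared_len: int, dist_len: int) -> Tuple[Seq, Seq]:
    """Two knowledge pieces that share a context but differ in a distinguishing span and answer."""
    if dist_len < 1:
        raise ValueError("dist_len must be at least 1 so the two pieces can differ")
    shared = [rng.randrange(vocab_size) for _ in range(shared_len)]
    dist_a = [rng.randrange(vocab_size) for _ in range(dist_len)]
    dist_b = list(dist_a)
    while dist_b == dist_a:  # ensure the distinguishing spans actually differ
        dist_b = [rng.randrange(vocab_size) for _ in range(dist_len)]
    ans_a, ans_b = rng.sample(range(vocab_size), 2)  # two distinct answers
    return dist_a + shared + [ans_a], dist_b + shared + [ans_b]


def build_corpus(num_pairs: int, m: int, n: int, shared_len: int = 8,
                 dist_len: int = 1, vocab_size: int = 1000, seed: int = 0):
    """Repeat knowledge A m times and knowledge B n times per pair (popularity ratio m:n)."""
    rng = random.Random(seed)
    corpus: List[Seq] = []
    pairs: List[Tuple[Seq, Seq]] = []
    for _ in range(num_pairs):
        a, b = make_knowledge_pair(rng, vocab_size, shared_len, dist_len)
        pairs.append((a, b))
        corpus.extend(list(a) for _ in range(m))
        corpus.extend(list(b) for _ in range(n))
    rng.shuffle(corpus)
    return corpus, pairs


corpus, pairs = build_corpus(num_pairs=3, m=10, n=2)
print(len(corpus))  # 36
```

The length ratio is varied through `shared_len` and `dist_len`, and the popularity ratio through `m` and `n`. This keeps the two factors independent.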

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard Decoding: CoDa actively intervenes to suppress dominant associations, whereas standard decoding follows the highest-likelihood path, which often leads to hallucinations
vs. Contrastive Decoding (CD; Li et al., 2023): CoDa contrasts the model against itself run on a masked version of the prompt, rather than against a separate, smaller amateur model. (CD is not evaluated as a direct baseline in the paper; this is a conceptual comparison.)

Limitations

The log-linear law is validated primarily on synthetic data and specific controlled fine-tuning tasks. Confirming it on web-scale pre-training corpora is harder, because popularity and length are difficult to measure precisely there.
The CoDa method requires multiple forward passes (calculating mutual information with masks), increasing inference cost.
Measuring 'popularity' rigorously requires identifying competing knowledge pairs, which is ambiguous in unstructured, in-the-wild data.

Reproducibility

The paper provides mathematical derivations and algorithmic descriptions. Specific code URLs are not provided in the text. Synthetic dataset generation logic is described.

📊 Experiments & Results

Evaluation Setup

Controlled synthetic experiments to validate scaling laws; factual QA benchmarks to validate CoDa decoding.

Benchmarks:

Overshadow (synthetic QA) [New]
MemoTrap (trap QA testing memorization vs. reasoning)
NQ-Swap (counterfactual QA: Natural Questions with swapped entities)

Metrics:

Factuality Accuracy (Correctness)
Relative Hallucination Rate (R)
Statistical methodology: Not explicitly reported in the paper

Key Results

Decoding strategy comparison: CoDa's factuality improvement over the baseline across three datasets (improvements are measured relative to the baseline, so its column is 0.0).
Benchmark	Metric	Baseline	This Paper	Δ
Overshadow	Factuality improvement (%)	0.0	27.9	+27.9
MemoTrap	Factuality improvement (%)	0.0	13.1	+13.1
NQ-Swap	Factuality improvement (%)	0.0	18.3	+18.3

Main Takeaways

Factual hallucinations follow a log-linear law: Rate ~ log(Popularity) + log(Length) + log(Model Size).
Larger models, surprisingly, can exhibit higher rates of factual hallucination due to stronger encoding of dominant (overshadowing) knowledge.
Knowledge overshadowing occurs even with 100% factual training data; it is an issue of data distribution and interaction, not just data correctness.
The CoDa strategy effectively mitigates this by identifying and amplifying the specific constraints that are being ignored.
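The log-linear relationship can be written as a small predictor. This sketch makes three assumptions: natural logarithms, one linear coefficient per factor plus an intercept, and clipping to [0, 1] so the output stays a valid rate. The clipping is an addition of this sketch, not part of the stated law. The coefficients have no defaults because the paper's fitted values are not reproduced here; the ones in the usage lines are placeholders.

```python
import math


def predicted_hallucination_rate(popularity: float, length: float, model_size: float,
                                 *, a: float, b: float, c: float, d: float) -> float:
    """Log-linear form R = a*log(P) + b*log(L) + c*log(S) + d, clipped to [0, 1].

    a, b, c, d must be fit to measured hallucination rates.
    """
    if min(popularity, length, model_size) <= 0:
        raise ValueError("popularity, length, and model_size must be positive")
    raw = (a * math.log(popularity) + b * math.log(length)
           + c * math.log(model_size) + d)
    return min(1.0, max(0.0, raw))


# Placeholder coefficients for illustration only (not fitted values from the paper).
coeffs = dict(a=0.05, b=0.05, c=0.01, d=0.0)
low = predicted_hallucination_rate(2, 1, 1e8, **coeffs)
high = predicted_hallucination_rate(20, 1, 1e8, **coeffs)
print(high - low)  # equals a * log(10) when neither value is clipped
```

Because the law is linear in the logarithms, each tenfold increase in popularity adds the same fixed amount, a·log(10), to the predicted rate, regardless of the starting value.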

📚 Prerequisite Knowledge

Prerequisites

Understanding of auto-regressive language generation
Familiarity with scaling laws in LLMs
Basic concepts of mutual information and entropy

Key Terms

Knowledge Overshadowing: A phenomenon where a more prevalent (popular) knowledge representation competes with and suppresses less frequent knowledge, causing the model to ignore specific constraints

Log-Linear Law: A mathematical relationship found in this paper: Hallucination Rate ~ log(Popularity) + log(Length) + log(Model Size)

CoDa: Contrastive Decoding to Amplify Overshadowed Knowledge—the proposed decoding strategy that boosts the probability of tokens that are valid but suppressed by dominant associations

Contrastive Decoding: A generation method that selects tokens maximizing the difference in log-probability between a strong expert model and a weaker amateur model (in CoDa, the contrast is instead between the same model run on the original and on a modified, masked prompt)

Mutual Information: A measure used here to quantify the dependence between the prompt and the next token, helping to identify which parts of the prompt are being ignored (overshadowed)

Knowledge Popularity: The relative frequency of a specific fact or entity within the training corpus

Knowledge Length: The proportional length (in tokens) of the specific distinguishing knowledge relative to the shared context
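Both quantities can be computed directly on a toy corpus. This sketch assumes exact-sentence matching for facts and whitespace tokenization. It also treats "singer" as the distinguishing knowledge and "in North Korea" as the shared context, which is an illustrative choice. Real corpora would need entity and fact matching, which is the ambiguity noted under Limitations.

```python
from collections import Counter
from typing import List


def knowledge_popularity(corpus: List[str], fact: str) -> float:
    """Relative frequency of one fact, matched here as an exact sentence, in the corpus."""
    return Counter(corpus)[fact] / len(corpus)


def knowledge_length(distinguishing: str, shared_context: str) -> float:
    """Whitespace-token length of the distinguishing span relative to the shared context."""
    return len(distinguishing.split()) / len(shared_context.split())


dominant = "Kim Jong Un is a politician in North Korea"
overshadowed = "Ri Sol-ju is a singer in North Korea"
corpus = [dominant] * 9 + [overshadowed]  # hypothetical 9:1 popularity split

print(knowledge_popularity(corpus, dominant) / knowledge_popularity(corpus, overshadowed))
print(knowledge_length("singer", "in North Korea"))
```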