Sources of Hallucination by Large Language Models on Inference Tasks

📝 Paper Summary

Hallucination suppression, Knowledge internalization

LLMs hallucinate in inference tasks because they rely on memorized training sentences and simple corpus frequency heuristics rather than performing robust logical reasoning.

Core Problem

Large Language Models often assert false entailments (hallucinations) in natural language inference tasks, but it remains unclear which properties of the pre-training data distribution cause them.

Why it matters:

If models rely on memory rather than reasoning, they will fail on novel inputs or private data (e.g., legal docs) that contradict pre-training data
Understanding specific biases allows for better evaluation controls and helps explain why models present incorrect information as fact in downstream tasks like QA
Current trust in LLMs for logical reasoning may be misplaced if performance stems from shallow heuristics

Concrete Example: An LLM that has memorized 'Whiskey consists chiefly of alcohol' might incorrectly claim that 'Whiskey contains alcohol' entails 'Whiskey consists chiefly of alcohol'. It may do so either because the hypothesis is a memorized sentence or because the premise is less frequent in the corpus than the hypothesis, not because the inference is valid.

Key Novelty

Attestation and Frequency Bias Probe

Identifies 'Attestation Bias': Models are more likely to label a relationship as 'Entailment' if the hypothesis sentence appears verbatim in their pre-training data
Identifies 'Relative Frequency Bias': Models default to 'Entailment' if the premise predicate is less frequent in general text than the hypothesis predicate, mimicking a 'specific-to-general' heuristic
Demonstrates that named entities act as 'indices' for memory recall; replacing entities with generic types or rare names breaks the model's reliance on memorization
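As a rough illustration of the two biases above, each premise/hypothesis pair can be tagged with what each bias alone would predict. This is a minimal sketch, not the paper's code: the `NLIPair` fields, the function names, and the frequency values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NLIPair:
    premise: str
    hypothesis: str
    hypothesis_attested: bool  # assumed output of an attestation probe
    premise_freq: float        # assumed corpus frequency of the premise predicate
    hypothesis_freq: float     # assumed corpus frequency of the hypothesis predicate


def attestation_bias_predicts_entail(pair: NLIPair) -> bool:
    """Attestation bias: an attested hypothesis pulls the model toward Entailment."""
    return pair.hypothesis_attested


def frequency_bias_predicts_entail(pair: NLIPair) -> bool:
    """Relative frequency bias: a rarer premise predicate looks 'more specific'."""
    return pair.premise_freq < pair.hypothesis_freq


# Made-up frequencies, for illustration only.
pair = NLIPair("X murdered Y", "X killed Y", False, 2.0e-6, 9.0e-6)
print(attestation_bias_predicts_entail(pair), frequency_bias_predicts_entail(pair))
```

Note that neither function looks at whether the premise actually supports the hypothesis, which is the point of the probe: both signals are independent of the logical relationship.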

Evaluation Highlights

GPT-3.5 is 2.2x more likely to wrongly predict Entailment on random premises when it attests the hypothesis from memory than when it does not
Performance drops sharply when biases contradict labels: LLaMA-65B falls from 65.5% AUC norm (consistent) to 8.1% (adversarial) on Attestation bias samples
GPT-3.5 recall drops from 92.3% to 55.3% when entities in the Levy/Holt dataset are replaced with frequent random entities, indicating reliance on memorization tied to specific entities

Breakthrough Assessment

7/10

Strong behavioral analysis paper that effectively isolates specific mechanisms (memory and frequency) causing hallucinations. It doesn't propose a new architecture but provides crucial diagnostic insights for the field.

⚙️ Technical Details

Problem Definition

Setting: Natural Language Inference (NLI) / Textual Entailment, focusing on directional predicates

Inputs: Premise P and Hypothesis H

Outputs: Label: Entailment, Neutral, or Contradiction

Pipeline Flow

Dataset Transformation (Attestation check, Frequency check, Entity replacement)
Prompt Generation (Few-shot NLI queries)
Model Querying (LLaMA, GPT-3.5, PaLM)
Bias Analysis (Correlating predictions with Attestation/Frequency signals)
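The prompt-generation step can be sketched as follows. The template and the four in-context examples are hypothetical (the example predicates reuse the 'buy'/'own' and 'murder'/'kill' pairs from the Key Terms); the paper's actual prompts may differ.

```python
from typing import List, Tuple

# Hypothetical 4-shot examples: (premise, hypothesis, label).
FEW_SHOT: List[Tuple[str, str, str]] = [
    ("John bought a car", "John owns a car", "Entailment"),
    ("John owns a car", "John bought a car", "No-Entailment"),
    ("The gang murdered the witness", "The gang killed the witness", "Entailment"),
    ("The gang killed the witness", "The gang murdered the witness", "No-Entailment"),
]


def build_prompt(examples: List[Tuple[str, str, str]], premise: str, hypothesis: str) -> str:
    """Concatenate labeled examples, then the unlabeled target query."""
    blocks = [
        f"Premise: {p}\nHypothesis: {h}\nAnswer: {label}"
        for p, h, label in examples
    ]
    blocks.append(f"Premise: {premise}\nHypothesis: {hypothesis}\nAnswer:")
    return "\n\n".join(blocks)


prompt = build_prompt(FEW_SHOT, "Whiskey contains alcohol", "Whiskey consists chiefly of alcohol")
print(prompt)
```

Balancing the in-context labels, as assumed here, avoids nudging the model toward one answer through the examples alone.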

System Modules

Dataset Transformer

Create control datasets: Random Premise (to test memory), Generic Arguments (to remove entity indices), and Random Arguments

Model or implementation: Script-based using Entity Linker and N-grams
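Two of these transformations can be sketched roughly as below. This is an assumption-laden illustration: the function names are hypothetical, the entity-to-type mapping is supplied by hand rather than by an entity linker, and the random-premise sampling here simply borrows a premise from another entry, which may not match the paper's exact procedure.

```python
import random
from typing import Dict, List, Tuple


def random_premise(pairs: List[Tuple[str, str]], rng: random.Random) -> List[Tuple[str, str, str]]:
    """Pair each hypothesis with a premise from a different entry.

    Every output is labeled No-Entail, on the assumption that a randomly
    chosen premise almost never entails the hypothesis. Needs >= 2 pairs.
    """
    out = []
    for i, (_, hypothesis) in enumerate(pairs):
        j = rng.choice([k for k in range(len(pairs)) if k != i])
        out.append((pairs[j][0], hypothesis, "No-Entail"))
    return out


def generic_arguments(sentence: str, entity_types: Dict[str, str]) -> str:
    """Replace each named entity with a typed placeholder such as 'person X'."""
    letters = iter("XYZ")  # supports up to three entities in this sketch
    for entity, entity_type in entity_types.items():
        sentence = sentence.replace(entity, f"{entity_type} {next(letters)}")
    return sentence


pairs = [
    ("Google bought YouTube", "Google owns YouTube"),
    ("The gang murdered the witness", "The gang killed the witness"),
    ("Whiskey contains alcohol", "Whiskey consists chiefly of alcohol"),
]
randomized = random_premise(pairs, random.Random(0))
generic = generic_arguments(
    "Google bought YouTube", {"Google": "organization", "YouTube": "organization"}
)
print(randomized)
print(generic)
```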

Attestation Probe

Determine if the model has memorized the hypothesis H

Model or implementation: Target LLM (Self-query)
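One plausible form of such a self-query is to show the model the hypothesis alone, with no premise, and ask whether it is true. The sketch below assumes a hypothetical `query_model` callable and prompt wording; `stub_model` is a stand-in so the example runs without an LLM.

```python
from typing import Callable


def is_attested(hypothesis: str, query_model: Callable[[str], str]) -> bool:
    """Return True if the model, asked about the hypothesis alone, says it is true."""
    statement = hypothesis.rstrip(".")
    prompt = f"{statement}. Is that true or false?\nAnswer:"
    answer = query_model(prompt).strip().lower()
    return answer.startswith("true")


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call, used only to make the sketch runnable."""
    return " True" if "Whiskey consists chiefly of alcohol" in prompt else " False"


print(is_attested("Whiskey consists chiefly of alcohol", stub_model))
print(is_attested("Whiskey consists chiefly of water", stub_model))
```

Because the probe relies on the model's own answer, it measures what the model treats as known rather than what is verifiably in the training corpus.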

Inference Engine

Predict entailment label for NLI pairs

Model or implementation: LLaMA-65B, GPT-3.5 (text-davinci-003), or PaLM-540B

Modeling

Base Model: LLaMA-65B, PaLM-540B, GPT-3.5 (text-davinci-003)

Training Method: Analysis paper—uses pre-trained/instruct models directly. No new training performed.

Compute: Not reported in the paper

Comparison to Prior Work

vs. Poliak et al. (2018): Probes LLM memory directly via prompting without training, showing sensitivity to pre-training attestation rather than dataset artifacts
vs. Dasgupta et al. (2022): Focuses on realistic linguistic inference tasks (NLI) rather than abstract reasoning tests
vs. Li et al. (2022): Extends analysis to massive LLMs (65B+ params) and identifies specific entity-based memory indexing mechanisms

Limitations

Attestation is measured by model self-prediction rather than by checking the training data, which is proprietary or too large to search
Google N-grams is a proxy for corpus frequency, not the exact training distribution of the models
Analysis is limited to LLaMA, GPT-3.5, and PaLM; GPT-4 is covered only briefly in the appendix
Residual performance remains unexplained after controlling for identified biases

Reproducibility

Code: https://github.com/Teddy-Li/LLM-NLI-Analysis

📊 Experiments & Results

Evaluation Setup

Few-shot (4-shot) prompting on NLI datasets with controlled transformations to isolate biases

Benchmarks:

Levy/Holt (Directional predicate entailment)
RTE-1 (General textual entailment)
Random Premise Task (IRandPrem): adversarial NLI in which all gold labels are No-Entail [New]

Metrics:

Estimated Probability of predicting Entail (conditioned on bias)
Recall (across entity transformations)
AUC norm (Area Under Precision-Recall Curve, normalized)
Statistical methodology: Not explicitly reported in the paper
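The first metric, the estimated probability of predicting Entail conditioned on a bias, amounts to a grouped frequency count. A minimal sketch with toy records (the function name and the data are hypothetical, not the paper's results):

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def entail_rate_by_condition(records: Iterable[Tuple[bool, bool]]) -> Dict[bool, float]:
    """records: (bias_flag, predicted_entail) pairs; returns P(Entail | bias_flag)."""
    counts = defaultdict(lambda: [0, 0])  # flag -> [entail predictions, total]
    for flag, predicted_entail in records:
        counts[flag][0] += int(predicted_entail)
        counts[flag][1] += 1
    return {flag: hits / total for flag, (hits, total) in counts.items()}


# Toy data: 4 attested hypotheses (3 predicted Entail), 4 unattested (1 predicted Entail).
records = [(True, True)] * 3 + [(True, False)] + [(False, True)] + [(False, False)] * 3
rates = entail_rate_by_condition(records)
ratio = rates[True] / rates[False]
print(rates, ratio)
```

On the Random Premise task every Entail prediction is a false positive, so a ratio computed this way corresponds to the "Ratio of False Positive Probability (Attested vs Unattested)" reported in the results.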

Key Results

Experiment 1: Attestation Bias. Models are much more likely to incorrectly predict Entailment on random premises if the hypothesis is attested (memorized). The unattested condition is the reference (ratio 1.0).
Benchmark	Model	Metric	Unattested	Attested	Δ
Levy/Holt (Random Premise)	not specified	Ratio of False Positive Probability (Attested vs Unattested)	1.0	1.9	+0.9
Levy/Holt (Random Premise)	GPT-3.5	Ratio of False Positive Probability (Attested vs Unattested)	1.0	2.2	+1.2
Experiment 2: Entities as Indices. Recall drops significantly when original entities are replaced, even if type constraints preserve semantic validity.
Benchmark	Model	Metric	Original entities	Replaced entities	Δ
Levy/Holt	GPT-3.5	Recall	92.3	55.3	-37.0
Levy/Holt	not specified	Recall	76.2	52.4	-23.8
Impact on Performance (AUC norm). Models perform close to chance when the attestation bias contradicts the gold label.
Benchmark	Model	Metric	Consistent	Adversarial	Δ
Levy/Holt (Standard I)	LLaMA-65B	AUC norm	65.5	8.1	-57.4
Levy/Holt (Standard I)	not specified	AUC norm	85.0	10.8	-74.2
Levy/Holt (Standard I)	not specified	AUC norm	79.1	31.5	-47.6

Experiment Figures

Bar chart showing the probability of predicting Entail for original Levy/Holt entries conditioned on whether the model attests (memorizes) the hypothesis

Bar chart for the 'Random Premise' task (where all correct labels are No-Entail). Shows probability of false positive Entailment conditioned on Attestation.

Bar chart showing the probability of false positive Entailment on random premises conditioned on Relative Frequency Bias (Φ)

Main Takeaways

LLMs consistently use named entities as 'indices' to recall memory; performance degrades when these entities are swapped, even if the logic remains identical
A relative frequency bias (Φ) exists where models prefer entailments where the premise is less frequent than the hypothesis (specific -> general); this bias becomes more pronounced when entity-based memory is unavailable
High aggregate NLI scores mask severe fragility: models are excellent when biases align with labels (Consistent) but fail catastrophically when they conflict (Adversarial)
These biases appear across LLaMA, PaLM, and GPT-3.5, suggesting they stem from the pre-training objective on natural text rather than from fine-tuning
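The Consistent/Adversarial contrast in these takeaways can be made concrete by partitioning samples on whether a bias agrees with the gold label. The sketch below is hypothetical: it uses plain accuracy instead of the paper's AUC norm, and scores an imaginary model that always outputs whatever the bias predicts.

```python
from typing import List, Tuple

Sample = Tuple[bool, bool]  # (bias_predicts_entail, gold_is_entail)


def split_by_bias(samples: List[Sample]) -> Tuple[List[Sample], List[Sample]]:
    """Consistent: bias agrees with the gold label. Adversarial: it disagrees."""
    consistent = [s for s in samples if s[0] == s[1]]
    adversarial = [s for s in samples if s[0] != s[1]]
    return consistent, adversarial


def accuracy_of_bias_follower(samples: List[Sample]) -> float:
    """Accuracy of a hypothetical model that always outputs what the bias predicts."""
    return sum(bias == gold for bias, gold in samples) / len(samples)


samples = [(True, True), (False, False), (True, False), (False, True)]
consistent, adversarial = split_by_bias(samples)
print(accuracy_of_bias_follower(consistent), accuracy_of_bias_follower(adversarial))
```

A pure bias-follower is perfect on the consistent subset and always wrong on the adversarial one, so an aggregate score over both subsets would hide the failure.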

📚 Prerequisite Knowledge

Prerequisites

Understanding of Natural Language Inference (NLI)
Basic knowledge of LLM pre-training objectives (next token prediction)
Familiarity with dataset artifacts/biases

Key Terms

Attestation Bias (Λ): The tendency of an LLM to affirm entailment simply because the hypothesis sentence is attested (memorized) from its training data, regardless of the premise

Relative Frequency Bias (Φ): The tendency of an LLM to affirm entailment if the premise's predicate is less frequent in the corpus than the hypothesis's predicate, mimicking a specific-to-general relationship

NLI: Natural Language Inference—determining whether a hypothesis is true given a premise

Levy/Holt: A specific NLI dataset focused on directional entailment between predicates (e.g., 'murder' entails 'kill' but 'kill' does not entail 'murder')

RTE-1: Recognizing Textual Entailment 1—a classic, difficult NLI dataset

AUC norm: Area under the precision-recall curve, normalized so that 0% represents random chance performance and 100% represents perfect classification
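One way to obtain such a normalization is to subtract the precision of a random classifier (the share of positive labels) and rescale. The sketch below assumes that formula and a step-wise approximation of the precision-recall area; both are assumptions, and the paper's exact computation may differ.

```python
from typing import List


def pr_auc(scores: List[float], labels: List[int]) -> float:
    """Step-wise area under the precision-recall curve (assumes distinct scores)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_positives += 1
            # Each positive adds a recall step of 1/n_pos at the current precision.
            area += (true_positives / rank) / n_pos
    return area


def auc_norm(scores: List[float], labels: List[int]) -> float:
    """Rescale so a random classifier scores 0 and a perfect one scores 1.

    Assumes the labels contain at least one positive and one negative.
    """
    baseline = sum(labels) / len(labels)  # precision of a random classifier
    return (pr_auc(scores, labels) - baseline) / (1.0 - baseline)


perfect = auc_norm([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
inverted = auc_norm([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1])
print(perfect, inverted)
```

Under this assumed formula, a ranking worse than chance yields a negative value, which is why scores near 0 on adversarial samples mean the model carries almost no usable signal there.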

Hallucination: In this context, specifically false positive entailments where the model claims a logical relationship exists when it does not

Directional Entailment: Relationships that hold in one direction but not both (e.g., 'buy' entails 'own', but 'own' does not entail 'buy')

zero-shot: Asking the model to perform a task without providing any examples in the prompt

few-shot: Providing the model with a small number of examples (here, 4) in the prompt before the target query

entity linker: A tool that identifies named entities in text and links them to a knowledge base (here, Freebase IDs)

FIGER types: A set of fine-grained entity types (e.g., 'person', 'location') used to categorize named entities