Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion

📝 Paper Summary

Mechanistic Interpretability, Factuality and Hallucination, Internal Knowledge Retrieval

The authors introduce PRISM, a method to disentangle fact completion into four distinct scenarios (exact recall, heuristics, guesswork, generic modeling), showing that previous interpretability findings conflate these behaviors.

Core Problem

Current mechanistic interpretability studies assume that a correct model prediction implies 'fact recall,' and so fail to distinguish among actual memorization, shallow heuristics, and lucky guesses.

Why it matters:

Treating all correct predictions as 'knowledge' leads to misleading conclusions about where and how LMs store facts.
Models relying on shallow heuristics (e.g., name bias) are unreliable and prone to hallucination, but current evaluation methods often mask this behavior.
Existing datasets like CounterFact contain mixtures of these behaviors (e.g., 510/1209 samples likely rely on heuristics), contaminating interpretability results.

Concrete Example: Given the query 'Astrid Lindgren was born in', a model might predict 'Sweden' because it memorized the fact (Recall) or because it associates Swedish-sounding names with Sweden (Heuristic). Previous methods treat both as identical 'knowledge,' obscuring the underlying mechanism.

Key Novelty

PRISM (Precise Identification of Scenarios for Model behavior)

Decomposes 'fact completion' into four distinct scenarios (Generic Language Modeling, Guesswork, Heuristics Recall, and Exact Fact Recall) using three diagnostic criteria: whether the query is a fact completion, whether the prediction is confident, and whether it relies on heuristics.
Demonstrates that widely accepted interpretability signatures (like mid-layer MLP importance) primarily hold for 'Exact Fact Recall' but vanish or shift for Heuristics and Guesswork.
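
The way the three diagnostic criteria could map onto the four scenarios can be sketched as a small decision function. This is a minimal illustration, not the authors' filtering code: the `Diagnostics` class, the `classify_scenario` function, and the order in which the criteria are checked are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Diagnostics:
    """Three yes/no criteria for one (prompt, prediction) pair (hypothetical container)."""
    is_fact_completion: bool  # does the prompt ask for a fact at all?
    is_confident: bool        # is the prediction confident?
    uses_heuristics: bool     # can surface cues (e.g. name bias) explain the prediction?


def classify_scenario(d: Diagnostics) -> str:
    """Assumed decision order: fact completion, then confidence, then heuristics."""
    if not d.is_fact_completion:
        return "Generic Language Modeling"
    if not d.is_confident:
        return "Guesswork"
    if d.uses_heuristics:
        return "Heuristics Recall"
    return "Exact Fact Recall"


# A confident fact completion with no heuristic explanation.
label = classify_scenario(Diagnostics(True, True, False))
print(label)
```

Under these assumptions, only a confident fact completion that no heuristic can explain ends up labeled as Exact Fact Recall.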

Architecture

Conceptual diagram of the four prediction scenarios (Generic LM, Guesswork, Heuristics, Exact Fact Recall) and their diagnostic criteria.

Evaluation Highlights

Exact Fact Recall samples in GPT-2 XL confirm the importance of mid-range MLP sublayers (consistent with previous literature), but this pattern disappears for Heuristics Recall samples.
Interpretability results on mixed samples (simulating previous datasets) reproduce prior findings, indicating that earlier conclusions were dominated by high-confidence recall samples and masked the other behaviors.
A linear probe trained on internal states (Causal Tracing effects) achieves 0.72-0.78 accuracy in classifying the four prediction scenarios across GPT-2 XL and Llama 2 models.

Breakthrough Assessment

8/10

Significantly refines the understanding of 'knowledge' in LMs by showing that a correct prediction is not the same as fact recall. The decomposition into four scenarios helps explain inconsistencies in prior interpretability work.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive fact completion where a model predicts an object O given a subject S and relation R.

Inputs: Prompt templates expressing incomplete facts (e.g., '[S] was born in').

Outputs: Next token prediction (object [O]).
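
The input/output setting can be illustrated with a few lines of code. This is a toy sketch: `build_prompt`, `predict_object`, and the next-token probabilities are made up for illustration and do not come from any model or from the paper.

```python
def build_prompt(template: str, subject: str) -> str:
    """Fill the [S] placeholder of a subject-first template."""
    if "[S]" not in template:
        raise ValueError("template must contain the [S] placeholder")
    return template.replace("[S]", subject)


def predict_object(next_token_probs: dict) -> str:
    """Return the most probable next token, i.e. the predicted object [O]."""
    return max(next_token_probs, key=next_token_probs.get)


prompt = build_prompt("[S] was born in", "Astrid Lindgren")

# Hypothetical next-token distribution standing in for a real model's output.
toy_probs = {"Sweden": 0.61, "Norway": 0.22, "the": 0.17}
predicted = predict_object(toy_probs)
print(prompt, "->", predicted)
```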

Pipeline Flow

Dataset Construction (PRISM)
Scenario Classification
Interpretability Analysis

System Modules

PRISM Dataset Construction

Generate/filter samples for 4 scenarios using diagnostic criteria: Fact Completion, Confident Prediction, and Heuristics Usage.

Model or implementation: Rules and filters applied to ParaRel/LAMA/Wikipedia data

Causal Tracing (CT) (Interpretability Analysis)

Identify which model layers/tokens are decisive for the prediction in each scenario.

Model or implementation: GPT-2 XL, Llama 2 (7B/13B)

Information Flow Analysis (Interpretability Analysis)

Analyze attention dependencies and attribute extraction at intermediate layers.

Model or implementation: GPT-2 XL

Novel Architectural Elements

The PRISM taxonomy itself serves as a novel structural framework for analyzing LM inference: it splits the analysis of the 'inference pipeline' into four distinct logical paths, one per prediction scenario.

Modeling

Base Model: GPT-2 XL (1.5B), Llama 2 7B, Llama 2 13B

Compute: Experiments performed on T4, A40, and A100 NVIDIA GPUs. Training time not reported (inference/analysis only).

Comparison to Prior Work

vs. Meng et al. (2022): PRISM shows that the findings of Meng et al. hold only for 'Exact Fact Recall'. For 'Heuristics' and 'Guesswork', the critical mid-layer MLP signal is absent or different.
vs. Geva et al. (2023): Challenges the generalization that early non-subject layers are critical for all fact predictions; this holds for 'Guesswork' and Generic LM, but these layers are much less important for 'Exact Fact Recall' (15% vs 45% impact).
vs. CounterFact dataset: PRISM splits samples by mechanism rather than just correctness. The authors find ~42% of CounterFact's 'known' samples are likely heuristics.
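
The ~42% figure is consistent with the 510/1209 sample count quoted in the Core Problem section, as a quick check shows (variable names are illustrative):

```python
heuristic_like = 510   # CounterFact samples likely relying on heuristics
known_samples = 1209   # CounterFact 'known' samples considered

share = heuristic_like / known_samples
print(f"{share:.1%}")
```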

Limitations

Heuristics filters are leaky; some name-based biases (especially for non-person subjects) might go undetected.
Confidence is proxied by consistency across paraphrases, which may not perfectly align with true model uncertainty.
Analysis focuses on autoregressive models and subject-first templates; encoder-decoder models are not tested.
Synthetic data used for Heuristics Recall (to ensure zero memorization) might differ slightly from natural language distributions.
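
The idea of using paraphrase consistency as a confidence proxy can be sketched as follows. This is an assumption-laden illustration: the function names, the majority-share score, and the 0.8 threshold are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def paraphrase_consistency(predictions):
    """Return the most common prediction and the share of paraphrases that produced it."""
    if not predictions:
        raise ValueError("need at least one prediction")
    top_answer, top_count = Counter(predictions).most_common(1)[0]
    return top_answer, top_count / len(predictions)


def is_confident(predictions, threshold=0.8):
    """Treat a prediction as confident if enough paraphrases agree (threshold is hypothetical)."""
    _, score = paraphrase_consistency(predictions)
    return score >= threshold


# Made-up predictions for five paraphrases of the same query.
preds = ["Sweden", "Sweden", "Sweden", "Norway", "Sweden"]
answer, score = paraphrase_consistency(preds)
print(answer, score, is_confident(preds))
```

A score like this measures agreement across prompts, not calibrated uncertainty, which is the gap the limitation above points to.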

Reproducibility

Code: https://github.com/dsaynova/lm_interpretation_fact_completion

Code and datasets are publicly available in the repository linked above. The PRISM dataset creation process is fully detailed, with an algorithm for each scenario (Generic LM, Guesswork, Heuristics, Exact Recall).

📊 Experiments & Results

Evaluation Setup

Mechanistic analysis of 3 models (GPT-2 XL, Llama 2 7B, 13B) on 4 prediction scenarios.

Benchmarks:

PRISM (GPT-2 XL) (Fact Completion / Analysis) [New]
PRISM (Llama 2 7B/13B) (Fact Completion / Analysis) [New]

Metrics:

Averaged Indirect Effect (AIE)
Probability change under Attention Knockout
Attribute Extraction Rate (via Logit Lens)
Probe Accuracy (for classifying scenarios)
Statistical methodology: 95% confidence intervals reported for Causal Tracing and Information Flow results.
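
The summary does not say how the 95% confidence intervals are computed. One common choice, shown here purely as an assumption, is a normal-approximation interval around the sample mean; the input values below are made up.

```python
import math
import statistics


def mean_ci95(values):
    """Sample mean with a normal-approximation 95% confidence interval."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical per-sample effects for one (layer, token) position.
effects = [0.1, 0.2, 0.3, 0.4]
mean, low, high = mean_ci95(effects)
print(mean, low, high)
```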

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Causal Tracing (CT) results show that the 'standard' fact recall pattern (mid-layer MLP importance) is unique to the Exact Fact Recall scenario and does not generalize.
PRISM (GPT-2 XL)	AIE Peak Location	High	High (only for Exact Fact Recall)	Consistent
PRISM (GPT-2 XL)	AIE Peak Location (Heuristics)	Mid-layer MLP peak	No decisive peak	Qualitative difference
PRISM (GPT-2 XL)	AIE Peak Location (Guesswork)	Late layer Last token	Late layer Last token	Matches Generic LM
Information Flow analysis reveals distinct mechanisms for computing predictions across scenarios.
PRISM (GPT-2 XL)	Prob drop (Subject Attn)	60%	80%	+20%
PRISM (GPT-2 XL)	Prob drop (Subject Attn)	80%	7%	-73%
PRISM (Llama 2 7B)	Probe Accuracy (scenario classification)	0.25	0.78	+0.53

Experiment Figures

Causal Tracing (AIE) heatmaps for GPT-2 XL across the four PRISM scenarios.

Attention Knockout results showing probability drop when blocking attention to specific tokens.

Main Takeaways

Interpreting LMs on datasets like CounterFact is misleading because they mix 'bona fide' recall with heuristics and guesswork, which have different mechanistic signatures.
Exact Fact Recall relies heavily on mid-layer MLPs at the subject token (confirming ROME), whereas Guesswork and Heuristics rely more on late-layer processing at the last token.
Models can be surprisingly accurate using Heuristics or Guesswork, but these processes are mechanistically distinct from retrieving a stored fact.
Generic Language Modeling serves as a strong baseline for 'non-fact' behavior; Guesswork mechanistically resembles Generic LM more than it resembles Fact Recall.

📚 Prerequisite Knowledge

Prerequisites

Mechanistic Interpretability (Causal Tracing, Information Flow)
Transformer Architecture (MLP layers, Attention heads)
Language Model Knowledge storage theories (Key-Value memories)

Key Terms

Causal Tracing (CT): A method to locate important model states by corrupting an input embedding and restoring specific internal activations to recover the correct prediction.

Information Flow: Analysis of how information moves through the model, often using attention knockout (blocking edges) or logit lens (decoding intermediate states).

MLP: Multilayer Perceptron—the feed-forward sublayers in a Transformer block, often hypothesized to store factual knowledge.

PRISM: Precise Identification of Scenarios for Model behavior, the authors' proposed method for creating diagnostic datasets that separate generic language modeling, guesswork, heuristics recall, and exact fact recall.

Indirect Effect: The measure in Causal Tracing representing how much restoring a specific state contributes to the probability of the correct output.
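
The corrupt-then-restore logic behind Causal Tracing and the indirect effect can be shown on a toy network. This is a sketch under strong assumptions: the two-token, two-layer `run` function and its weights are invented, and the corruption is a fixed offset rather than the random noise used in practice, so that the example is deterministic.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def run(subject_emb, last_emb, patch=None):
    """Toy forward pass; `patch` maps a state name to a value that overrides it."""
    patch = patch or {}
    states = {}

    def put(name, value):
        states[name] = patch.get(name, value)
        return states[name]

    h1_s = put("L1/subject", math.tanh(2.0 * subject_emb))
    h1_t = put("L1/last", math.tanh(0.5 * last_emb))
    put("L2/subject", math.tanh(h1_s))                   # does not feed the output
    h2_t = put("L2/last", math.tanh(h1_t + 1.5 * h1_s))  # last token reads the subject
    return sigmoid(3.0 * h2_t), states


clean_prob, clean_states = run(1.0, 0.2)
corrupt_prob, _ = run(-2.0, 0.2)  # corrupted subject embedding

# Restore one clean state at a time in the corrupted run and measure the recovery.
indirect_effect = {}
for name, clean_value in clean_states.items():
    restored_prob, _ = run(-2.0, 0.2, patch={name: clean_value})
    indirect_effect[name] = restored_prob - corrupt_prob

for name, effect in indirect_effect.items():
    print(f"{name}: {effect:.3f}")
```

In this toy setup, restoring a state that lies on the path from the subject to the output recovers the clean probability, while restoring a state off that path has no effect. Averaging such effects over many samples gives the Averaged Indirect Effect (AIE) listed among the metrics.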

ParaRel: A dataset of paraphrased relational facts used to test consistency and confidence.

CounterFact: A standard dataset for evaluating fact editing and recall; the authors argue it mixes different prediction scenarios.

LAMA: LAnguage Model Analysis—a probe dataset checking factual knowledge in LMs using cloze-style queries.

Attention Knockout: An interpretability technique that zeroes out attention weights from specific tokens to see how it affects the output probability.
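
A minimal sketch of the knockout idea on a single toy attention head follows. The scores, values, and token positions are invented, and the blocking is implemented by masking scores before the softmax, which is one way to force the corresponding attention weights to zero; it is not claimed to match the authors' implementation.

```python
import math


def softmax(scores):
    m = max(scores)
    if m == -math.inf:
        raise ValueError("cannot block every position")
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, values, blocked=()):
    """Attention output for one query; `blocked` positions get zero weight."""
    masked = [-math.inf if i in blocked else s for i, s in enumerate(scores)]
    weights = softmax(masked)
    output = sum(w * v for w, v in zip(weights, values))
    return weights, output


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical scores/values for positions [subject_1, subject_2, last].
scores = [1.0, 2.0, 0.5]
values = [2.0, 2.5, -1.0]

weights_base, out_base = attend(scores, values)
weights_ko, out_ko = attend(scores, values, blocked=(0, 1))  # knock out the subject

p_base, p_ko = sigmoid(out_base), sigmoid(out_ko)
print(f"probability drop: {p_base - p_ko:.3f}")
```

The drop in output probability after blocking the subject positions is the kind of quantity reported as "Prob drop (Subject Attn)" in the results table.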

Logit Lens: A technique to interpret intermediate layer representations by projecting them into the vocabulary space to see what token they currently predict.
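
The projection step can be sketched with a toy vocabulary and unembedding matrix. All numbers here are invented for illustration, and the sketch skips the final layer normalization that a real model would typically apply before the unembedding.

```python
VOCAB = ["Sweden", "Norway", "the"]

# Hypothetical unembedding matrix: one row of weights per vocabulary token.
UNEMBED = [
    [1.0, -0.5, 0.0],
    [-0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]


def logit_lens(hidden):
    """Project a hidden state into vocabulary space and return the top token."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in UNEMBED]
    best = max(range(len(VOCAB)), key=logits.__getitem__)
    return VOCAB[best], logits


# Made-up hidden states of the last token at three successive layers.
hidden_by_layer = [
    [0.1, 0.2, 0.9],
    [0.4, 0.5, 0.6],
    [0.9, 0.3, 0.2],
]

decoded = [logit_lens(h)[0] for h in hidden_by_layer]
print(decoded)
```

Tracking the layer at which the correct attribute first becomes the top decoded token, over many samples, is the intuition behind the Attribute Extraction Rate metric listed above.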