Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

📝 Paper Summary

Hallucination suppression, Mechanistic Interpretability

Using Sparse Autoencoders, the paper identifies specific directions in LLM representation space that encode whether the model recognizes an entity. Steering along these directions causally changes whether the model refuses or answers, which offers a handle for reducing hallucinations.

Core Problem

LLMs frequently hallucinate when prompted about entities they do not know, and the internal mechanisms governing whether a model refuses to answer or invents facts are poorly understood.

Why it matters:

Hallucinations limit LLM deployment in critical fields like healthcare where factuality is essential
Prior work focuses on how models recall *known* facts, leaving open how they handle *unknown* facts and decide when to refuse
Fine-tuning for refusal behavior is effective but opaque; understanding the underlying mechanism allows for more robust control

Concrete Example: When asked 'When was the player Wilson Brown born?' (a non-existent entity), a base model might hallucinate a date. The proposed method detects the model's lack of recognition and can force a refusal ('I don't know...') or conversely force a hallucination on a known entity.

Key Novelty

Entity Recognition Directions via SAEs

Discovers specific directions in the residual stream (via SAE latents) that activate when the model processes a known vs. an unknown entity
Demonstrates these directions are causal: steering along them can force a chat model to refuse a known entity or hallucinate an unknown one
Finds that these mechanisms, discovered in the base model, are repurposed by the chat model to implement refusal behaviors

Architecture

The mechanistic circuit of factual recall and how the discovered latents interfere with it.

Evaluation Highlights

Steering with the 'unknown entity' latent induces nearly 100% refusal rates across diverse entity types (players, movies, cities) in Gemma 2 2B
Latents distinguishing known/unknown entities generalize across types (e.g., a latent found for athletes also works for songs)
Shows that 'unknown' directions disrupt the downstream attention heads responsible for attribute extraction

Breakthrough Assessment

8/10

Strong mechanistic evidence linking specific SAE features to high-level refusal behavior. The finding that base model features are repurposed for chat refusal is a significant insight for alignment.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of entities (Known vs. Unknown) and causal intervention on generation

Inputs: Prompt containing an entity e (e.g., 'Who is [Entity]?')

Outputs: Activation of specific SAE latents and subsequent text generation (refusal vs. answer)

Pipeline Flow

Entity Dataset Construction (Known vs. Unknown classification)
SAE Analysis (pre-trained Gemma Scope SAEs)
Latent Selection (Separation Score Calculation)
Steering / Causal Intervention

System Modules

Entity Classifier

Categorize entities as 'known' or 'unknown' based on model's ability to recall attributes

Model or implementation: Gemma 2 (2B and 9B)
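The labeling step can be illustrated with a minimal sketch. It assumes the model's generated attribute values are compared to reference values by fuzzy string matching, and that an entity counts as 'known' when enough attributes match and 'unknown' when none do. The function names, the 0.8 similarity threshold, and the `min_correct` cutoff are illustrative assumptions, not values taken from the paper's code.

```python
from difflib import SequenceMatcher


def fuzzy_match(prediction: str, reference: str, threshold: float = 0.8) -> bool:
    """Return True when the normalized strings are similar enough.

    The threshold is a hypothetical choice for illustration.
    """
    a = prediction.strip().lower()
    b = reference.strip().lower()
    if not a or not b:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold


def label_entity(predictions: dict, references: dict, min_correct: int = 2) -> str:
    """Label an entity from how many attributes the model recalled correctly.

    Assumed rule: at least `min_correct` matches -> 'known',
    zero matches -> 'unknown', anything in between -> 'ambiguous'.
    """
    correct = sum(
        fuzzy_match(predictions.get(attr, ""), ref)
        for attr, ref in references.items()
    )
    if correct >= min_correct:
        return "known"
    if correct == 0:
        return "unknown"
    return "ambiguous"


refs = {"birth_year": "1987", "team": "Los Angeles Lakers"}
print(label_entity({"birth_year": "1987", "team": "los angeles lakers "}, refs))
print(label_entity({"birth_year": "2001", "team": "Chicago Bulls"}, refs))
```

Because the labels depend on generated text and an approximate match, borderline cases are a plausible source of the label noise listed under Limitations.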

Feature Extractor

Identify SAE latents that distinguish between known and unknown entities

Model or implementation: Gemma Scope SAEs (JumpReLU)
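One way to picture latent selection is the sketch below. It assumes the separation score of a latent is the difference between how often it fires on known-entity prompts and how often it fires on unknown-entity prompts, and that the MaxMin variant picks the latent whose worst score across entity types is highest. Both definitions are assumptions made for illustration; the helper names and the toy activations are hypothetical.

```python
def activation_frequency(activations, latent):
    """Fraction of prompts on which the latent has a positive activation."""
    return sum(1 for row in activations if row[latent] > 0) / len(activations)


def separation_scores(known_acts, unknown_acts):
    """Per-latent (known_score, unknown_score) pairs.

    known_score   = f_known - f_unknown  (high: fires mostly on known entities)
    unknown_score = f_unknown - f_known  (high: fires mostly on unknown entities)
    """
    scores = []
    for j in range(len(known_acts[0])):
        f_known = activation_frequency(known_acts, j)
        f_unknown = activation_frequency(unknown_acts, j)
        scores.append((f_known - f_unknown, f_unknown - f_known))
    return scores


def max_min_latent(per_type_scores):
    """Pick the latent whose minimum score across entity types is largest."""
    n_latents = len(next(iter(per_type_scores.values())))
    worst = [min(s[j] for s in per_type_scores.values()) for j in range(n_latents)]
    best = max(range(n_latents), key=lambda j: worst[j])
    return best, worst[best]


# Toy SAE activations: rows are prompts, columns are latents.
known = [[1.2, 0.0], [0.7, 0.0], [0.9, 0.3]]
unknown = [[0.0, 0.5], [0.0, 0.8], [0.1, 0.4]]
print(separation_scores(known, unknown))
print(max_min_latent({"player": [0.9, 0.2], "movie": [0.6, 0.8]}))
```

Taking the minimum across entity types before the maximum favors latents that separate known from unknown entities for every type, which matches the generalization claim in the highlights.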

Steering Mechanism

Inject the identified feature vector into the residual stream to alter generation

Model or implementation: Gemma 2 IT (Chat)
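The injection step can be sketched as plain vector addition. The example assumes steering means adding a scaled copy of the latent's direction to the residual vector at every token position; `steer`, `alpha`, and the toy vectors are hypothetical, and a real intervention would run inside a forward hook at one chosen layer of the model.

```python
def steer(residual, direction, alpha):
    """Return a copy of the residual stream with alpha * direction added
    to every token position.

    residual:  list of per-token vectors (lists of floats)
    direction: feature direction, same length as each token vector
    alpha:     steering strength (a free parameter in this sketch)
    """
    return [
        [h + alpha * d for h, d in zip(token_vec, direction)]
        for token_vec in residual
    ]


residual = [[1.0, 2.0], [0.0, 0.0]]   # two token positions, width 2
unknown_direction = [0.5, -1.0]       # stand-in for an SAE decoder direction
print(steer(residual, unknown_direction, alpha=2.0))
```

A positive `alpha` pushes activations toward the feature and a negative one pushes away from it, so the same routine covers both forcing a refusal and forcing an answer.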

Novel Architectural Elements

Use of SAE latents specifically for 'knowledge awareness' detection across entity types

Modeling

Base Model: Gemma 2 2B and 9B (Base and Instruct versions); Llama 3.1 8B (Appendix)

Training Method: Analysis of pre-trained models using pre-trained SAEs (Gemma Scope)

Compute: Not reported in the paper

Comparison to Prior Work

vs. RepE: Uses unsupervised SAEs to find sparse, monosemantic features rather than supervised linear probes or PCA directions
vs. Arditi et al.: Focuses on *knowledge* refusal (I don't know) rather than *safety* refusal (I can't say)
vs. Yuksekgonul et al.: Identifies the causal mechanism (SAE latent) that *controls* the attention drop they observed

Limitations

Labeling of known/unknown entities relies on model generation and fuzzy matching, which may be noisy
Analysis focuses on specific entity types (players, movies, cities, songs); generalization to abstract concepts not tested
Does not solve the problem of models being confident but wrong (confidently hallucinating)
Relies on the quality and availability of pre-trained SAEs (Gemma Scope)

Reproducibility

Code: https://github.com/javiferran/sae_entities

Relies on Gemma Scope and Llama Scope (public SAE suites). Dataset construction from Wikidata is described in detail.

📊 Experiments & Results

Evaluation Setup

Steering experiments on factual questions about known/unknown entities

Benchmarks:

Custom Wikidata Entity Dataset (Factual Recall / Refusal) [New]

Metrics:

Latent Separation Score
Refusal Rate
Attention Score (to entity tokens)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Custom Wikidata Entity Dataset	MaxMin Separation Score	Near 0	High (Peak at Layer 9)	Significant increase
Custom Wikidata Entity Dataset	Refusal Rate (Unknown Entities)	Variable (<50%)	~100%	Large increase
Custom Wikidata Entity Dataset	Refusal Rate (Orthogonalized)	Variable	Large reduction	Significant decrease
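The 'Orthogonalized' row refers to removing a direction rather than adding it. A minimal sketch of that projection, assuming the operation is h' = h - (h · d̂) d̂ with d̂ the unit-normalized direction, is shown below. Whether the projection is applied to activations or folded into the model weights is not stated in this summary, so the example operates on a single vector; the function name is hypothetical.

```python
import math


def orthogonalize(vec, direction):
    """Remove the component of `vec` that lies along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(h * u for h, u in zip(vec, unit))   # projection coefficient
    return [h - coeff * u for h, u in zip(vec, unit)]


h = [3.0, 4.0]
d = [2.0, 0.0]
h_ablated = orthogonalize(h, d)
print(h_ablated)
# The result has no remaining component along d.
print(sum(a * b for a, b in zip(h_ablated, d)))
```

After this projection the model can no longer read anything from that direction, which is why it serves as the ablation counterpart to steering.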

Experiment Figures

Scatter plot of SAE latent activations for known vs. unknown entities.

Bar charts showing refusal rates under different steering conditions.

Main Takeaways

Entity recognition latents are 'universal' across entity types (e.g., a latent found for movies also detects unknown cities)
Chat models repurpose features found in the base model (pre-training) to implement instruction-tuned refusal behaviors
Mechanism: The 'unknown' latent suppresses attention heads in later layers that usually copy entity attributes to the output, effectively breaking the factual recall circuit
There are distinct 'uncertainty' latents that predict incorrect answers even when the model attempts to answer (does not refuse)

📚 Prerequisite Knowledge

Prerequisites

Sparse Autoencoders (SAEs) and their role in disentangling superposition
Transformer architecture (residual streams, attention heads)
Mechanistic Interpretability concepts (activation patching, steering)

Key Terms

SAE: Sparse Autoencoder, an unsupervised model that decomposes dense neural network representations into sparse, interpretable features (latents)

residual stream: The primary vector pathway in a Transformer where information is added by attention and MLP layers

activation steering: Intervention technique where a specific vector (feature direction) is added to the model's internal activations to influence behavior

knowledge refusal: When a model declines to answer a query because it lacks the necessary factual information

JumpReLU: A specific activation function for SAEs that zeroes out values below a threshold and passes others linearly
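As a small illustration of the definition above, the sketch below applies a JumpReLU with one threshold per latent. The strict comparison at the threshold and the example values are choices made for this sketch; in a trained SAE the thresholds are learned parameters.

```python
def jump_relu(pre_activations, thresholds):
    """Pass a value through unchanged if it exceeds its threshold, else output 0."""
    return [z if z > t else 0.0 for z, t in zip(pre_activations, thresholds)]


def relu(pre_activations):
    """Standard ReLU, shown for comparison."""
    return [max(z, 0.0) for z in pre_activations]


z = [0.3, 1.5, -0.2, 0.9]
theta = [0.5, 1.0, 0.1, 0.8]
print(relu(z))             # 0.3 survives a plain ReLU
print(jump_relu(z, theta)) # 0.3 is below its threshold and is zeroed
```

Unlike a shifted ReLU, values above the threshold keep their full magnitude, so weak activations are dropped without shrinking strong ones.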

logit difference: The difference in prediction scores between two competing tokens (e.g., 'Yes' vs 'No'), used to measure model preference
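A short sketch shows why the logit difference is a convenient preference measure: under a softmax it equals the log-odds between the two tokens, regardless of every other token in the vocabulary. The token names and logit values are made up for illustration.

```python
import math


def logit_difference(logits, token_a, token_b):
    """Logit of token_a minus logit of token_b."""
    return logits[token_a] - logits[token_b]


def softmax(logits):
    """Softmax over a dict of token -> logit (max-subtracted for stability)."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


logits = {"Yes": 4.0, "No": 1.5, "Maybe": 0.0}
diff = logit_difference(logits, "Yes", "No")
probs = softmax(logits)
log_odds = math.log(probs["Yes"] / probs["No"])
print(diff, log_odds)  # the two quantities agree up to floating-point error
```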

activation patching: A technique to isolate the causal effect of specific model components by swapping activations between two different runs (e.g., known vs. unknown entity inputs)
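The idea can be sketched on a toy two-layer function instead of a real Transformer. The code runs a 'clean' and a 'corrupted' input, then reruns the corrupted input while overwriting one hidden unit at a time with its clean value, and records how far the output moves. The toy layers and all names are hypothetical stand-ins for model components.

```python
def layer1(x):
    """Toy first layer producing a two-unit hidden state."""
    return [2.0 * x[0], x[1] + 1.0]


def layer2(h):
    """Toy readout from the hidden state to a scalar output."""
    return 3.0 * h[0] - h[1]


def run(x, patch=None):
    """Run the toy model; `patch` maps hidden index -> replacement value."""
    h = layer1(x)
    if patch:
        for idx, value in patch.items():
            h[idx] = value
    return layer2(h), h


def patching_effects(clean_x, corrupted_x):
    """Effect on the output of restoring each hidden unit to its clean value."""
    clean_out, clean_h = run(clean_x)
    corrupted_out, _ = run(corrupted_x)
    effects = {}
    for idx in range(len(clean_h)):
        patched_out, _ = run(corrupted_x, patch={idx: clean_h[idx]})
        effects[idx] = patched_out - corrupted_out
    return clean_out, corrupted_out, effects


print(patching_effects([1.0, 0.0], [0.0, 0.0]))
```

Here hidden unit 0 accounts for the whole gap between the two runs and unit 1 for none of it, which is the kind of localization the technique provides when the two runs are known-entity and unknown-entity prompts.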