Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs

📝 Paper Summary

Mechanistic Interpretability, Knowledge Storage and Retrieval, Transformer Circuit Analysis

Factual recall in LLMs is performed by four distinct mechanisms (Subject Heads, Relation Heads, Mixed Heads, MLPs) that additively combine to boost the correct answer.

Core Problem

While prior work localized knowledge to early MLP layers, it remained unclear how models move and use this information to surface specific facts in the final output.

Why it matters:

Understanding how LLMs store and retrieve knowledge is crucial for interpretability, safety, and editing model behaviors
Narrow circuit analysis often neglects secondary information sources (like relation tokens), giving an incomplete picture of how the model arrives at its predictions
Phenomena like the 'reversal curse' (A is B does not imply B is A) lack a clear mechanistic explanation

Concrete Example: For the prompt 'Fact: The Colosseum is in the country of', the model must combine information about the subject 'Colosseum' (e.g., implies Italy, Rome) and the relation 'country' (e.g., implies Italy, Spain) to output 'Italy' rather than 'Rome' or 'Spain'.

Key Novelty

The Additive Motif for Factual Recall

LLMs solve factual recall using multiple independent components (Subject Heads, Relation Heads, Mixed Heads, MLPs) that individually push for different attribute sets but constructively interfere on the correct answer
Introduces 'DLA by source token group' to disentangle Mixed Heads, showing they are sums of separate updates from Subject and Relation tokens
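The idea behind 'DLA by source token group' can be illustrated with a minimal NumPy sketch. It relies on the fact that an attention head's output at the final position is a weighted sum over source positions, so its direct logit effect splits exactly into per-source terms. The function name `dla_by_source_group`, the toy shapes, and the random weights are all hypothetical, and the sketch ignores the final layer norm; it is not the paper's implementation.

```python
import numpy as np


def dla_by_source_group(attn, values, W_O, W_U, token_id, groups):
    """Split one head's direct logit effect on `token_id` by source-token group.

    attn:   (n_src,) attention weights from the final position to each source
    values: (n_src, d_head) value vector at each source position
    W_O:    (d_head, d_model) output projection of the head
    W_U:    (d_model, n_vocab) unembedding matrix
    groups: dict mapping a group name to a list of source positions
    """
    # Each source position writes (attn * value) @ W_O into the residual stream.
    per_source = (attn[:, None] * values) @ W_O       # (n_src, d_model)
    # Project every per-source write onto the answer token's unembedding column.
    per_source_logit = per_source @ W_U[:, token_id]  # (n_src,)
    return {name: float(per_source_logit[pos].sum()) for name, pos in groups.items()}


# Hypothetical toy shapes and random weights, not taken from Pythia-2.8b.
rng = np.random.default_rng(0)
n_src, d_head, d_model, n_vocab = 6, 4, 8, 10
attn = rng.random(n_src)
attn /= attn.sum()  # attention weights sum to 1
values = rng.normal(size=(n_src, d_head))
W_O = rng.normal(size=(d_head, d_model))
W_U = rng.normal(size=(d_model, n_vocab))
groups = {"other": [0, 1], "subject": [2, 3], "relation": [4, 5]}

parts = dla_by_source_group(attn, values, W_O, W_U, token_id=3, groups=groups)
# The head's total DLA on the same token, computed without any grouping.
total = float(((attn @ values) @ W_O) @ W_U[:, 3])
```

Because the groups partition the source positions, the per-group values in `parts` sum to `total`. This is the property that lets a Mixed Head be read as a subject update plus a relation update.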

Architecture

A conceptual diagram of the four independent mechanisms (Subject Heads, Relation Heads, Mixed Heads, MLPs) contributing to the final output.

Evaluation Highlights

Identified distinct 'Subject Heads' (subject-to-relation DLA ratio > 10) that extract subject attributes regardless of the relation in the prompt
Identified distinct 'Relation Heads' (relation-to-subject DLA ratio > 10) that extract attributes matching the relation regardless of the subject
Demonstrated that 'Mixed Heads' act as the sum of both subject and relation extractions, attending to both positions simultaneously
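The ratio thresholds above suggest a simple classification rule, sketched below. The function `classify_head`, the use of absolute values, and the treatment of every remaining head as 'mixed' are assumptions made for illustration; the summary only states the > 10 thresholds, and the example numbers are hypothetical.

```python
def classify_head(subject_dla, relation_dla, ratio=10.0):
    """Label a head from its DLA attributed to subject vs. relation tokens."""
    s, r = abs(subject_dla), abs(relation_dla)
    if s > ratio * r:
        return "subject"   # subject-sourced DLA dominates by more than `ratio`
    if r > ratio * s:
        return "relation"  # relation-sourced DLA dominates by more than `ratio`
    return "mixed"         # neither source dominates (assumed fallback)


# Hypothetical per-head (subject DLA, relation DLA) pairs.
labels = {
    "head_a": classify_head(1.10, 0.05),
    "head_b": classify_head(0.02, 0.65),
    "head_c": classify_head(0.50, 0.40),
}
```

Comparing `s > ratio * r` instead of dividing avoids a division by zero when one source contributes nothing.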

Breakthrough Assessment

7/10

Provides a significant advance in mechanistic understanding of factual recall by decomposing it into additive components, though limited to a small dataset and specific model (Pythia-2.8b).

⚙️ Technical Details

Problem Definition

Setting: Factual recall in autoregressive transformer models

Inputs: Prompts of form 'Fact: [Subject s] [Relation r]'

Outputs: The correct attribute [a] (e.g., 'Italy') as the next token

Pipeline Flow

Subject Enrichment (MLPs)
Mechanism 1: Subject Heads (Extract S)
Mechanism 2: Relation Heads (Extract R)
Mechanism 3: Mixed Heads (Extract S and R)
Mechanism 4: MLPs (Boost R)
Constructive Interference (Summation)

System Modules

Subject Enrichment

Enrich internal representations of subjects with known facts in early MLP layers

Model or implementation: MLP Layers (early)

Subject Heads (Extraction)

Attend to SUBJECT and extract attributes relevant to the subject (Set S)

Model or implementation: Attention Heads

Relation Heads (Extraction)

Attend to RELATION and extract attributes relevant to the relation type (Set R)

Model or implementation: Attention Heads

Mixed Heads (Extraction)

Attend to both SUBJECT and RELATION, performing both extraction tasks simultaneously

Model or implementation: Attention Heads

Output MLPs (Extraction)

Boost many attributes in the relation set R (similar to Relation Heads)

Model or implementation: MLP Layers (late)

Novel Architectural Elements

Decomposition of attention head outputs into additive components based on source token attention (Subject vs. Relation contribution within a single head)

Modeling

Base Model: Pythia-2.8b

Comparison to Prior Work

vs. ROME: Focuses on how information is moved/used after storage, not just localization
vs. Geva et al.: Finds mechanisms are additive and parallel rather than purely sequential; identifies 'Mixed Heads' and specific 'Relation Heads'
vs. Hernandez et al.: Identifies the specific transformer components (heads) that implement the linear decoding in the wild

Limitations

Analysis focuses primarily on Pythia-2.8b; generalization to larger models and other model families is only briefly examined
Dataset is small and hand-written
Does not explain *why* the additive mechanism is preferred during training, only that it exists
Assumes 'attributes' and 'facts' align cleanly with the token sets S and R, which may not hold exactly in practice

Reproducibility

The paper uses Pythia-2.8b (open weights). The dataset is hand-written but inspired by CounterFact and ParaRel. No specific code repository is provided in the text.

📊 Experiments & Results

Evaluation Setup

Mechanistic analysis of factual recall prompts

Benchmarks:

Custom Factual Dataset (Factual Recall / Sentence Completion) [New]

Metrics:

Direct Logit Attribution (DLA)
Attention Ratio (Subject vs. Relation)
Logit Rank
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Custom Factual Dataset	DLA Ratio (Subject/Relation)	10	10	0
Custom Factual Dataset	DLA (Subject Head L17H2)	0.1	1.1	+1.0
Custom Factual Dataset	DLA (Relation Head L13H31)	0.6	0.65	+0.05
Custom Factual Dataset	Attention Pattern (Subject Heads)	Low	High (to Subject)	Significant

Experiment Figures

Scatter plot of Attention Heads based on DLA attributed to Subject vs. Relation tokens.

Bar charts showing DLA contributions of top heads for specific attributes (Subject vs Relation attributes).

Main Takeaways

Models use a 'sum of parts' strategy: Subject heads provide subject-specific guesses, Relation heads provide valid types, and Mixed heads provide both.
Constructive Interference: While individual heads might not rank the correct answer #1 (e.g., Relation heads boost all countries), the sum of Subject + Relation + Mixed heads pushes the correct fact (Intersection of S and R) to the top.
Subject heads 'misfire': They extract subject attributes even when the relation prompt is different (e.g., extracting sports when asked for a country).
Reversal Curse Explanation: The mechanism is unidirectional (extracting B from A). Training on 'A is B' strengthens A->B heads but does not create B->A heads.
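The constructive-interference takeaway can be made concrete with a toy calculation on the Colosseum example. All logit values below are hypothetical and chosen only to illustrate the additive motif; they are not measurements from the paper.

```python
# Hypothetical logit boosts from two component types for the prompt
# 'Fact: The Colosseum is in the country of'.
subject_update = {"Italy": 2.0, "Rome": 2.5, "Spain": 0.0}   # set S: Colosseum-related
relation_update = {"Italy": 2.0, "Rome": 0.0, "Spain": 2.2}  # set R: countries


def top_token(logits):
    """Return the token with the highest logit."""
    return max(logits, key=logits.get)


def rank(logits, token):
    """1-based rank of `token` (1 = highest logit)."""
    return 1 + sum(v > logits[token] for v in logits.values())


# Components write additively, so their logit effects are summed.
combined = {t: subject_update[t] + relation_update[t] for t in subject_update}

subject_top = top_token(subject_update)    # best guess from subject information alone
relation_top = top_token(relation_update)  # best guess from relation information alone
combined_top = top_token(combined)         # best guess after summation
```

In this toy setup neither component ranks 'Italy' first on its own ('Rome' wins on subject information, 'Spain' on relation information), yet the sum puts 'Italy', the only token in both S and R, on top.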

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Attention, MLP, Residual Stream)
Mechanistic Interpretability concepts (Circuits, Logit Lens)
Linear Algebra (vector spaces, projections)

Key Terms

DLA: Direct Logit Attribution, a technique that measures the direct contribution of a model component to the output logits by projecting the component's output onto the unembedding matrix

Residual Stream: The primary vector space in Transformers where information accumulates layer by layer via addition

Mixed Heads: Attention heads that attend to both subject and relation tokens, effectively performing two distinct additive updates simultaneously

Logit Lens: A technique interpreting internal activations by applying the final unembedding matrix to see what token the model would predict at intermediate layers

Subject Heads: Heads attending primarily to the subject to extract subject-related attributes (e.g., knowing Colosseum implies Rome)

Relation Heads: Heads attending primarily to the relation tokens to extract valid relation types (e.g., knowing 'country of' implies a list of countries)

Reversal Curse: The phenomenon where LLMs trained on 'A is B' fail to generalize to answering 'B is A'

Unembedding: The final linear layer of a language model, which maps the residual stream state to logits over the vocabulary; a softmax then turns these logits into a probability distribution
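The DLA, Residual Stream, and Unembedding definitions fit together as in the sketch below: because the residual stream is a sum of component outputs and the unembedding is linear, the final logits equal the sum of each component's DLA. The component names, shapes, and random values are hypothetical, and the final layer norm is omitted for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_vocab = 8, 5
W_U = rng.normal(size=(d_model, n_vocab))  # toy unembedding matrix

# Hypothetical outputs written into the residual stream at the final position.
component_outputs = {
    "embedding": rng.normal(size=d_model),
    "attn_head": rng.normal(size=d_model),
    "mlp": rng.normal(size=d_model),
}

# Residual stream: information accumulates by addition.
resid = sum(component_outputs.values())
logits = resid @ W_U

# DLA: project each component's output onto the unembedding separately.
dla = {name: out @ W_U for name, out in component_outputs.items()}
recombined = sum(dla.values())

# Softmax converts logits into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

Here `recombined` matches `logits` up to floating-point error, which is why per-component DLA values can be compared and summed directly.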