Transformer feed-forward layers are key-value memories

📝 Paper Summary

Transformer Interpretability Mechanism Analysis

Feed-forward layers in Transformers function as key-value memories where keys detect specific input patterns (shallow or semantic) and values induce probability distributions over the output vocabulary.

Core Problem

Feed-forward layers constitute two-thirds of a Transformer's parameters, yet their specific role and internal mechanics remain largely unexplained compared to self-attention layers.

Why it matters:

Understanding the dominant parameter blocks (FFNs) is crucial for interpretability, model debugging, and architectural improvements.
Prior work focused heavily on self-attention, leaving a gap in understanding how the bulk of the model's capacity stores and retrieves information.
Clarifying this mechanism could aid in controlling model predictions or understanding data privacy risks (memorization).

Concrete Example: For a prefix ending with '...no substitutes', a specific FFN key activates. Unless we know that this key responds to the pattern 'prefix ends with the word substitutes' and that its value promotes particular next tokens, we cannot explain why the model outputs a specific word next.

Key Novelty

Feed-Forward Layers as Key-Value Memories

Reinterprets the first parameter matrix of an FFN as 'keys' that match input patterns (a soft lookup) and the second matrix as 'values' that induce distributions over output tokens.
Demonstrates that lower layers act as shallow pattern detectors (n-grams), while upper layers detect semantic concepts (topics).
Shows that the final prediction is a composition of these memory retrievals, refined layer-by-layer via residual connections.
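
The correspondence can be written out explicitly. The block below is a sketch in the usual key-value memory notation, with bias terms omitted; here d is the hidden dimension, d_m the number of memories, and f the ReLU non-linearity. Read this way, a feed-forward layer differs from a key-value memory only in using f where the memory uses a softmax:

```latex
\mathrm{FF}(\mathbf{x}) = f(\mathbf{x} \cdot K^{\top}) \cdot V
\qquad\text{vs.}\qquad
\mathrm{MN}(\mathbf{x}) = \mathrm{softmax}(\mathbf{x} \cdot K^{\top}) \cdot V,
\qquad K, V \in \mathbb{R}^{d_m \times d}

% Per-memory view: each key k_i yields a coefficient m_i that weights its value v_i
m_i = f(\mathbf{x} \cdot \mathbf{k}_i), \qquad
\mathrm{FF}(\mathbf{x}) = \sum_{i=1}^{d_m} m_i \, \mathbf{v}_i
```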

Architecture

Illustration of the mapping between Feed-Forward Layer mathematics and Key-Value Memory mechanics.

Evaluation Highlights

In upper layers (11-16), the top token of a value vector matches the next token of its key's top trigger example in up to 3.5% of cases, orders of magnitude above the roughly 0.0004% expected from a random guess over the vocabulary.
Human experts identified coherent patterns for 100% of analyzed keys (average 3.6 patterns per key).
In 68% of examples, the layer's output prediction differs from the top prediction of every individual memory, indicating that the layer composes memories rather than retrieving a single one.

Breakthrough Assessment

8/10

Provides a fundamental mechanistic explanation for the majority of Transformer parameters. Shifts the mental model of FFNs from generic non-linearities to interpretable memory stores.

⚙️ Technical Details

Problem Definition

Setting: Analysis of pre-trained Transformer language models

Inputs: Input vector x (representation of text prefix)

Outputs: Next token probability distribution

Pipeline Flow

Input x -> Key Matrix K (Pattern Matching)
Non-linearity (ReLU)
Value Matrix V (Distribution Generation)
Residual Aggregation
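
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the model's actual implementation: bias terms and layer normalization are left out, the sizes are toy values, and the function name `ffn_as_memory` is made up for this example.

```python
import numpy as np

def ffn_as_memory(x, K, V):
    """Feed-forward layer read as a key-value memory (biases omitted).

    x: (d,) input vector; K: (d_m, d) keys, one per row; V: (d_m, d) values.
    Returns the layer output with the residual added, plus the coefficients.
    """
    scores = K @ x                  # pattern matching: one score per key
    m = np.maximum(scores, 0.0)     # ReLU: memories with negative scores are inactive
    ff_out = m @ V                  # coefficient-weighted sum of value vectors
    return x + ff_out, m            # residual aggregation


rng = np.random.default_rng(0)
d, d_m = 8, 32                      # toy sizes, far smaller than 1024 / 4096
x = rng.normal(size=d)
K = rng.normal(size=(d_m, d))
V = rng.normal(size=(d_m, d))
out, m = ffn_as_memory(x, K, V)
```

Only the memories with a positive coefficient in `m` contribute to `out`; the rest are filtered out by the ReLU.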

System Modules

Key Matrix (First Linear Layer) (Pattern Detection)

Acts as a set of pattern detectors; each row is a key that activates when the input matches a specific textual pattern.

Model or implementation: Linear Layer (d_model -> d_ff)

Non-linearity (Memory Filtering)

Filters out negative matches (inactive memories).

Model or implementation: ReLU

Value Matrix (Second Linear Layer)

Stores the output distribution associated with each key pattern.

Model or implementation: Linear Layer (d_ff -> d_model)

Novel Architectural Elements

Conceptual mapping: Treating standard FFN weights explicitly as discrete Key-Value pairs for analysis (not a new architecture, but a novel functional interpretation).

Modeling

Base Model: 16-layer Transformer Language Model (Baevski and Auli, 2019)

Training Method: Standard Language Modeling (Next Token Prediction)

Training Data:

WikiText-103

Key Hyperparameters:

layers: 16
hidden_dimension_d: 1024
ff_dimension_dm: 4096
total_potential_keys: 65536
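
The key count follows from the other hyperparameters, and the same numbers reproduce the two-thirds parameter share quoted in the problem statement. The check below is a rough sketch that assumes a standard attention block with four d × d projection matrices and ignores biases, layer normalization, and embeddings; those assumptions are not stated in this summary.

```python
layers, d, d_m = 16, 1024, 4096

total_keys = layers * d_m        # one key per feed-forward hidden unit
ffn_weights = 2 * d * d_m        # key matrix + value matrix, per layer
attn_weights = 4 * d * d         # Q, K, V, and output projections (assumed standard)
ffn_share = ffn_weights / (ffn_weights + attn_weights)

print(total_keys)                # 65536
print(round(ffn_share, 3))       # 0.667
```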

Compute: Not reported in the paper

Comparison to Prior Work

vs. Sukhbaatar et al. (2019): This paper analyzes existing canonical Transformers rather than proposing a new memory-augmented architecture.
vs. Neuro-X: Focuses specifically on the key-value structure of FFNs and the compositionality of the output, rather than just classifying neuron roles.
vs. Tenney et al.: Provides a mechanistic explanation (keys/values) for the layer-wise trends (shallow vs. semantic) observed in prior work.

Limitations

Analysis is limited to a specific Transformer architecture and dataset (WikiText-103); generalization to other architectures (e.g., BERT, GPT-3) is assumed but not tested.
The projection of value vectors to vocabulary space assumes layers operate in a shared embedding space, which may not hold strictly for lower layers.
Pattern identification relies on human annotation, which is subjective and time-consuming.
Does not propose an automated method to improve models based on these insights.

Reproducibility

Code: https://github.com/mega002/ff-layers/

Code is publicly available. The analysis relies on a standard pre-trained model (Baevski and Auli, 2019) trained on public data (WikiText-103).

📊 Experiments & Results

Evaluation Setup

Qualitative and quantitative analysis of neuron activations and weights in a pre-trained LM.

Benchmarks:

WikiText-103 (Language Modeling Analysis)

Metrics:

Pattern coherence (Human Evaluation)
Agreement rate (Top-1 Value prediction vs. Top-1 Trigger next-token)
Memory coefficient magnitude
Residual agreement probability
Statistical methodology: Not explicitly reported in the paper
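
The agreement-rate metric can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `E` stands for the output embedding matrix, `top_trigger_next_token` holds the token that follows each key's top trigger example, and all names and the toy data are hypothetical.

```python
import numpy as np

def agreement_rate(V, E, top_trigger_next_token):
    """Fraction of memories whose value's top token matches the trigger's next token.

    V: (d_m, d) value vectors; E: (vocab, d) output embedding matrix;
    top_trigger_next_token: (d_m,) token id following each key's top trigger example.
    """
    logits = V @ E.T                     # project every value vector onto the vocabulary
    top_tokens = logits.argmax(axis=1)   # softmax is monotonic, so argmax of logits suffices
    return float((top_tokens == top_trigger_next_token).mean())


rng = np.random.default_rng(0)
d, d_m, vocab = 8, 16, 50
E = rng.normal(size=(vocab, d))
V = rng.normal(size=(d_m, d))
targets = (V @ E.T).argmax(axis=1)        # construct perfect agreement...
targets[:4] = (targets[:4] + 1) % vocab   # ...then break it for 4 of 16 memories
print(agreement_rate(V, E, targets))      # 0.75
```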

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
WikiText-103	Identifiable Patterns	Not reported in the paper	100% of sampled keys	Not reported in the paper
WikiText-103	Trigger Coverage	Not reported in the paper	65%-80%	Not reported in the paper
WikiText-103	Agreement Rate (Top-1)	0.0004% (random chance)	3.5%	≈ +3.5 pp
WikiText-103	Active Memories per Layer	Not reported in the paper	10-50% of 4096 dimensions	Not reported in the paper
WikiText-103	Unique Layer Prediction	0%	68%	+68 pp

Experiment Figures

Breakdown of expert-assigned labels (Shallow vs. Semantic) for trigger examples across layers 1-16.

Agreement rate between the value vector's top prediction and the key's trigger example's next token.

Main Takeaways

Lower layers (1-9) capture shallow patterns (n-grams), while upper layers (10-16) capture semantic patterns (topics).
Feed-forward layers act as pattern detectors where keys identify input contexts and values suggest next-token distributions.
The final prediction is not retrieved from a single memory but composed from hundreds of active memories.
Residual connections act as a refinement mechanism: bottom layers make rough predictions, which upper layers refine or veto.
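
The compositionality takeaway can be made concrete with a small check: does the layer's top token coincide with the top token of any single active memory? The sketch below uses hand-picked toy matrices and a hypothetical helper name; it ignores biases and is only meant to show the shape of the test.

```python
import numpy as np

def layer_prediction_is_novel(m, V, E):
    """True if the layer's top token differs from every active memory's top token.

    m: (d_m,) memory coefficients; V: (d_m, d) values; E: (vocab, d) output embeddings.
    """
    layer_top = (E @ (m @ V)).argmax()               # top token of the composed output
    active = m > 0
    memory_tops = (V[active] @ E.T).argmax(axis=1)   # top token of each active memory
    return bool(layer_top not in memory_tops)


# Toy setup: memory 0 prefers token 0, memory 1 prefers token 1,
# but their sum points most strongly at token 2.
E = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.8]])
V = np.array([[1.0, 0.0], [0.0, 1.0]])

print(layer_prediction_is_novel(np.array([1.0, 1.0]), V, E))  # True
print(layer_prediction_is_novel(np.array([1.0, 0.0]), V, E))  # False
```

In the first call neither memory alone predicts token 2, yet their combination does, which is the kind of case the 68% figure counts.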

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Self-attention vs. Feed-forward)
Key-Value Memory Networks
Vector dot products and Softmax

Key Terms

Feed-Forward Layer (FFN): The position-wise processing block in a Transformer layer, consisting of two linear transformations with a non-linearity in between.

Key-Value Memory: A mechanism where an input is compared against 'keys' to compute weights, which are then used to retrieve a weighted sum of 'values'.

Trigger Example: A training example (text prefix) that ranks among those with the highest memory coefficient for a specific key (neuron).

Residual Connection: A skip connection that adds the input of a layer to its output, allowing gradients to flow more easily and information to be preserved.

Memory Coefficient: The scalar activation value resulting from the dot product of the input and a key vector (after non-linearity).

Softmax: A function that converts a vector of numbers into a probability distribution.

ReLU: Rectified Linear Unit—a non-linear activation function that outputs the input if positive, otherwise zero.
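
The Trigger Example and Memory Coefficient definitions above combine into a simple retrieval procedure: compute the coefficient of one key for every training prefix and keep the highest-scoring prefixes. The sketch below assumes the prefix representations are already available as rows of a matrix; the function name and the cutoff `t` are illustrative choices.

```python
import numpy as np

def top_trigger_examples(prefix_reprs, key, t):
    """Indices and coefficients of the t prefixes that activate one key most strongly.

    prefix_reprs: (n_prefixes, d) layer inputs, one per training prefix; key: (d,).
    """
    coeffs = np.maximum(prefix_reprs @ key, 0.0)   # ReLU(x . k) for every prefix
    order = np.argsort(-coeffs)[:t]                # descending by coefficient
    return order, coeffs[order]


prefix_reprs = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]])
key = np.array([1.0, 0.0])
idx, vals = top_trigger_examples(prefix_reprs, key, t=2)
print(idx.tolist(), vals.tolist())                 # [2, 0] [2.0, 1.0]
```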