A Primer on the Inner Workings of Transformer-based Language Models

📝 Paper Summary

Mechanistic Interpretability, Transformer Analysis

This primer provides a unified technical framework and notation for understanding the internal mechanisms of decoder-only Transformers, categorizing techniques into localization of predictions and decoding of learned representations.

Core Problem

Rapid progress in large language models has created a need to contextualize dispersed interpretability insights and standardize the technical understanding of how these black-box models process information internally.

Why it matters:

Understanding internal mechanisms is crucial for ensuring AI safety, fairness, and error mitigation in critical settings
Previous surveys focused heavily on encoder-based models like BERT, leaving a gap for modern decoder-only generative architectures
Disparate notations and terminologies across research papers make it difficult to connect insights and identify common mechanisms

Concrete Example: In a purely linear network, the hidden representation can be rotated by any orthogonal matrix without changing the network's output, because the rotation is absorbed into the adjacent weight matrices. No set of directions is special, which makes it hard to isolate meaningful features. The paper explains how the element-wise nonlinearity in Feed-Forward Networks breaks this invariance and creates a 'privileged basis' that encourages features to align with specific neurons. This insight is crucial for interpretability and is often misunderstood without a unified theoretical view.
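The rotation argument can be checked numerically. The following is a minimal NumPy sketch, assuming a toy two-layer network with ReLU as the element-wise nonlinearity; the names (W_in, W_out, R) and sizes are illustrative assumptions, not notation from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 4, 8  # illustrative sizes (assumption)

W_in = rng.normal(size=(d_model, d_hidden))
W_out = rng.normal(size=(d_hidden, d_model))
x = rng.normal(size=d_model)  # row-vector convention

# Random orthogonal matrix (R @ R.T == I), obtained from a QR decomposition.
R, _ = np.linalg.qr(rng.normal(size=(d_hidden, d_hidden)))

def relu(z):
    return np.maximum(z, 0.0)

# Linear network: rotating the hidden basis changes nothing,
# because R and R.T cancel between the two weight matrices.
linear_original = x @ W_in @ W_out
linear_rotated = x @ (W_in @ R) @ (R.T @ W_out)

# With an element-wise nonlinearity the rotation no longer cancels,
# so the neuron basis is singled out ("privileged").
relu_original = relu(x @ W_in) @ W_out
relu_rotated = relu(x @ (W_in @ R)) @ (R.T @ W_out)

linear_matches = np.allclose(linear_original, linear_rotated)
relu_matches = np.allclose(relu_original, relu_rotated)
print("linear network invariant to rotation:", linear_matches)
print("ReLU network invariant to rotation:  ", relu_matches)
```

For a generic random rotation, the linear outputs agree while the ReLU outputs do not.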

Key Novelty

Unified Technical Primer for Transformer Interpretability

Establishes a unified mathematical notation to describe model components (Residual Stream, Attention, MLP) and interpretability methods, revealing connections between seemingly different approaches
Categorizes the vast literature into two primary dimensions: 'localization' (identifying components responsible for predictions) and 'decoding' (extracting information from representations)
Synthesizes discrete mechanical insights (e.g., OV circuits, induction heads) into a comprehensive overview of known internal mechanisms within a single consistent framework

Architecture

Decomposition of the Transformer architecture into the Residual Stream view

Breakthrough Assessment

8/10

Excellent foundational resource that systematizes the chaotic field of mechanistic interpretability. While it doesn't propose a new model, its unification of notation and concepts is a significant contribution to the research community.

⚙️ Technical Details

Problem Definition

Setting: Analysis of auto-regressive language models parametrizing conditional probability distributions over token sequences

Inputs: Sequence of tokens t = <t_1, t_2, ..., t_n>

Outputs: Probability distribution over the vocabulary for the next token t_{n+1}

Pipeline Flow

Embedding (Token -> Vector)
Transformer Blocks (Repeated L times)
Unembedding (Vector -> Logits)
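The three pipeline stages can be sketched as follows. This is a simplified NumPy illustration under stated assumptions: a single attention head per block, ReLU in the FFN, random weights, and no LayerNorm or positional information. All function and variable names are hypothetical and chosen for readability.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, W_Q, W_K, W_V, W_O):
    """Single causal attention head (simplifying assumption: one head)."""
    n, d_head = x.shape[0], W_Q.shape[1]
    scores = (x @ W_Q) @ (x @ W_K).T / np.sqrt(d_head)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # positions j > i
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ (x @ W_V) @ W_O

def ffn(x, W_in, W_out):
    """Up-projection -> nonlinearity (ReLU assumed) -> down-projection."""
    return np.maximum(x @ W_in, 0.0) @ W_out

def forward(tokens, W_E, blocks, W_U):
    x = W_E[tokens]                          # Embedding: token -> row vector
    for blk in blocks:                       # Transformer blocks, repeated L times
        x = x + attention(x, *blk["attn"])   # each component adds to the stream
        x = x + ffn(x, *blk["ffn"])
    logits = x @ W_U                         # Unembedding: vector -> logits
    return softmax(logits[-1])               # distribution for the next token

rng = np.random.default_rng(0)
vocab, d_model, d_head, d_ff, n_layers = 10, 8, 4, 16, 2
W_E = rng.normal(size=(vocab, d_model))
W_U = rng.normal(size=(d_model, vocab))
blocks = [
    {
        "attn": tuple(rng.normal(scale=0.1, size=s)
                      for s in [(d_model, d_head)] * 3 + [(d_head, d_model)]),
        "ffn": (rng.normal(scale=0.1, size=(d_model, d_ff)),
                rng.normal(scale=0.1, size=(d_ff, d_model))),
    }
    for _ in range(n_layers)
]

next_token_probs = forward(np.array([1, 5, 2]), W_E, blocks, W_U)
print(next_token_probs.shape, next_token_probs.sum())
```

The output is a probability vector over the toy vocabulary for position n+1; with random weights the values themselves carry no meaning.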

System Modules

Input Embedding

Maps discrete tokens to dense row vectors

Model or implementation: Lookup Matrix W_E

Layer Normalization (Transformer Block)

Stabilizes training by centering and scaling representations

Model or implementation: Centering and scaling followed by a learned gain and bias; treated as an affine transformation when the scaling factor is held constant

Attention Heads (Transformer Block)

Contextualizes representations by moving information between positions

Model or implementation: QK (Routing) and OV (Processing) Circuits

Feed-Forward Network (FFN) (Transformer Block)

Processes information at a specific position; acts as key-value memory

Model or implementation: MLP (Up-projection -> Nonlinearity -> Down-projection)
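The key-value memory reading of the FFN can be made concrete with a small sketch. It assumes a ReLU nonlinearity and no bias terms, and it treats column i of the up-projection as a "key" and row i of the down-projection as a "value". The names W_in and W_out are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 6, 12  # illustrative sizes (assumption)

W_in = rng.normal(size=(d_model, d_ff))   # column i acts as key k_i
W_out = rng.normal(size=(d_ff, d_model))  # row i acts as value v_i
x = rng.normal(size=d_model)              # residual stream state at one position

# Standard matrix form: up-projection -> ReLU -> down-projection.
ffn_matrix_form = np.maximum(x @ W_in, 0.0) @ W_out

# Memory form: each neuron's activation measures how well its key matches x,
# and the output is the activation-weighted sum of the value vectors.
coefficients = np.maximum(x @ W_in, 0.0)
ffn_memory_form = sum(c * v for c, v in zip(coefficients, W_out))

same_output = np.allclose(ffn_matrix_form, ffn_memory_form)
n_active = int((coefficients > 0).sum())
print("forms agree:", same_output, "| active neurons:", n_active, "of", d_ff)
```

Both forms compute the same vector; the memory form only makes explicit which neurons contribute to the update.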

Unembedding

Projects the final residual stream state to vocabulary logit space

Model or implementation: Linear Matrix W_U

Novel Architectural Elements

Formalization of the 'Residual Stream' view where every component (Attention, MLP) writes additively to a shared channel
Decomposition of Attention into independent QK (reading/routing) and OV (writing/processing) circuits
Viewing LayerNorm as an affine transformation that can be folded into adjacent weight matrices to simplify linear analysis
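One part of the LayerNorm folding idea is exact and easy to verify: the learned gain and bias can be absorbed into the linear layer that follows. The sketch below assumes a standard LayerNorm (center, scale, then gain gamma and bias beta) followed by a linear map; the names are illustrative. The normalization step itself stays nonlinear unless its scale is treated as constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_out = 3, 8, 5  # illustrative sizes (assumption)

def normalize(x, eps=1e-5):
    """Center and scale each row; the parameter-free part of LayerNorm."""
    centered = x - x.mean(axis=-1, keepdims=True)
    return centered / np.sqrt((centered ** 2).mean(axis=-1, keepdims=True) + eps)

gamma = rng.normal(size=d_model)       # learned gain
beta = rng.normal(size=d_model)        # learned bias
W = rng.normal(size=(d_model, d_out))  # linear layer that reads LayerNorm's output
b = rng.normal(size=d_out)
x = rng.normal(size=(n, d_model))

# Standard computation: LayerNorm with gain and bias, then the linear layer.
standard = (normalize(x) * gamma + beta) @ W + b

# Folded computation: gain scales the rows of W, bias is pushed through W.
W_folded = gamma[:, None] * W
b_folded = beta @ W + b
folded = normalize(x) @ W_folded + b_folded

folding_is_exact = np.allclose(standard, folded)
print("folded weights reproduce the original output:", folding_is_exact)
```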

Modeling

Base Model: Decoder-only Transformer (GPT-like)

Comparison to Prior Work

vs. BERT surveys: Focuses specifically on generative decoder-only dynamics (unidirectional attention, next-token prediction)
vs. General XAI: Focuses on 'mechanistic interpretability' (reverse-engineering weights and circuits) rather than just feature attribution
vs. Elhage et al. (2021) [A Mathematical Framework...]: Formalizes and extends their residual stream/circuit notation into a broader survey context

Limitations

Focuses primarily on decoder-only architectures, though insights may apply elsewhere
Assumes LayerNorm scaling is constant for the linear decomposition analysis (an approximation)
Does not cover state-space models or non-Transformer architectures

Reproducibility

This is a survey/primer paper. No new code or models are released, but it provides a comprehensive notation for reproducing analyses on existing models.

📊 Experiments & Results

Evaluation Setup

Theoretical analysis and synthesis of existing literature

Metrics:

Statistical methodology: Not explicitly reported in the paper

Main Takeaways

Because every component writes additively to the residual stream, a Transformer's output can be mathematically decomposed into a sum of per-component contributions
Attention heads consist of two distinct mechanisms: QK circuits (routing information) and OV circuits (processing information)
Feed-Forward Networks act as key-value memories where keys detect patterns in the input and values write features to the residual stream
Non-linearities are crucial for creating a 'privileged basis' that makes individual neurons interpretable
Model components can be composed into 'virtual' circuits (e.g., virtual attention heads) to explain complex behaviors like induction
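The additive decomposition in the first takeaway can be illustrated with a toy model. In this sketch each component is a stand-in function of the residual stream (a hypothetical simplification, not a real attention head or FFN), and the final LayerNorm is omitted so that the unembedding is purely linear.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, vocab = 3, 8, 10  # illustrative sizes (assumption)

def toy_component(W):
    """Stand-in for an attention head or FFN: any function of the stream."""
    return lambda x: np.tanh(x @ W)

components = {
    name: toy_component(rng.normal(scale=0.3, size=(d_model, d_model)))
    for name in ["attn_1", "ffn_1", "attn_2", "ffn_2"]
}
W_U = rng.normal(size=(d_model, vocab))

x = rng.normal(size=(n, d_model))      # embeddings for n positions
contributions = {"embedding": x.copy()}
for name, component in components.items():
    out = component(x)                 # component reads the current stream
    contributions[name] = out          # record what it writes
    x = x + out                        # additive update of the residual stream

# Logits at the last position, computed normally and as a sum of parts.
logits = x[-1] @ W_U
per_component_logits = {k: c[-1] @ W_U for k, c in contributions.items()}
reconstructed = sum(per_component_logits.values())

decomposition_holds = np.allclose(logits, reconstructed)
print("logits equal the sum of component contributions:", decomposition_holds)
```

The contributions add up exactly, but each one still depends on what earlier components wrote, so the terms are separable in the sum without being causally independent.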

📚 Prerequisite Knowledge

Prerequisites

Linear Algebra (matrix multiplication, subspaces)
Deep Learning basics (Backpropagation, Activation functions)
Transformer Architecture (Attention, LayerNorm, Residual connections)

Key Terms

Residual Stream: The primary vector pathway in a Transformer where information is added by attention and MLP layers, preserving a linear additive structure throughout the network

OV Circuit: Output-Value circuit; the component of an attention head (W_V * W_O) responsible for writing information to the residual stream based on what was attended to

QK Circuit: Query-Key circuit; the component of an attention head (W_Q * W_K^T) responsible for determining which previous tokens to attend to (computing attention patterns)
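The QK and OV definitions above can be checked against the usual per-head computation. This is a minimal sketch with random weights and a single causal head, using the row-vector convention; the variable names follow the key terms, while sizes and helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_head = 4, 8, 2  # illustrative sizes (assumption)

X = rng.normal(size=(n, d_model))  # residual stream, one row per position
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(d_head, d_model))

# Combined circuit matrices, both of shape (d_model, d_model) and low rank.
W_QK = W_Q @ W_K.T
W_OV = W_V @ W_O

def causal_softmax(scores):
    n_pos = scores.shape[0]
    future = np.triu(np.ones((n_pos, n_pos), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# QK circuit: attention scores are a bilinear form in the residual stream.
scores_standard = (X @ W_Q) @ (X @ W_K).T
scores_circuit = X @ W_QK @ X.T

# OV circuit: what is written, given the attention pattern A.
A = causal_softmax(scores_standard / np.sqrt(d_head))
out_standard = A @ (X @ W_V) @ W_O
out_circuit = A @ X @ W_OV

print(np.allclose(scores_standard, scores_circuit),
      np.allclose(out_standard, out_circuit))
```

Writing the head this way separates where information is read from (A, determined by W_QK) and what is written (W_OV).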

Privileged Basis: A specific basis in the activation space (usually aligned with neurons) where individual dimensions carry semantic meaning, enforced by element-wise nonlinearities

LayerNorm: A normalization operation that centers and scales representations; geometrically interpreted as projecting inputs onto a hyperplane and then mapping to a hypersphere
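The geometric reading of LayerNorm can be verified on a single vector. The sketch below omits the learned gain and bias and the small epsilon term (both simplifying assumptions), leaving only centering and scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6  # illustrative dimension (assumption)
x = rng.normal(size=d)
ones = np.ones(d)

# Centering is a linear projection onto the hyperplane orthogonal to (1, ..., 1).
P = np.eye(d) - np.outer(ones, ones) / d
centered = x @ P
centering_is_projection = np.allclose(centered, x - x.mean())

# Dividing by the standard deviation places the result on a sphere of radius sqrt(d).
normalized = centered / centered.std()
radius = float(np.linalg.norm(normalized))

print("centering equals projection:", centering_is_projection)
print("norm after scaling:", radius, "| sqrt(d):", np.sqrt(d))
```

Every input therefore ends up on the intersection of that hyperplane with the sphere before the learned gain and bias are applied.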

Induction Heads: Attention heads that find a previous occurrence of the current token and copy the token that followed it, a mechanism associated with in-context learning
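The behavior attributed to induction heads can be written as a simple lookup rule. The function below is a hypothetical illustration of the input-output pattern only; actual induction heads realize it approximately through attention weights rather than exact matching.

```python
def induction_prediction(tokens):
    """Return the token that followed the most recent earlier occurrence
    of the last token, or None if the last token has not appeared before."""
    current = tokens[-1]
    # Scan earlier positions from right to left for a match.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

prediction = induction_prediction(["A", "B", "C", "A"])
no_match = induction_prediction(["x", "y"])
print(prediction, no_match)
```

For the sequence A B C A, the rule finds the earlier A and predicts B, completing the repeated pattern.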

Virtual Attention Heads: Effective attention operations formed by the composition of attention heads across different layers, allowing later heads to attend to information moved by earlier heads
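The composition behind a virtual attention head can be sketched for two heads in consecutive layers. This example assumes both attention patterns are held fixed (in a real model the second pattern depends on what the first head wrote) and uses random matrices; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model = 4, 6  # illustrative sizes (assumption)

def random_causal_pattern(n_pos):
    """Random lower-triangular attention pattern whose rows sum to 1."""
    raw = np.tril(rng.random((n_pos, n_pos)))
    return raw / raw.sum(axis=1, keepdims=True)

X = rng.normal(size=(n, d_model))
A1, A2 = random_causal_pattern(n), random_causal_pattern(n)
W_OV1 = rng.normal(scale=0.3, size=(d_model, d_model))
W_OV2 = rng.normal(scale=0.3, size=(d_model, d_model))

# Layer 1 head writes to the stream; layer 2 head reads the updated stream.
resid_after_head1 = X + A1 @ X @ W_OV1
head2_out = A2 @ resid_after_head1 @ W_OV2

# The same output split into a direct path and a composed ("virtual") path.
direct = A2 @ X @ W_OV2
A_virtual = A2 @ A1            # effective attention pattern
W_OV_virtual = W_OV1 @ W_OV2   # effective OV circuit
virtual = A_virtual @ X @ W_OV_virtual

composition_holds = np.allclose(head2_out, direct + virtual)
print("head 2 output = direct term + virtual head term:", composition_holds)
```

The product A2 @ A1 is itself a valid causal attention pattern, which is why the composed term behaves like an additional head.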