DLA: Direct Logit Attribution, a technique that measures the direct contribution of a model component to the output logits by projecting that component's output onto the unembedding matrix
Residual Stream: The primary vector space in Transformers where information accumulates layer by layer via addition
Mixed Heads: Attention heads that attend to both subject and relation tokens, effectively performing two distinct additive updates simultaneously
Logit Lens: A technique for interpreting internal activations by applying the final unembedding matrix to the residual stream at intermediate layers to see which token the model would predict at that point
Subject Heads: Heads attending primarily to the subject to extract subject-related attributes (e.g., knowing Colosseum implies Rome)
Relation Heads: Heads attending primarily to the relation tokens to extract attributes consistent with the relation (e.g., knowing 'country of' implies a list of countries)
Reversal Curse: The phenomenon where LLMs trained on 'A is B' fail to generalize to answering 'B is A'
Unembedding: The final linear layer of a language model that maps the residual stream state to logits over the vocabulary, which a softmax then converts into a probability distribution
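
The entries for Residual Stream, Unembedding, DLA, and Logit Lens can be tied together with a small numeric sketch. It is illustrative only: the two-dimensional residual stream, the three-token vocabulary, the matrix W_U, and the component names and values are all made-up assumptions, and the final layer norm that real models apply before unembedding is omitted so that the projection stays exactly linear.

```python
import math

def matvec(vec, mat):
    """Multiply a row vector (d_model) by a matrix (d_model x vocab)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy model: d_model = 2, vocabulary of 3 tokens.
vocab = ["Rome", "Paris", "Italy"]
W_U = [[2.0, 0.0, 1.0],   # unembedding matrix (assumed values)
       [0.0, 2.0, 1.0]]

# Hypothetical outputs that each component adds to the residual stream
# at the final token position.
component_outputs = {
    "subject_head": [1.0, 0.0],
    "relation_head": [0.0, 0.5],
    "mlp": [0.5, 0.5],
}

# Residual stream: the elementwise sum of all component outputs.
residual = [sum(vals) for vals in zip(*component_outputs.values())]

# Unembedding: residual state -> logits; softmax -> probabilities.
logits = matvec(residual, W_U)
probs = softmax(logits)

# DLA: project each component's output onto W_U separately.
# Because unembedding is linear here, the per-component logits sum to the total.
dla = {name: matvec(out, W_U) for name, out in component_outputs.items()}

# Logit lens: unembed an intermediate residual state, here the state
# after only the first component has written to the stream.
intermediate = component_outputs["subject_head"]
lens_logits = matvec(intermediate, W_U)
lens_prediction = vocab[lens_logits.index(max(lens_logits))]

for name, contribution in dla.items():
    print(name, dict(zip(vocab, contribution)))
print("logit lens prediction:", lens_prediction)
```

In this toy setup the per-component DLA vectors add up to the full logits. With a real model, that additivity only holds if the final layer norm scale is treated as fixed, which is a common approximation rather than something this sketch demonstrates.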