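Causal Tracing (CT): A method for locating important model states: the input embeddings (typically those of the subject tokens) are corrupted with noise, and specific internal activations are then restored from the clean run to see which ones recover the correct prediction.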
Information Flow: Analysis of how information moves through the model, often using attention knockout (blocking edges) or logit lens (decoding intermediate states).
MLP: Multilayer Perceptron—the feed-forward sublayers in a Transformer block, often hypothesized to store factual knowledge.
PRISM: Precise Identification of Scenarios for Model behavior—the authors' proposed method for creating diagnostic datasets separating recall, heuristics, and guesswork.
Indirect Effect: The measure in Causal Tracing of how much restoring a single state raises the probability of the correct output, relative to the corrupted run with nothing restored.
ParaRel: A dataset of paraphrased relational facts used to test consistency and confidence.
CounterFact: A standard dataset for evaluating fact editing and recall; the authors argue it mixes different prediction scenarios.
LAMA: LAnguage Model Analysis, a probing dataset that tests factual knowledge in LMs using cloze-style queries.
Attention Knockout: An interpretability technique that zeroes out attention weights from specific tokens to measure how the output probability changes.
Logit Lens: A technique to interpret intermediate layer representations by projecting them into the vocabulary space to see what token they currently predict.
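
Three of the entries above (Indirect Effect, Attention Knockout, Logit Lens) describe small computations, which the following toy sketch illustrates. It is a minimal illustration in plain Python, not the authors' implementation: the function names, the three-word vocabulary, the unembedding matrix, and all numbers are hypothetical. The sketch assumes that knockout is applied by setting the blocked attention scores to negative infinity before the softmax, that the indirect effect is the restored-run probability minus the corrupted-run probability, and that the final layer normalization usually applied before the logit lens projection can be omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, blocked=()):
    """Attention weights for one query position.

    Key positions listed in `blocked` are knocked out by setting their
    score to -inf before the softmax, so their weight becomes exactly 0.
    Assumes at least one key position stays unblocked.
    """
    masked = [float("-inf") if i in blocked else s
              for i, s in enumerate(scores)]
    return softmax(masked)


def logit_lens(hidden, unembed, vocab):
    """Project a hidden state into vocabulary space; return (top token, probs).

    `unembed` has one row per vocabulary entry (hypothetical toy matrix).
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    probs = softmax(logits)
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs


def indirect_effect(p_restored, p_corrupted):
    """Probability gain from restoring one state in the corrupted run."""
    return p_restored - p_corrupted


# Attention knockout: block the edge from key position 0.
scores = [2.0, 1.0, 0.0]
normal = attention_weights(scores)
knocked = attention_weights(scores, blocked={0})

# Logit lens: decode two made-up intermediate states of the same position.
vocab = ["Paris", "Rome", "Berlin"]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
early_token, _ = logit_lens([0.2, 1.5], unembed, vocab)
late_token, _ = logit_lens([2.0, 0.1], unembed, vocab)

# Indirect effect with illustrative (made-up) probabilities.
ie = indirect_effect(p_restored=0.62, p_corrupted=0.07)
```

In this toy setup the knocked-out position receives zero weight and the remaining weights renormalize to sum to one, while the logit lens decodes a different top token from the early state than from the late one. A real analysis would read the attention scores, hidden states, and unembedding matrix from the model under study.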