CoT (Chain-of-Thought): A prompting technique where the model generates intermediate reasoning steps before producing the final answer
Mechanistic Interpretability: A research field aiming to reverse-engineer neural networks to understand the specific algorithms and circuits they implement
Activation Patching: A technique to localize model function by swapping internal activations between a clean run and a corrupted run to see if the output is restored
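A minimal sketch of activation patching using PyTorch forward hooks on a toy two-layer MLP (the model, inputs, and layer choice are all illustrative, not from the paper). The clean run's hidden activation is cached, then patched into a corrupted run; if this restores the clean output, the patched layer is implicated in the behavior.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model standing in for a transformer (illustrative only).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))

clean_input = torch.randn(1, 4)
corrupted_input = torch.randn(1, 4)

# 1. Clean run: cache the hidden activation at the chosen layer.
cache = {}
def save_hook(module, inp, out):
    cache["clean"] = out.detach()

handle = model[0].register_forward_hook(save_hook)
clean_logits = model(clean_input)
handle.remove()

# 2. Corrupted run: swap in the cached clean activation.
def patch_hook(module, inp, out):
    return cache["clean"]  # returning a value replaces the layer's output

handle = model[0].register_forward_hook(patch_hook)
patched_logits = model(corrupted_input)
handle.remove()

# 3. Measure restoration: here everything downstream of the patch sees
# the clean activation, so the clean output is fully restored.
restored = torch.allclose(patched_logits, clean_logits)
print(restored)  # True
```

In real experiments the patch targets a single head or layer at one token position, and restoration is measured on the logit difference between the correct and incorrect answer rather than on exact equality.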
Functional Rift: The paper's term for the observation that early model layers and late model layers perform distinct, almost disjoint types of processing (ontology mapping vs. answer generation)
Induction Heads: Attention heads that copy patterns from the context (e.g., 'A followed B before, so predict B after A'), crucial for in-context learning
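The copying algorithm attributed to induction heads can be sketched in plain Python (a conceptual analogy only, not how attention computes it): remember each bigram seen so far, then predict the token that previously followed the current one.

```python
def induction_predict(tokens):
    """Predict the next token via the induction pattern:
    'A was followed by B before, so predict B after A'."""
    followed_by = {}
    for a, b in zip(tokens, tokens[1:]):
        followed_by[a] = b          # record that b followed a
    return followed_by.get(tokens[-1])  # what followed the last token earlier?

seq = ["cat", "sat", "mat", "cat"]
print(induction_predict(seq))  # -> sat
```

In a transformer this behavior emerges from a two-head circuit (a previous-token head composing with the induction head itself) rather than an explicit lookup table.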
Logit Lens: A method to interpret intermediate layer activations by projecting them into the vocabulary space to see what token they would predict if the model stopped there
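The logit lens reduces to one matrix product: project an intermediate residual-stream vector through the unembedding matrix and read off the top token. A minimal NumPy sketch with made-up dimensions (hidden size 4, vocabulary of 5 tokens); real models would also apply the final layer norm before unembedding.

```python
import numpy as np

rng = np.random.default_rng(0)
W_U = rng.normal(size=(4, 5))  # unembedding matrix: hidden dim -> vocab logits

def logit_lens(hidden_state, unembed):
    """Project an intermediate activation into vocabulary space and
    return the token id the model 'would predict' if it stopped here."""
    logits = hidden_state @ unembed
    return int(np.argmax(logits))

h_layer3 = rng.normal(size=4)  # pretend residual-stream vector from layer 3
token_id = logit_lens(h_layer3, W_U)
print(token_id)
```

Applying this at every layer shows how the model's running "best guess" evolves with depth, which is what makes the lens useful for localizing where an answer forms.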
Hydra Effect: The phenomenon where removing one component of a model causes other components to compensate, making it difficult to isolate specific functions
PrOntoQA: A dataset of synthetic reasoning problems based on fictional ontologies (e.g., 'Numpuses are rompuses') used to test logical capacity without interference from real-world facts