Logit Lens: An interpretability technique that projects intermediate-layer representations into vocabulary space to show which tokens the model would predict at that layer
Activation Patching: A method to causally test which model components matter by swapping activations between a clean run and a corrupted run
Residual Stream: The primary data path in a Transformer where layers add their outputs; the 'highway' of information flow
English-centric mechanism: The phenomenon where multilingual models represent concepts primarily in terms of English embeddings at intermediate layers before translating to the target language
Translation Difference Vector: A steering vector calculated by subtracting the mean activation of fact-recall prompts from the mean activation of explicit translation prompts
In-Context Learning (ICL): Providing examples in the prompt to demonstrate the task; here used to derive a steering vector, not just for prompting
Conversion: The internal process by which the model translates its intermediate English concept into the target-language token during generation
MLP: Multilayer Perceptron; the feed-forward sub-layers in a Transformer, often associated with storing factual knowledge
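
Two of the definitions above (Translation Difference Vector and Logit Lens) reduce to simple vector arithmetic, which the following minimal sketch illustrates on toy data. It is an illustration under stated assumptions, not the original method's implementation: the function names, the two-dimensional activations, and the two-word vocabulary are all hypothetical, and the final layer normalization that is usually applied before the unembedding step is omitted for brevity.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def translation_difference_vector(
    translation_acts: List[Vector], recall_acts: List[Vector]
) -> Vector:
    """Mean activation on translation prompts minus mean on fact-recall prompts."""
    mean_t = mean_vector(translation_acts)
    mean_r = mean_vector(recall_acts)
    return [t - r for t, r in zip(mean_t, mean_r)]


def logit_lens(hidden: Vector, unembed: List[Vector], vocab: List[str]) -> str:
    """Project a hidden state through the unembedding matrix; return the top token.

    `unembed` holds one row per vocabulary entry (hypothetical toy layout).
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    best = max(range(len(logits)), key=logits.__getitem__)
    return vocab[best]


# Toy activations (made up for illustration, not measured from any model).
translation_acts = [[1.0, 2.0], [3.0, 4.0]]
recall_acts = [[0.0, 1.0], [2.0, 1.0]]
steer = translation_difference_vector(translation_acts, recall_acts)

# Toy unembedding: dimension 0 favors "cat", dimension 1 favors "chat".
vocab = ["cat", "chat"]
unembed = [[1.0, 0.0], [0.0, 1.0]]

hidden = [0.9, 0.2]                                # residual-stream state at some layer
steered = [h + s for h, s in zip(hidden, steer)]   # add the steering vector

before = logit_lens(hidden, unembed, vocab)
after = logit_lens(steered, unembed, vocab)
```

In this toy setup the unsteered state decodes to the English token and the steered state to the target-language token, which mirrors the Conversion step described above; with a real model the activations would be read from the residual stream at a chosen layer.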