DPO: Direct Preference Optimization—a fine-tuning method that aligns models to human preferences by optimizing the policy directly on preference pairs, without training a separate reward model
MLP: Multilayer Perceptron—the feed-forward sub-layers in Transformer models where much of the 'knowledge' processing is hypothesized to occur
Activation Patching: A causal analysis technique where specific internal model activations are swapped with those from a different run (e.g., post-DPO) to measure their effect on the output
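A minimal sketch of the idea, using a made-up two-layer toy model rather than a real Transformer: the hidden activation from one run is swapped into another run, and the change in output measures that activation's causal effect. All weights and inputs here are illustrative.

```python
import numpy as np

# Toy "model": hidden = tanh(x @ W1); output = hidden @ W2.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))

def forward(x, patched_hidden=None):
    hidden = np.tanh(x @ W1)
    if patched_hidden is not None:
        hidden = patched_hidden        # the causal intervention
    return hidden @ W2, hidden

x_clean, x_other = rng.normal(size=4), rng.normal(size=4)
out_clean, _ = forward(x_clean)                  # baseline run
_, hidden_other = forward(x_other)               # record donor activation
out_patched, _ = forward(x_clean, patched_hidden=hidden_other)

# Size of the patch's effect on the output:
effect = np.linalg.norm(out_patched - out_clean)
```

In a real experiment the donor activation would come from, e.g., the post-DPO model run on the same prompt, and the effect would be measured on a task-relevant logit rather than a norm.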
Linear Probe: A simple linear classifier trained on internal model states to identify specific features (like toxicity)
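The following sketch trains such a probe with plain logistic regression. The data is synthetic: class-1 "hidden states" are shifted along a fixed direction, standing in for a toxicity feature; no real model states are involved.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 16, 200
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X = np.vstack([rng.normal(size=(n, d)),                     # label 0
               rng.normal(size=(n, d)) + 4.0 * direction])  # label 1
y = np.concatenate([np.zeros(n), np.ones(n)])

# Logistic-regression probe trained by batch gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * (X.T @ (p - y)) / len(y)
    b -= 0.1 * np.mean(p - y)

acc = float(np.mean(((X @ w + b) > 0) == y))   # probe accuracy
```

The learned weight vector `w` then doubles as a candidate feature direction, which is one common way steering vectors are obtained.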
LogitLens: A technique to interpret internal vectors by projecting them directly into the vocabulary space using the model's output embedding matrix
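A toy version of the projection step: an intermediate residual-stream vector is multiplied by the output embedding ("unembedding") matrix and the top token is read off. The five-word vocabulary and near-identity weights are made up for illustration.

```python
import numpy as np

vocab = ["cat", "dog", "the", "ran", "sat"]
rng = np.random.default_rng(2)
# Hypothetical unembedding matrix, d_model == vocab size for simplicity.
W_U = np.eye(len(vocab)) + 0.05 * rng.normal(size=(len(vocab), len(vocab)))

hidden = W_U[:, vocab.index("dog")]        # a state aligned with "dog"
logits = hidden @ W_U                      # project into vocabulary space
probs = np.exp(logits - logits.max())
probs /= probs.sum()                       # softmax over the vocabulary
top_token = vocab[int(np.argmax(probs))]
```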
RealToxicityPrompts: A dataset of prompts designed to trigger toxic continuations from language models, used here as a stress test for safety
Perplexity: A measurement of how well a probability model predicts a sample; lower perplexity indicates the model is less 'surprised' by the text (better language quality)
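Concretely, perplexity is the exponentiated average negative log-likelihood of the tokens, i.e. the inverse geometric mean of the per-token probabilities. The probabilities below are made up; a real model would supply them.

```python
import math

token_probs = [0.25, 0.5, 0.1, 0.4]   # hypothetical p(token_i | context)
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)        # ≈ 3.76
```

As a sanity check, if every token had probability p the perplexity would be exactly 1/p.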
F1 score: In this context, a metric measuring the overlap of generated text with reference text to assess general language capability preservation
Value Vector: The output-side weight vector associated with a single MLP neuron (a row of the down-projection matrix); the neuron contributes its scalar activation times this vector to the residual stream
GLU: Gated Linear Unit—a variant of the MLP layer used in modern LLMs (Llama, Mistral) that uses element-wise gating
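A sketch of a GLU-style MLP forward pass, with made-up shapes and random weights, showing both the element-wise gating and how the output decomposes into per-neuron value-vector contributions:

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_ff = 4, 6                       # toy dimensions
W_gate = rng.normal(size=(d_model, d_ff))
W_up   = rng.normal(size=(d_model, d_ff))
W_down = rng.normal(size=(d_ff, d_model))  # row i = value vector of neuron i

def silu(z):
    return z / (1.0 + np.exp(-z))

x = rng.normal(size=d_model)
acts = silu(x @ W_gate) * (x @ W_up)       # element-wise gating
out = acts @ W_down                        # MLP output into residual stream

# Identical output, written as a sum of activation-weighted value vectors:
out_from_values = sum(a * v for a, v in zip(acts, W_down))
```

This decomposition is why individual value vectors can be inspected (e.g. via LogitLens) or edited independently.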
Activation Steering: Modifying the model's internal activations at inference time (usually by subtracting a vector) to change behavior without retraining weights
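A minimal steering sketch, assuming a known "toxicity direction" (illustrative here; in practice it might come from a linear probe's weights or a difference of means): the component of a residual-stream activation along that direction is subtracted at inference time.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 8
toxicity_dir = rng.normal(size=d)          # hypothetical feature direction
toxicity_dir /= np.linalg.norm(toxicity_dir)

def steer(hidden, direction, alpha=1.0):
    # Subtract alpha times the component of `hidden` along `direction`.
    return hidden - alpha * (hidden @ direction) * direction

h = rng.normal(size=d)                     # stand-in residual-stream state
h_steered = steer(h, toxicity_dir)
residual = h_steered @ toxicity_dir        # component left along the direction
```

With `alpha=1.0` the steered state is orthogonal to the direction; smaller `alpha` values attenuate the feature instead of removing it, which is often gentler on fluency.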