SAE: Sparse Autoencoder, an unsupervised learning model used to decompose dense neural network representations into sparse, interpretable features (latents)
residual stream: The primary vector pathway in a Transformer where information is added by attention and MLP layers
activation steering: Intervention technique where a specific vector (feature direction) is added to the model's internal activations to influence behavior
knowledge refusal: When a model declines to answer a query because it lacks the necessary factual information
JumpReLU: An activation function for SAEs that zeroes out pre-activation values below a learned threshold and passes values above it through unchanged
logit difference: The difference in prediction scores between two competing tokens (e.g., 'Yes' vs 'No'), used to measure model preference
activation patching: A technique to isolate the causal effect of specific model components by swapping activations between two different runs (e.g., known vs. unknown entity inputs)
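
The operational terms above (JumpReLU, activation steering, logit difference, activation patching) can be illustrated with a minimal pure-Python sketch on toy vectors. All function names, the dictionary-based representation of a run's activations, and the numeric values are hypothetical choices made for illustration, not part of any library or of the source's method. The strict `>` comparison at the threshold boundary is also an assumption.

```python
def jump_relu(pre_acts, thresholds):
    """Zero each value that does not exceed its threshold; pass the rest unchanged."""
    return [z if z > t else 0.0 for z, t in zip(pre_acts, thresholds)]


def steer(activation, direction, scale):
    """Activation steering: add a scaled feature direction to an activation vector."""
    return [a + scale * d for a, d in zip(activation, direction)]


def logit_difference(logits, token_a, token_b):
    """Preference for token_a over token_b, measured as a difference of logits."""
    return logits[token_a] - logits[token_b]


def patch(target_run, source_run, component):
    """Activation patching: copy one component's activation from source_run
    into a copy of target_run, leaving every other component untouched."""
    patched = dict(target_run)
    patched[component] = source_run[component]
    return patched


# Toy values, chosen only to make the arithmetic easy to follow.
latents = jump_relu([0.2, 1.5, -0.3], [0.5, 0.5, 0.5])
steered = steer([1.0, 2.0], [0.5, -1.0], scale=2.0)
preference = logit_difference({"Yes": 3.0, "No": 1.0}, "Yes", "No")

# Hypothetical per-component activations for two runs of the same model.
known_run = {"attn_0": [0.1, 0.9], "mlp_0": [0.4, 0.4]}
unknown_run = {"attn_0": [0.7, 0.2], "mlp_0": [0.3, 0.5]}
patched_run = patch(unknown_run, known_run, "attn_0")

print(latents)      # [0.0, 1.5, 0.0]
print(steered)      # [2.0, 0.0]
print(preference)   # 2.0
print(patched_run)  # {'attn_0': [0.1, 0.9], 'mlp_0': [0.3, 0.5]}
```

In a real experiment, the patched activations would be fed forward through the rest of the model, and the resulting change in logit difference would be attributed to the patched component.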