Feed-Forward Layer (FFN): The position-wise processing block in a Transformer layer, consisting of two linear transformations with a non-linearity in between.
Key-Value Memory: A mechanism where an input is compared against 'keys' to compute weights, which are then used to retrieve a weighted sum of 'values'.
Trigger Example: A training example (text prefix) that results in the highest activation coefficient for a specific neuron/key.
Residual Connection: A skip connection that adds the input of a layer to its output, allowing gradients to flow more easily and information to be preserved.
Memory Coefficient: The scalar activation of a single key, obtained by taking the dot product of the input with that key vector and then applying the non-linearity.
Softmax: A function that converts a vector of real numbers into a probability distribution by exponentiating each entry and dividing by the sum of the exponentials.
ReLU: Rectified Linear Unit, a non-linear activation function that outputs the input if positive, otherwise zero.
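
To show how several of the terms above fit together, here is a minimal sketch in plain Python. It assumes the simplified form FFN(x) = ReLU(x · K^T) · V, with biases and layer normalization omitted. All function names and the toy numbers are hypothetical and chosen only for illustration; they do not come from any particular library or model.

```python
import math

def relu(x):
    # ReLU: keep positive entries, replace the rest with zero.
    return [max(0.0, v) for v in x]

def softmax(x):
    # Softmax: exponentiate (shifted by the max for stability) and normalize.
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def ffn_memory(x, keys, values):
    # Memory coefficients: one non-negative scalar per key.
    coeffs = relu([dot(x, k) for k in keys])
    # Output: sum of the value vectors, weighted by the coefficients.
    dim = len(values[0])
    out = [sum(c * v[j] for c, v in zip(coeffs, values)) for j in range(dim)]
    return coeffs, out

def ffn_with_residual(x, keys, values):
    # Residual connection: add the layer input back to the layer output.
    _, out = ffn_memory(x, keys, values)
    return [xi + oi for xi, oi in zip(x, out)]

def trigger_index(inputs, key):
    # Index of the input with the highest memory coefficient for one key,
    # a toy stand-in for finding that key's trigger example.
    scores = [max(0.0, dot(x, key)) for x in inputs]
    return scores.index(max(scores))

# Toy memory with three key/value pairs (illustrative values only).
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
values = [[0.5, 0.0], [0.0, 2.0], [3.0, 3.0]]
x = [2.0, -1.0]

coeffs, out = ffn_memory(x, keys, values)
print(coeffs)                              # [2.0, 0.0, 0.0]
print(out)                                 # [1.0, 0.0]
print(ffn_with_residual(x, keys, values))  # [3.0, -1.0]
```

In this toy case only the first key fires, so the output is simply twice the first value vector, and the residual connection then adds the original input back. Note that the coefficients come from ReLU rather than softmax, so they are not normalized to sum to 1.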