Associative Memory: A mechanism that stores key-value pairs and retrieves values based on a query key, often implemented here via kernel smoothing
Nadaraya-Watson estimator: A non-parametric regression method that estimates a conditional expectation as a weighted average of observed values, using a kernel function to determine weights
Predictive Disentanglement: The phenomenon where a model spontaneously decomposes a complex prediction task into independent, simpler sub-tasks assigned to different heads/units during training
Induction Head: A circuit in Transformers that copies information from previous occurrences of a token pattern (e.g., [A][B] ... [A] -> predict [B])
Exchangeability: The property that the order in which key-value pairs are stored in memory does not affect the retrieval outcome
Meta-learning: In this context, training learns *how* to construct keys and values (the learning algorithm), while inference is the *application* of that learned rule to specific data
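
To make the Associative Memory, Nadaraya-Watson estimator, and Exchangeability entries concrete, here is a minimal sketch of kernel-smoothed retrieval. The function names (`gaussian_kernel`, `retrieve`), the Gaussian kernel, the `bandwidth` parameter, and the toy data are illustrative assumptions, not taken from the source, which only says that retrieval is implemented via kernel smoothing.

```python
import math

def gaussian_kernel(query, key, bandwidth=1.0):
    """Similarity between a query and a stored key (Gaussian kernel, an assumed choice)."""
    sq_dist = sum((q - k) ** 2 for q, k in zip(query, key))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def retrieve(query, memory, bandwidth=1.0):
    """Nadaraya-Watson retrieval: kernel-weighted average of the stored values."""
    weights = [gaussian_kernel(query, key, bandwidth) for key, _ in memory]
    total = sum(weights)
    dim = len(memory[0][1])
    return [
        sum(w * value[i] for w, (_, value) in zip(weights, memory)) / total
        for i in range(dim)
    ]

# Hypothetical toy memory of (key, value) pairs.
memory = [
    ([0.0, 0.0], [1.0]),
    ([1.0, 0.0], [2.0]),
    ([0.0, 1.0], [3.0]),
]
query = [0.9, 0.1]

out = retrieve(query, memory, bandwidth=0.3)

# Exchangeability: reordering the stored pairs leaves the result unchanged
# (up to floating-point rounding), because retrieval is a sum over pairs.
out_shuffled = retrieve(query, list(reversed(memory)), bandwidth=0.3)
```

Because the query lies closest to the key `[1.0, 0.0]`, the retrieved value is pulled toward that key's stored value of 2.0, and a smaller `bandwidth` makes the retrieval behave more like a nearest-neighbour lookup.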