Associative Memory: A component that stores key-value pairs and retrieves a value via kernel regression (a weighted sum of stored values, weighted by similarity between the query and each stored key); used in place of attention
Memory Mosaics: A neural architecture constructed from networks of associative memories instead of attention heads
Kernel Regression: A non-parametric statistical method used here for memory retrieval, estimating values by smoothing over stored examples with a bandwidth parameter
Bandwidth: A parameter (beta) in the Gaussian kernel that controls how sharp or broad the retrieval focus is; it plays a role analogous to the (inverse) temperature in softmax
Induction Head: A mechanism by which a model learns to copy the token that followed a specific pattern earlier in the context (e.g., if A followed B before, predict A the next time B appears)
Persistent Memory: The component in Memory Mosaics replacing the Feed-Forward Network (FFN), representing global static knowledge stored in weights
SwiGLU: Swish-Gated Linear Unit, a widely used activation function in modern LLMs (like Llama) for feed-forward layers
RULER benchmark: A benchmark for evaluating long-context models using tasks with high information entropy, such as 'needle-in-a-haystack' variations
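
To make the Associative Memory, Kernel Regression, and Bandwidth entries concrete, here is a minimal sketch of retrieval by Gaussian kernel regression. It assumes a kernel of the form exp(-beta * ||q - k_i||^2); the function name `retrieve` and the toy keys and values are hypothetical and chosen only for illustration, not taken from the Memory Mosaics implementation.

```python
import math

def retrieve(keys, values, query, beta):
    """Gaussian-kernel (Nadaraya-Watson) estimate of the value for `query`.

    Assumed kernel: exp(-beta * ||query - key||^2). Larger beta means
    sharper retrieval; beta = 0 averages all stored values equally.
    """
    # Squared Euclidean distance from the query to each stored key.
    sq_dists = [sum((q - k) ** 2 for q, k in zip(query, key)) for key in keys]
    # Shift by the smallest distance before exponentiating for numerical
    # stability; the shift cancels in the normalisation below.
    d_min = min(sq_dists)
    weights = [math.exp(-beta * (d - d_min)) for d in sq_dists]
    total = sum(weights)
    dim = len(values[0])
    # Weighted average of the stored values, one output coordinate at a time.
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

# Toy memory: three stored key-value pairs (illustrative data only).
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
query = [0.9, 0.1]  # closest to the first key

sharp = retrieve(keys, values, query, beta=50.0)  # dominated by the nearest key
broad = retrieve(keys, values, query, beta=0.0)   # plain mean of all values
```

With a large beta the result is essentially the value stored under the nearest key, while beta = 0 returns the unweighted mean of all stored values. When the query and keys have unit norm, the squared distance equals 2 - 2 * (q . k), so these weights reduce to a softmax over dot products scaled by 2 * beta, which is the sense in which beta behaves like a softmax temperature setting.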