Implicit Memory: Knowledge stored within the neural network's trainable parameters (weights)
Working Memory: The transient state stored in the context key-value cache during the processing of the current input sequence
Explicit Memory: The proposed memory format, consisting of sparse attention key-value pairs derived from text, stored externally, and retrieved during inference
Knowledge Traversal: The inefficiency where an LLM activates all its parameters (and thus all stored knowledge) just to generate a single token
Top-k: A selection algorithm that keeps only the 'k' elements with the highest scores
MIPS: Maximum Inner Product Search—a technique to find vectors in a database that have the highest dot product with a query vector
Faiss: A library for efficient similarity search and clustering of dense vectors
RAG: Retrieval-Augmented Generation—enhancing models by retrieving relevant text chunks before generation
Perplexity: A measure of how well a probability model predicts a sample, computed as the exponential of the average negative log-likelihood per token; lower values indicate better performance
KV Cache: Key-Value Cache, which stores the attention keys and values computed for earlier tokens so that a Transformer does not recompute them at each step of sequential generation
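
To make the Top-k, MIPS, and Perplexity entries above concrete, here is a minimal NumPy sketch. It is an illustration only, not the method used in the work this glossary belongs to: the function names `top_k_mips` and `perplexity` and the toy vectors are hypothetical. The search is a brute-force scan that assumes k >= 1; a library such as Faiss would take its place for large collections.

```python
import numpy as np


def top_k_mips(query, database, k):
    """Brute-force MIPS: return indices and scores of the k rows of
    `database` with the largest inner product with `query` (assumes k >= 1)."""
    scores = database @ query                  # inner product with every stored vector
    k = min(k, scores.shape[0])
    # argpartition places the k largest scores first, in arbitrary order
    candidates = np.argpartition(-scores, k - 1)[:k]
    # sort only those k candidates from highest to lowest score
    top = candidates[np.argsort(-scores[candidates])]
    return top, scores[top]


def perplexity(token_probs):
    """Exponential of the mean negative log-probability that the model
    assigned to each observed token."""
    probs = np.asarray(token_probs, dtype=np.float64)
    return float(np.exp(-np.mean(np.log(probs))))


# Toy data (hypothetical): four 2-dimensional vectors and one query
database = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, -1.0]])
query = np.array([1.0, 0.5])

indices, scores = top_k_mips(query, database, k=2)
print(indices, scores)   # inner products are [1.0, 0.5, 3.0, -1.5], so rows 2 and 0 win

# A model that gives every observed token probability 0.25 has perplexity of about 4
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

The scan costs one dot product per stored vector, which is fine for a toy example. Retrieval over a large external store would normally use an exact or approximate inner-product index instead.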