RETRO: Retrieval-Enhanced Transformer—the proposed architecture that retrieves text chunks to augment generation
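Chunked Cross-Attention (CCA): A mechanism that splits the input into fixed-size chunks and lets tokens attend to the text chunks retrieved for the preceding input chunk, so retrieval never depends on tokens that have not yet been generated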
ScaNN: Scalable Nearest Neighbors, a library for efficient approximate vector similarity search, used to query the retrieval database
MassiveText: A large multilingual text dataset (5 trillion tokens) used for training and constructing the retrieval database
bpb: Bits-per-byte—a metric for language modeling performance, independent of the tokenizer vocabulary size
leakage: When evaluation data is inadvertently present in the training set, artificially inflating performance scores
frozen retriever: Using a pre-trained embedding model (like BERT) that is not updated during the training of the main language model
The Pile: A diverse, open-source language modeling dataset consisting of 22 smaller datasets (e.g., PubMed, ArXiv, GitHub)
autoregressivity: The property where a model predicts each token based solely on the tokens before it, maintaining causal order
DPR: Dense Passage Retrieval—a method using dual encoders to retrieve relevant documents for open-domain question answering
perplexity: A measurement of how well a probability model predicts a sample; lower values indicate better performance
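
Worked example for bpb and perplexity: the sketch below shows how both metrics follow from the same per-token log-probabilities. It is a minimal illustration, not code from the RETRO paper; the function names and the toy inputs (four tokens, each with probability 0.25, covering eight bytes of text) are assumptions made for the example.

```python
import math

def perplexity(token_logprobs):
    """Exponential of the average negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def bits_per_byte(token_logprobs, text):
    """Total negative log-likelihood in bits, divided by the UTF-8 byte count of the text."""
    total_bits = -sum(token_logprobs) / math.log(2)
    return total_bits / len(text.encode("utf-8"))

# Hypothetical inputs: 4 tokens, each assigned probability 0.25, covering 8 bytes.
logprobs = [math.log(0.25)] * 4
text = "abcdefgh"

ppl = perplexity(logprobs)           # about 4.0: as uncertain as a uniform choice among 4 options
bpb = bits_per_byte(logprobs, text)  # about 1.0: 8 bits of loss spread over 8 bytes
print(ppl, bpb)
```

Because bpb divides by bytes rather than tokens, a tokenizer that splits the same text into more or fewer tokens changes perplexity but leaves the denominator of bpb unchanged, which is why bpb is comparable across tokenizers.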