n-gram: A contiguous sequence of n items (tokens) from a given sample of text or speech
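As a minimal illustration (a sketch, not from the source), extracting all contiguous n-token windows from a token list:

```python
def ngrams(tokens, n):
    """Return every contiguous n-token subsequence of `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Bigrams (n=2) of a four-token sequence:
print(ngrams(["the", "cat", "sat", "down"], 2))
# → [('the', 'cat'), ('cat', 'sat'), ('sat', 'down')]
```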
verbatim completion: When a language model generates the exact suffix of a text sequence when prompted with its prefix
lingering sequences: Text sequences that a model can still complete verbatim even after they have been explicitly filtered out of the training dataset
membership inference: The task of determining whether a specific data point was used to train a machine learning model
BPE: Byte-Pair Encoding—a tokenization method that iteratively merges the most frequent pair of bytes (or characters) into a single new token
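A toy sketch of one BPE merge step (illustrative only; real tokenizers operate on bytes, cache merges, and handle word boundaries):

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across all words; return the most common."""
    pairs = Counter()
    for symbols in corpus:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = []
    for symbols in corpus:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("hug"), list("pug"), list("hugs")]
pair = most_frequent_pair(corpus)    # ('u', 'g') occurs 3 times
print(merge_pair(corpus, pair))
# → [['h', 'ug'], ['p', 'ug'], ['h', 'ug', 's']]
```

Training a full BPE vocabulary just repeats this step until a target number of merges is reached.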
token dropout: An adversarial technique where random tokens in a sequence are masked/dropped to prevent n-gram overlap while retaining semantic information
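One plausible minimal form of this idea (an assumption on my part; the actual adversarial procedure may mask tokens or choose them non-uniformly) is to drop each token independently with some probability, which breaks long n-gram matches while leaving most of the content intact:

```python
import random

def token_dropout(tokens, p=0.1, rng=None):
    """Drop each token independently with probability p.
    Hypothetical sketch: any surviving n-gram overlap with the original
    shrinks, while ~(1-p) of the tokens (and most meaning) remain."""
    rng = rng or random.Random()
    return [t for t in tokens if rng.random() >= p]
```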
MinHash: An algorithm used to estimate the similarity of two sets (like documents) quickly, often used for approximate deduplication
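A small sketch of the idea (seeded hashes stand in for random permutations; production systems use faster hashes and banded LSH on top):

```python
import hashlib

def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over the set.
    Two sets agree in a given slot with probability equal to their
    Jaccard similarity."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)
            for x in items
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

For deduplication, documents are typically shingled into n-grams first, and signature collisions flag near-duplicate pairs.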
suffix array: A data structure holding the lexicographically sorted suffixes of a text corpus, enabling efficient exact substring search; used for exact deduplication
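A minimal sketch (the naive O(n² log n) construction; practical corpus-scale tools use linear-time construction over bytes):

```python
def build_suffix_array(text):
    """Indices of all suffixes of `text`, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text, sa, query):
    """Binary-search the sorted suffixes for one that starts with `query`."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:sa[lo] + len(query)] == query

sa = build_suffix_array("banana")
print(contains("banana", sa, "nan"))  # → True
print(contains("banana", sa, "nab"))  # → False
```

The same binary search generalizes to counting or locating every occurrence of a substring, which is what makes exact-match deduplication over a large corpus tractable.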
output suppression: A goal of machine unlearning in which the model is prevented from generating specific sequences (e.g., harmful content)