suffix array (SA): A data structure that stores the starting positions of all suffixes of a text, sorted by the lexicographical order of those suffixes, allowing fast substring search via binary search
infini-gram: A search engine built on suffix arrays that efficiently counts and locates query strings in massive corpora (trillions of tokens)
maximal matching span: A sequence of tokens in the output that appears in the training data and cannot be extended left or right while maintaining a match
span unigram probability: The product of the unigram probabilities of all tokens in a span; used to measure how 'surprising' or unique a span is
BM25: A ranking function used in information retrieval to estimate the relevance of documents to a given search query based on term frequency and document length
LCP: Longest Common Prefix—the length of the shared initial sequence between two strings
SFT: Supervised Fine-Tuning—a training phase using labeled instruction-following data
DPO: Direct Preference Optimization—a training method to align models with human preferences using paired data
RLVR: Reinforcement Learning with Verifiable Rewards, a post-training method that rewards the model when its output can be checked automatically for correctness (e.g., math or logic answers)
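
To make the suffix array, LCP, and span unigram probability entries concrete, here is a minimal Python sketch. It is not the infini-gram implementation: the function names are hypothetical, the construction is deliberately naive, and the unigram probabilities are made-up illustrative values.

```python
import math
from bisect import bisect_left, bisect_right


def build_suffix_array(tokens):
    """Return the start positions of all suffixes, sorted lexicographically.

    Naive construction for illustration only; real systems use far more
    efficient algorithms and on-disk storage.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_occurrences(tokens, sa, query):
    """Count occurrences of `query` using binary search in suffix order.

    For clarity this materializes the length-len(query) prefix of every
    suffix; truncating sorted suffixes keeps them sorted, so bisect applies.
    """
    m = len(query)
    prefixes = [tokens[i:i + m] for i in sa]
    return bisect_right(prefixes, query) - bisect_left(prefixes, query)


def lcp(a, b):
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def span_unigram_probability(span, unigram_probs):
    """Product of the unigram probabilities of the tokens in `span`."""
    return math.prod(unigram_probs[t] for t in span)


# Toy corpus and hypothetical unigram probabilities (assumed values).
corpus = "a b a b c".split()
sa = build_suffix_array(corpus)
toy_probs = {"a": 0.5, "b": 0.25, "c": 0.25}
```

On this toy corpus, `count_occurrences(corpus, sa, ["a", "b"])` finds the two occurrences of "a b", and a lower `span_unigram_probability` indicates a rarer, more distinctive span.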