BPE: Byte-Pair Encoding—an iterative algorithm that repeatedly merges the most frequent pair of adjacent tokens into a new token, stopping when a target vocabulary size is reached
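A minimal sketch of one BPE merge iteration (the helper names `most_frequent_pair` and `merge_pair` are illustrative, not from any particular library):

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count every adjacent token pair and return the most frequent one
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    # Replace each occurrence of `pair` with a single merged token
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("banana")
pair = most_frequent_pair(tokens)      # ('a', 'n') occurs twice
tokens = merge_pair(tokens, pair)      # ['b', 'an', 'an', 'a']
```

Running this loop until the vocabulary hits the target size yields a standard subword BPE tokenizer; SuperBPE's Stage 2 lifts the pretokenization restriction so merges can also cross whitespace.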
superword: A token that spans whitespace boundaries, containing parts of multiple words or a complete multi-word phrase (e.g., 'by the way')
subword: A token that is part of a word or a whole word, but strictly bounded by whitespace (standard in modern LLMs)
pretokenization: The step of splitting text into chunks (usually by whitespace) before the main tokenization algorithm runs, preventing merges across those chunks
transition point: The vocabulary size threshold t where SuperBPE switches from learning subwords (Stage 1) to learning superwords (Stage 2)
BPB: Bits-Per-Byte—a metric for language modeling loss normalized by the text length in bytes, allowing comparison between tokenizers with different compression rates
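Since different tokenizers split the same text into different numbers of tokens, per-token loss is not comparable across them; BPB converts total loss to bits and normalizes by the byte length of the text instead. A small sketch of the standard conversion, assuming the model reports mean cross-entropy loss in nats per token:

```python
import math

def bits_per_byte(loss_nats_per_token, num_tokens, num_bytes):
    # Total loss in nats -> bits (divide by ln 2), then normalize by bytes
    total_bits = loss_nats_per_token * num_tokens / math.log(2)
    return total_bits / num_bytes

# e.g., 2.0 nats/token over 100 tokens covering 400 bytes of text
bpb = bits_per_byte(2.0, 100, 400)   # ≈ 0.721
```

A tokenizer with stronger compression (fewer tokens per byte) can have a higher per-token loss yet a lower BPB, which is exactly the comparison BPB is designed to make fair.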
MMLU: Massive Multitask Language Understanding—a benchmark covering 57 subjects across STEM, the humanities, and social sciences
FLOPs: Floating Point Operations—a measure of compute cost