byte premium: The inefficiency where non-Latin characters require more bytes in UTF-8 (e.g., 3 bytes for most Chinese characters) than basic Latin (ASCII) characters (1 byte), inflating sequence lengths.
negative transfer: A phenomenon where learning one language degrades performance in another, often due to parameter interference or limited capacity.
typological distance: A measure of structural difference between languages based on features like word order, morphology, and phonology.
agglutinative: Languages that form words by stringing together many morphemes with distinct meanings (e.g., Turkish, Finnish), often leading to long words.
surprisal: A measure of unpredictability; the negative log probability of a token given its context.
BPE: Byte-Pair Encoding, a tokenization algorithm that starts from individual characters or bytes and iteratively merges the most frequent adjacent pair of symbols into a new vocabulary entry.
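isochrony: As used here, the tendency of languages to maintain a roughly constant rate of information transmission (bits per second) despite structural differences. In phonology, the term more commonly refers to rhythmic regularity in the timing of speech.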
UNK: Unknown token, a placeholder substituted when a tokenizer encounters a character or subword that is not in its vocabulary.
morpheme: The smallest meaningful unit in a language (e.g., 'un-', 'break', '-able').
perplexity: A metric measuring how well a probability model predicts a sample, computed as the exponentiated average surprisal per token; lower values indicate better prediction.
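
The byte premium entry can be checked directly with Python's built-in UTF-8 encoder. This is only a minimal sketch: the sample strings and the helper name `utf8_bytes_per_char` are illustrative assumptions, not part of the glossary.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    return len(text.encode("utf-8")) / len(text)

# Illustrative samples (assumed for this sketch): ASCII, Cyrillic, Chinese.
samples = {
    "English": "hello",
    "Russian": "привет",
    "Chinese": "你好",
}

for name, text in samples.items():
    print(f"{name}: {utf8_bytes_per_char(text):.1f} bytes per character")
```

The same content therefore costs more bytes in some scripts than in others, which is what inflates sequence lengths for byte-based vocabularies.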
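
The merge loop described in the BPE entry might be sketched as follows. This is a simplified toy version under stated assumptions: words are pre-split and paired with frequencies, symbols start as single characters, and the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) are hypothetical. Production tokenizers add details such as byte fallback and end-of-word markers.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} mapping."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words
```

Calling `learn_bpe({"abab": 3, "ab": 2}, 1)` merges the pair `("a", "b")` first, since it is the most frequent adjacent pair in that toy corpus.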
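
Surprisal and perplexity are closely related: under the standard definition, perplexity is the exponentiated mean surprisal over a sequence. The sketch below assumes the per-token probabilities are already available as a list and measures surprisal in bits (log base 2). The function names are illustrative.

```python
import math

def surprisal_bits(p: float) -> float:
    """Surprisal of an event with probability p, in bits: -log2(p)."""
    return -math.log2(p)

def perplexity(token_probs) -> float:
    """Perplexity as 2 raised to the mean per-token surprisal (in bits)."""
    mean_surprisal = sum(surprisal_bits(p) for p in token_probs) / len(token_probs)
    return 2 ** mean_surprisal
```

A model that assigns probability 0.25 to every token has a surprisal of 2 bits per token and a perplexity of 4, which is the same as choosing uniformly among four options.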