curse of multilinguality: The phenomenon where adding more languages to a model's training data degrades its performance on individual high-resource languages due to capacity competition
fertility: The average number of tokens a tokenizer produces per word; lower fertility means more efficient encoding
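The metric reduces to a simple ratio; a minimal sketch (function name and example counts are illustrative):

```python
def fertility(num_tokens: int, num_words: int) -> float:
    # Fertility = tokens produced / words in the text; lower is better,
    # since fewer tokens per word means cheaper, more efficient encoding.
    return num_tokens / num_words

# A tokenizer that splits a 100-word text into 230 tokens:
fertility(230, 100)  # 2.3
```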
RoPE: Rotary Positional Embeddings—a method for encoding position in Transformers by rotating query and key vectors, so attention scores depend on relative offsets; it generalizes better to longer sequences than learned absolute embeddings
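The core operation can be sketched for a single feature pair; this is a simplified illustration, not a full implementation (real models apply it across all pairs of query/key dimensions):

```python
import math

def rope_rotate_pair(x0: float, x1: float, pos: int, inv_freq: float):
    # Rotate one (x0, x1) feature pair by the angle pos * inv_freq.
    # Because rotations compose, the dot product of two rotated vectors
    # depends only on their relative position offset.
    theta = pos * inv_freq
    c, s = math.cos(theta), math.sin(theta)
    return (x0 * c - x1 * s, x0 * s + x1 * c)
```

Note that the rotation is norm-preserving, so it injects position without rescaling the features.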
SwiGLU: Swish-Gated Linear Unit—a gated activation function that multiplies a Swish-activated projection with a second linear projection, used in the feed-forward blocks of modern LLMs for better quality at similar cost
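A minimal scalar sketch of the gating, assuming the two inputs are separate linear projections of the same hidden state (the names `gate` and `value` are illustrative):

```python
import math

def swish(x: float, beta: float = 1.0) -> float:
    # Swish (SiLU when beta = 1): x * sigmoid(beta * x)
    return x / (1.0 + math.exp(-beta * x))

def swiglu(gate: float, value: float) -> float:
    # SwiGLU gates the linear path `value` with Swish(gate).
    # In a Transformer FFN, gate = x @ W and value = x @ V come from
    # two independent weight matrices applied to the same input.
    return swish(gate) * value
```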
MERA: Multimodal Evaluation of Russian-language Architectures—a benchmark suite for evaluating Russian language models
MMLU: Massive Multitask Language Understanding—a benchmark measuring knowledge and problem-solving ability across 57 subjects
ABF: Adjusted Base Frequency—a technique for extending the context window of RoPE-based models by increasing the rotary base frequency so positional rotations advance more slowly
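The effect on the per-pair rotation frequencies can be sketched as follows; the specific base values are illustrative (10000 is the common RoPE default, and ABF raises it):

```python
def rope_inv_freq(dim: int, base: float = 10000.0):
    # Standard RoPE inverse frequencies, one per feature pair:
    # inv_freq_i = base^(-2i / dim). ABF increases `base`, shrinking
    # every frequency so positions far beyond the original training
    # length still map to distinct, slowly varying angles.
    return [base ** (-2 * i / dim) for i in range(dim // 2)]
```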
RMSNorm: Root Mean Square Normalization—a simpler alternative to LayerNorm that rescales layer inputs by their root mean square without mean-centering, improving training stability at lower cost
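A minimal pure-Python sketch of the normalization (the learned `weight` vector and `eps` follow the usual convention, but exact defaults vary by implementation):

```python
import math

def rms_norm(x, weight, eps: float = 1e-6):
    # RMSNorm: divide each element by the root mean square of the vector,
    # then apply a learned per-dimension scale. Unlike LayerNorm, the mean
    # is not subtracted and there is no bias term.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]
```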
DPO: Direct Preference Optimization—an alignment method that optimizes a policy directly on preference data without a separate reward model
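The per-pair loss has a closed form; a minimal sketch, assuming the four sequence log-probabilities have already been computed by the policy and the frozen reference model (argument names and the `beta` default are illustrative):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    # DPO loss for one preference pair: the implicit reward of a response
    # is beta times its policy/reference log-ratio, and the loss is
    # -log sigmoid of the chosen-minus-rejected reward margin.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; it falls as the policy shifts probability mass toward the chosen response.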