
Gamayun's Path to Multilingual Mastery: Cost-Efficient Training of a 1.5 B-Parameter LLM

A Podolskiy, S Molokov, T Gerasin, M Titov…
arXiv, 12/2025
Tags: Pretraining, RL, Reasoning, Benchmark

📝 Paper Summary

Topics: Small Language Models (SLMs), Multilingual LLM Pre-training
Gamayun is a 1.5B parameter multilingual model that overcomes the curse of multilinguality via a two-stage pre-training strategy: balanced multilingual alignment followed by high-quality English enrichment.
Core Problem
Training small (<2B) multilingual models from scratch is difficult because adding multiple languages often degrades performance in the primary language (the 'curse of multilinguality') and requires massive data usually reserved for larger models.
Why it matters:
  • Resource-constrained environments need efficient models, but existing small models are 90%+ English and lack true multilingual capability
  • Naive mixing of languages in limited-capacity models leads to competition for parameters, harming performance in high-resource languages like English and Russian
  • Organizations need full control over data for domain-specific applications, which distillation from larger proprietary models prevents
Concrete Example: When the authors trained two 750M-parameter models on Wikipedia, one English-only and one multilingual, the multilingual model showed higher English perplexity and worse LAMBADA performance despite seeing the same number of English tokens, indicating that the additional languages acted as noise for the primary language.
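The metric behind this controlled comparison is standard perplexity over a shared held-out English set. A minimal sketch of that computation, assuming per-token negative log-likelihoods have already been extracted from each model (the values below are illustrative, not from the paper):

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Toy illustration: a model with higher average NLL on the same
# held-out English text has higher perplexity.
english_only = perplexity([2.1, 1.8, 2.0])
multilingual = perplexity([2.4, 2.2, 2.3])
assert multilingual > english_only
```

Because both models see identical English tokens, any perplexity gap is attributable to the extra languages competing for capacity.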
Key Novelty
Two-Stage Dynamic Data Mixing
  • Stage 1 (Alignment): Train on a balanced mix of 12 languages (approx. 37% English) to establish cross-lingual representations and align linguistic capabilities.
  • Stage 2 (Enrichment): Drastically increase the proportion of high-quality English data (approx. 70%) and domain-rich data (STEM, code) to transfer reasoning capabilities to other languages without losing multilingual proficiency.
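The two stages above amount to switching the language-sampling distribution partway through training. A minimal sketch of stage-dependent batch sampling, using only the approximate English shares stated in the summary (the pool names and the uniform split over non-English languages are assumptions, not the paper's exact recipe):

```python
import random

# Approximate English proportions per stage, from the summary:
# Stage 1 ("alignment"): balanced 12-language mix, ~37% English.
# Stage 2 ("enrichment"): ~70% high-quality English + STEM/code.
STAGE1_WEIGHTS = {"en": 0.37, "other": 0.63}  # "other" spread over 11 languages
STAGE2_WEIGHTS = {"en": 0.70, "other": 0.30}

def sample_language(stage: int, rng: random.Random) -> str:
    """Pick the language pool for the next training batch by stage-specific weights."""
    weights = STAGE1_WEIGHTS if stage == 1 else STAGE2_WEIGHTS
    langs = list(weights)
    return rng.choices(langs, weights=[weights[l] for l in langs], k=1)[0]

# Usage: draw batch languages for each curriculum stage.
rng = random.Random(42)
stage1_batches = [sample_language(1, rng) for _ in range(8)]
stage2_batches = [sample_language(2, rng) for _ in range(8)]
```

The key design point is that the mixture is dynamic: the same model and data pools are reused, and only the sampling weights change between stages.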
Evaluation Highlights
  • Outperforms LLaMA3.2-1B (trained on 9T tokens) on all considered benchmarks despite using only 2.5T tokens.
  • Surpasses Qwen2.5-1.5B (18T tokens) on most English and multilingual tasks, trailing only in MMLU.
  • Achieves state-of-the-art results on the Russian MERA benchmark among models of comparable size (1-2B parameters).
Breakthrough Assessment
7/10
Strong practical contribution for low-resource multilingual training. Demonstrates that 2.5T tokens suffice for competitive performance when the data mix is dynamic, challenging the trend toward ever-larger token counts for small models.