
Gamayun's Path to Multilingual Mastery: Cost-Efficient Training of a 1.5 B-Parameter LLM

A Podolskiy, S Molokov, T Gerasin, M Titov…
arXiv, 12/2025
Tags: Pretraining, RL, Reasoning, Benchmark

📝 Paper Summary

Topics: Small Language Models (SLMs), Multilingual LLM Pre-training
Gamayun is a 1.5B parameter multilingual model that overcomes the curse of multilinguality via a two-stage pre-training strategy: balanced multilingual alignment followed by high-quality English enrichment.
Core Problem
Training small (<2B) multilingual models from scratch is difficult because adding multiple languages often degrades performance in the primary language (the 'curse of multilinguality') and requires massive data usually reserved for larger models.
Why it matters:
  • Resource-constrained environments need efficient models, but existing small models are 90%+ English and lack true multilingual capability
  • Naive mixing of languages in limited-capacity models leads to competition for parameters, harming performance in high-resource languages like English and Russian
  • Organizations need full control over data for domain-specific applications, which distillation from larger proprietary models prevents
Concrete Example: When the authors trained two 750M-parameter models on Wikipedia, one English-only and one multilingual, the multilingual model showed higher English perplexity and worse LAMBADA performance despite seeing the same number of English tokens, indicating that the additional languages acted as noise for the primary language.
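The metric behind this controlled comparison is standard perplexity over a shared held-out English set. A minimal sketch of that computation, assuming per-token negative log-likelihoods have already been extracted from each model (the values below are illustrative, not from the paper):

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Toy illustration: a model with higher average NLL on the same
# held-out English text has higher perplexity.
english_only = perplexity([2.1, 1.8, 2.0])
multilingual = perplexity([2.4, 2.2, 2.3])
assert multilingual > english_only
```

Because both models see identical English tokens, any perplexity gap is attributable to the extra languages competing for capacity.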
Key Novelty
Two-Stage Dynamic Data Mixing
  • Stage 1 (Alignment): Train on a balanced mix of 12 languages (approx. 37% English) to establish cross-lingual representations and align linguistic capabilities.
  • Stage 2 (Enrichment): Drastically increase the proportion of high-quality English data (approx. 70%) and domain-rich data (STEM, code) to transfer reasoning capabilities to other languages without losing multilingual proficiency.
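The two stages above amount to switching the language-sampling distribution partway through training. A minimal sketch of stage-dependent batch sampling, using only the approximate English shares stated in the summary (the pool names and the uniform split over non-English languages are assumptions, not the paper's exact recipe):

```python
import random

# Approximate English proportions per stage, from the summary:
# Stage 1 ("alignment"): balanced 12-language mix, ~37% English.
# Stage 2 ("enrichment"): ~70% high-quality English + STEM/code.
STAGE1_WEIGHTS = {"en": 0.37, "other": 0.63}  # "other" spread over 11 languages
STAGE2_WEIGHTS = {"en": 0.70, "other": 0.30}

def sample_language(stage: int, rng: random.Random) -> str:
    """Pick the language pool for the next training batch by stage-specific weights."""
    weights = STAGE1_WEIGHTS if stage == 1 else STAGE2_WEIGHTS
    langs = list(weights)
    return rng.choices(langs, weights=[weights[l] for l in langs], k=1)[0]

# Usage: draw batch languages for each curriculum stage.
rng = random.Random(42)
stage1_batches = [sample_language(1, rng) for _ in range(8)]
stage2_batches = [sample_language(2, rng) for _ in range(8)]
```

The key design point is that the mixture is dynamic: the same model and data pools are reused, and only the sampling weights change between stages.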
Evaluation Highlights
  • Outperforms LLaMA3.2-1B (trained on 9T tokens) on all considered benchmarks despite using only 2.5T tokens.
  • Surpasses Qwen2.5-1.5B (18T tokens) on most English and multilingual tasks, trailing only in MMLU.
  • Achieves state-of-the-art results on the Russian MERA benchmark among models of comparable size (1-2B parameters).
Breakthrough Assessment
7/10
Strong practical contribution for low-resource multilingual training. Demonstrates that 2.5T tokens suffice for competitive performance when the data mix is dynamic, challenging the trend toward ever-larger token counts for small models.