_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
RoPE: Rotary Position Embeddings—a method for encoding positional information in transformers by rotating query and key vectors through position-dependent angles, so that attention scores depend on the relative distance between tokens
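A minimal pure-Python sketch of the rotation (illustrative only — real implementations operate on batched tensors with fused kernels; the function name `apply_rope` and the list-based representation are assumptions for clarity):

```python
import math

def apply_rope(x, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of x by position-dependent angles.

    x: flat list of floats with even length (a query or key vector).
    pos: integer token position.
    Each pair i is rotated by theta = pos * base**(-i / d), giving lower
    frequencies to higher dimensions.
    """
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s,
                    x[i] * s + x[i + 1] * c])
    return out

# At position 0 every rotation angle is zero, so the vector is unchanged.
print(apply_rope([1.0, 0.0, 1.0, 0.0], pos=0))  # → [1.0, 0.0, 1.0, 0.0]
```

The key property is that the dot product between a rotated query at position m and a rotated key at position n depends only on the offset m − n, which is what makes the encoding "relative".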
GQA: Grouped-Query Attention—an efficiency technique where multiple query heads share a single key-value head, shrinking the key-value cache and its memory usage
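The head-sharing scheme reduces to a simple index mapping. A toy sketch (head counts are illustrative, not the model's actual configuration):

```python
# 8 query heads sharing 2 key-value heads: each KV head serves a
# contiguous group of 4 query heads, so the KV cache is 4x smaller
# than with standard multi-head attention.
n_q_heads, n_kv_heads = 8, 2
group_size = n_q_heads // n_kv_heads

# Which KV head each query head attends with.
kv_head_for_q = [q // group_size for q in range(n_q_heads)]
print(kv_head_for_q)  # → [0, 0, 0, 0, 1, 1, 1, 1]
```

Multi-head attention is the special case n_kv_heads == n_q_heads; multi-query attention is n_kv_heads == 1.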
SwiGLU: Swish-Gated Linear Unit—a gated activation combining the Swish (SiLU) function with a Gated Linear Unit, used in transformer feed-forward layers in place of plain ReLU/GELU activations
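An elementwise sketch of the gate (assumed shapes: `gate` and `up` stand in for the two linear projections x·W and x·V that a real feed-forward layer would compute):

```python
import math

def swish(x):
    """Swish / SiLU with beta = 1: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu(gate, up):
    """Elementwise SwiGLU: Swish(gate) * up.

    In a transformer feed-forward layer, gate and up are two separate
    linear projections of the same input; the gated result is then
    projected back down by a third matrix.
    """
    return [swish(g) * u for g, u in zip(gate, up)]

# A zero gate closes its channel regardless of the up projection.
print(swiglu([0.0, 2.0], [5.0, 1.0]))
```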
BPE: Byte Pair Encoding—a tokenization algorithm that iteratively merges the most frequent pair of bytes or characters
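One merge step of the iterative algorithm can be sketched as (toy symbol-level version; production tokenizers work on bytes and apply a learned merge table):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most frequent pair of adjacent symbols."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of pair with a merged symbol."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("aababab")
pair = most_frequent_pair(tokens)   # ('a', 'b') occurs three times
print(merge_pair(tokens, pair))     # → ['a', 'ab', 'ab', 'ab']
```

Training repeats these two steps until a target vocabulary size is reached; each merged pair becomes a new vocabulary entry.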
Curriculum Learning: A training strategy where the difficulty or distribution of training data is meaningfully ordered or scheduled over time
Upsampling: Artificially increasing the frequency of data from underrepresented classes (here, low-resource languages) during training
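One common way to implement this is temperature-based sampling over per-language corpus sizes. The sketch below is a generic illustration of that technique, not necessarily the exact scheme used for this model:

```python
def sampling_weights(sizes, temperature=0.7):
    """Temperature-scaled sampling probabilities: p_i ∝ size_i ** T.

    sizes: dict mapping language -> corpus size.
    T = 1 reproduces the raw data proportions; T < 1 flattens the
    distribution, upsampling low-resource languages relative to
    their raw share.
    """
    scaled = [s ** temperature for s in sizes.values()]
    total = sum(scaled)
    return {lang: w / total for lang, w in zip(sizes, scaled)}

# With T < 1, the small corpus gets a larger share than its raw 1/101.
print(sampling_weights({"en": 100.0, "bs": 1.0}, temperature=0.5))
```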
COMET: A neural framework for training machine translation evaluation models that correlate well with human judgments of translation quality
RMSNorm: Root Mean Square Normalization—a normalization technique that re-scales inputs based on their root mean square, simpler than LayerNorm
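A pure-Python sketch of the normalization (real implementations use a learned per-dimension gain and operate on tensors; `gain` defaulting to ones is an assumption here):

```python
import math

def rmsnorm(x, gain=None, eps=1e-6):
    """RMSNorm: x / RMS(x), scaled by an optional learned gain.

    Unlike LayerNorm, there is no mean subtraction and no bias term,
    making it cheaper while working comparably well in practice.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    g = gain if gain is not None else [1.0] * len(x)
    return [gi * v / rms for gi, v in zip(g, x)]

# [3, 4] has RMS sqrt(12.5) ≈ 3.5355; the output has RMS ≈ 1.
print(rmsnorm([3.0, 4.0]))
```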
Onion: ONe Instance ONly—a deduplication tool that removes documents containing high ratios of duplicate n-grams
FlashAttention: An algorithm that speeds up attention computation and reduces memory usage (implied by the Llama 3 architecture context, though the specific kernel is not detailed)
Focus languages: The 17 specific European languages (e.g., Bosnian, Estonian, Ukrainian) targeted for equitable performance in this model