
DEPT: Decoupled Embeddings for Pre-training Language Models

A Iacob, L Sani, M Kurmanji, WF Shen, X Qiu, D Cai…
University of Maryland, University of North Carolina at Chapel Hill, Max Planck Institute for Intelligent Systems, Tübingen
arXiv, 10/2024
Pretraining · Memory

📝 Paper Summary

Language Model Pre-training · Efficient Training · Federated Learning · Multilingual/Multi-domain Learning
DEPT decouples token embeddings from the transformer body during pre-training, enabling efficient learning on heterogeneous data sources without a shared vocabulary while improving model generalization and plasticity.
Core Problem
Training language models on heterogeneous data mixtures (different languages/domains) causes negative interference and the 'curse of multilinguality' due to capacity contention and vocabulary dilution.
Why it matters:
  • Standard multilingual models allocate a large share of their parameters (40-80%) to shared embeddings, leaving low-resource languages under-represented.
  • Existing methods require expensive hyperparameter tuning (like temperature sampling) and shared vocabularies that may not fit all data sources optimally.
  • Communication costs in distributed or federated settings are prohibitively high when syncing massive embedding matrices.
Concrete Example: While English might need ~150k tokens, a multilingual model forces hundreds of languages to share 250k tokens, causing 'vocabulary dilution' where languages compete for representation, degrading performance for low-resource languages like Swahili or Urdu.
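To see why shared embeddings dominate the parameter budget, a back-of-envelope calculation helps. The sizes below (250k vocabulary, hidden dimensions of 768 and 2048, and the common ~12·d² parameters per transformer block) are illustrative assumptions, not figures from the paper:

```python
# Back-of-envelope: fraction of model parameters consumed by a shared
# multilingual token-embedding matrix. Sizes are illustrative only.
def embedding_fraction(vocab_size, d_model, n_layers):
    emb = vocab_size * d_model            # token embedding matrix
    body = n_layers * 12 * d_model ** 2   # ~12*d^2 params per transformer block
    return emb / (emb + body)

# Small model with a 250k multilingual vocabulary:
small = embedding_fraction(250_000, 768, 12)    # ≈ 0.69
# Billion-scale model with the same vocabulary:
large = embedding_fraction(250_000, 2048, 24)   # ≈ 0.30
```

The smaller the model, the more the shared vocabulary crowds out transformer capacity, which is consistent with the 40-80% range quoted above.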
Key Novelty
Decoupled Embeddings for Pre-Training (DEPT)
  • Isolate data sources as independent 'silos' that train a shared transformer body but maintain local, specialized embeddings (token and position) tailored to their specific vocabulary.
  • Train iteratively, as in federated learning: send the transformer body to each source, update it locally against source-specific embeddings, and aggregate only the body (or trimmed embeddings) to cut communication.
  • Leverage the insight that transformer bodies are largely vocabulary-agnostic, allowing them to learn abstract representations even when input/output layers are disjoint across sources.
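The iterative scheme above can be sketched as a tiny federated loop. This is a toy illustration of the idea (local embeddings never leave their silo; only body parameters are averaged), not the paper's implementation; the parameter shapes and the stand-in "training" step are invented for clarity:

```python
# Toy sketch of DEPT-style decoupled training: each data silo keeps its
# own embedding table (note the differing vocab sizes), while only the
# shared transformer-body parameters are averaged each round.

def average(params_list):
    """Element-wise average of per-silo body parameters."""
    n = len(params_list)
    return [sum(vals) / n for vals in zip(*params_list)]

def local_update(body, embeddings, lr=0.1):
    """Stand-in for local SGD; the fake 'gradient' here is just w."""
    new_body = [w - lr * w for w in body]
    new_emb = [e - lr * e for e in embeddings]
    return new_body, new_emb

silos = {
    "english": {"emb": [1.0, 2.0]},   # silo-specific vocabularies differ
    "swahili": {"emb": [0.5]},        # so embedding shapes differ too
}
body = [0.4, -0.2, 0.1]               # shared transformer body

for _ in range(3):                    # communication rounds
    updates = []
    for name, silo in silos.items():
        b, e = local_update(list(body), silo["emb"])
        silo["emb"] = e               # embeddings stay local to the silo
        updates.append(b)
    body = average(updates)           # only the body is communicated
```

Because the embedding matrices are excluded from aggregation, per-round communication scales with the body alone, which is the source of the savings reported below.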
Architecture
Figure 2: Comparison of the standard pre-training pipeline vs. the DEPT pipeline variants (GLOB, TRIM, SPEC).
Evaluation Highlights
  • Reduces communication costs by up to 714× and embedding memory by 80% (409M parameters) for billion-scale multilingual models.
  • Improves average validation perplexity by up to 20% compared to standard distributed baselines on The Pile and MC4 datasets.
  • Outperforms standard pre-training on downstream tasks (MNLI, RACE, STSB) regardless of initialization strategy.
Breakthrough Assessment
8/10
Significant efficiency gains (orders of magnitude in communication) and improved generalization without shared vocabularies. Challenges the standard paradigm that pre-training requires a unified global tokenizer.