
DEPT: Decoupled Embeddings for Pre-training Language Models

A Iacob, L Sani, M Kurmanji, WF Shen, X Qiu, D Cai…
University of Maryland, University of North Carolina at Chapel Hill, Max Planck Institute for Intelligent Systems, Tübingen
arXiv, 10/2024
Pretraining · Memory

📝 Paper Summary

Language Model Pre-training · Efficient Training · Federated Learning · Multilingual/Multi-domain Learning
DEPT decouples token embeddings from the transformer body during pre-training, enabling efficient learning on heterogeneous data sources without a shared vocabulary while improving model generalization and plasticity.
Core Problem
Training language models on heterogeneous data mixtures (different languages/domains) causes negative interference and the 'curse of multilinguality' due to capacity contention and vocabulary dilution.
Why it matters:
  • Standard multilingual models allocate a large share of their parameters (40-80%) to shared embeddings, leaving low-resource languages under-represented.
  • Existing methods require expensive hyperparameter tuning (like temperature sampling) and shared vocabularies that may not fit all data sources optimally.
  • Communication costs in distributed or federated settings are prohibitively high when syncing massive embedding matrices.
Concrete Example: While English might need ~150k tokens, a multilingual model forces hundreds of languages to share 250k tokens, causing 'vocabulary dilution' where languages compete for representation, degrading performance for low-resource languages like Swahili or Urdu.
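To see why shared embeddings dominate the parameter budget, a back-of-envelope calculation helps. The sizes below (250k vocabulary, hidden dimensions of 768 and 2048, and the common ~12·d² parameters per transformer block) are illustrative assumptions, not figures from the paper:

```python
# Back-of-envelope: fraction of model parameters consumed by a shared
# multilingual token-embedding matrix. Sizes are illustrative only.
def embedding_fraction(vocab_size, d_model, n_layers):
    emb = vocab_size * d_model            # token embedding matrix
    body = n_layers * 12 * d_model ** 2   # ~12*d^2 params per transformer block
    return emb / (emb + body)

# Small model with a 250k multilingual vocabulary:
small = embedding_fraction(250_000, 768, 12)    # ≈ 0.69
# Billion-scale model with the same vocabulary:
large = embedding_fraction(250_000, 2048, 24)   # ≈ 0.30
```

The smaller the model, the more the shared vocabulary crowds out transformer capacity, which is consistent with the 40-80% range quoted above.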
Key Novelty
Decoupled Embeddings for Pre-Training (DEPT)
  • Isolate data sources as independent 'silos' that train a shared transformer body but maintain local, specialized embeddings (token and position) tailored to their specific vocabulary.
  • Train iteratively, as in federated learning: send the transformer body to each source, update it locally against source-specific embeddings, and aggregate only the body (or trimmed embeddings) to cut communication.
  • Leverage the insight that transformer bodies are largely vocabulary-agnostic, allowing them to learn abstract representations even when input/output layers are disjoint across sources.
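The iterative scheme above can be sketched as a tiny federated loop. This is a toy illustration of the idea (local embeddings never leave their silo; only body parameters are averaged), not the paper's implementation; the parameter shapes and the stand-in "training" step are invented for clarity:

```python
# Toy sketch of DEPT-style decoupled training: each data silo keeps its
# own embedding table (note the differing vocab sizes), while only the
# shared transformer-body parameters are averaged each round.

def average(params_list):
    """Element-wise average of per-silo body parameters."""
    n = len(params_list)
    return [sum(vals) / n for vals in zip(*params_list)]

def local_update(body, embeddings, lr=0.1):
    """Stand-in for local SGD; the fake 'gradient' here is just w."""
    new_body = [w - lr * w for w in body]
    new_emb = [e - lr * e for e in embeddings]
    return new_body, new_emb

silos = {
    "english": {"emb": [1.0, 2.0]},   # silo-specific vocabularies differ
    "swahili": {"emb": [0.5]},        # so embedding shapes differ too
}
body = [0.4, -0.2, 0.1]               # shared transformer body

for _ in range(3):                    # communication rounds
    updates = []
    for name, silo in silos.items():
        b, e = local_update(list(body), silo["emb"])
        silo["emb"] = e               # embeddings stay local to the silo
        updates.append(b)
    body = average(updates)           # only the body is communicated
```

Because the embedding matrices are excluded from aggregation, per-round communication scales with the body alone, which is the source of the savings reported below.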
Architecture
Figure 2: Comparison of the standard pre-training pipeline vs. the DEPT pipeline variants (GLOB, TRIM, SPEC).
Evaluation Highlights
  • Reduces communication costs by up to 714× and embedding memory by 80% (409M parameters) for billion-scale multilingual models.
  • Improves average validation perplexity by up to 20% compared to standard distributed baselines on The Pile and MC4 datasets.
  • Outperforms standard pre-training on downstream tasks (MNLI, RACE, STSB) regardless of initialization strategy.
Breakthrough Assessment
8/10
Significant efficiency gains (orders of magnitude in communication) and improved generalization without shared vocabularies. Challenges the standard paradigm that pre-training requires a unified global tokenizer.