Memory Augmented Language Models through Mixture of Word Experts

📝 Paper Summary

Memory recall Sparse memory QA

MoWE decouples model capacity from compute by using a massive number of word-specific experts routed via a static, knowledge-rich vocabulary, effectively acting as an integrated sparse memory.

Core Problem

Dense language models require proportionally more FLOPs to scale parameters for knowledge-intensive tasks, while standard MoEs struggle with routing efficiency and lack semantic specialization in their experts.

Why it matters:

Increasing parameter count improves world knowledge retention but drastically increases training and inference costs
Existing memory-augmented models often require complex external retrieval mechanisms (like k-NN search) or specialized training losses
Standard MoE routing doesn't guarantee that experts specialize in specific concepts, limiting their ability to act as interpretable memory key-value pairs

Concrete Example: In TriviaQA, answering 'What is Neptune's main satellite?' requires retrieving the specific entity 'Triton'. A standard dense model must process this with all parameters. MoWE routes the token 'Neptune' to a specific expert capable of recalling 'Triton' and skips all other experts. With this design, MoWE-Large reaches T5-XXL-level performance at roughly T5-Large compute.

Key Novelty

Mixture-of-Word-Experts (MoWE)

Replaces dynamic learned routing with a fixed routing function based on a large 'knowledge-rich' vocabulary (~1M tokens derived from Wikidata/C4)
Assigns specific words/entities to specific Feed-Forward Network (FFN) experts, encouraging them to act as static key-value memory slots for those concepts
Uses a hierarchical lookup (Frequency Bucketing + Expert Blocks) to handle massive expert counts (up to 1M) efficiently despite Zipfian word distributions
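To illustrate why a Zipfian word distribution makes static word-to-expert assignment hard to balance, the sketch below draws word frequencies proportional to 1/rank and compares two fixed assignments: round-robin by rank and a greedy least-loaded packing. The greedy packing is only an illustrative stand-in, not the paper's frequency-bucketing scheme, and `zipf_weights`, `round_robin`, and `greedy_pack` are hypothetical names.

```python
import heapq

def zipf_weights(n_words):
    """Normalized frequencies with weight proportional to 1/rank (idealized Zipf)."""
    raw = [1.0 / rank for rank in range(1, n_words + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def round_robin(weights, n_experts):
    """Assign word i to expert i % n_experts and return the per-expert load."""
    loads = [0.0] * n_experts
    for word_id, w in enumerate(weights):
        loads[word_id % n_experts] += w
    return loads

def greedy_pack(weights, n_experts):
    """Assign words, most frequent first, to the currently least-loaded expert."""
    heap = [(0.0, e) for e in range(n_experts)]
    heapq.heapify(heap)
    for w in sorted(weights, reverse=True):
        load, e = heapq.heappop(heap)
        heapq.heappush(heap, (load + w, e))
    return [load for load, _ in sorted(heap, key=lambda item: item[1])]

weights = zipf_weights(1000)
rr_loads = round_robin(weights, 10)
gp_loads = greedy_pack(weights, 10)
print("round-robin max load:", max(rr_loads))
print("greedy max load:     ", max(gp_loads))
print("top word share:      ", weights[0])
```

In this toy setting no static packing can push the maximum load below the share of the single most frequent word, which hints at why very frequent tokens need separate handling from the long tail.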

Architecture

Schematic of the MoWE Layer replacing the FFN layer in Transformer blocks.

Evaluation Highlights

MoWE-Base outperforms T5-XL on TriviaQA (39.4 vs 36.0 Exact Match) while using ~8.6x fewer FLOPs during training
MoWE-Large achieves 44.8 EM on TriviaQA, outperforming T5-XXL (42.9 EM) with significantly faster training (6.6x speedup)
Outperforms standard MoE baselines (GShard Top-2) on knowledge-intensive tasks (e.g., +3.2 EM on TriviaQA vs Top-2 MoE)

Breakthrough Assessment

7/10

Strong empirical results on efficiency vs. performance trade-offs for knowledge tasks. The fixed semantic routing is a clever simplification of MoE, though limited by the static vocabulary constraint.

⚙️ Technical Details

Problem Definition

Setting: Open-domain question answering and general language understanding tasks

Inputs: Natural language sequence (e.g., questions)

Outputs: Generated text answers

Pipeline Flow

Input Tokenization (Standard T5 tokenizer)
Routing Tokenization (Map tokens to auxiliary knowledge vocabulary via hash)
Hierarchical Routing (Map Routing ID → Frequency Bucket → Expert Block → Expert)
Expert Processing (Apply specific FFN to token)
Output Generation (Transformer decoder)
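The routing steps of the pipeline can be sketched as a chain of static lookups. This is a toy illustration under stated assumptions: the vocabulary, the bucket boundaries, and the arithmetic that splits an ID into block and expert are all hypothetical, since the summary does not specify the actual tables.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    bucket: int
    block: int
    expert: int

# Hypothetical toy routing vocabulary; IDs are assumed to be ordered by frequency rank.
ROUTING_VOCAB = {"the": 0, "of": 1, "neptune": 2, "triton": 3, "satellite": 4}
UNKNOWN_ID = len(ROUTING_VOCAB)  # shared fallback ID for out-of-vocabulary tokens

# Assumed layout: routing IDs [0, 2) fall in bucket 0, IDs [2, 6) in bucket 1.
BUCKET_BOUNDS = [2, 6]
BLOCKS_PER_BUCKET = 2
EXPERTS_PER_BLOCK = 2

def routing_id(token):
    """Routing tokenization: map a token to its ID in the auxiliary vocabulary."""
    return ROUTING_VOCAB.get(token.lower(), UNKNOWN_ID)

def route(rid):
    """Hierarchical routing: routing ID -> frequency bucket -> expert block -> expert."""
    bucket = next(b for b, upper in enumerate(BUCKET_BOUNDS) if rid < upper)
    lower = BUCKET_BOUNDS[bucket - 1] if bucket > 0 else 0
    offset = rid - lower
    block = offset % BLOCKS_PER_BUCKET
    expert = (offset // BLOCKS_PER_BUCKET) % EXPERTS_PER_BLOCK
    return Route(bucket, block, expert)

for token in ["What", "is", "Neptune", "the", "Triton"]:
    print(token, routing_id(token), route(routing_id(token)))
```

Because every step is a table lookup or integer arithmetic, the same token always reaches the same expert regardless of its context.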

System Modules

Routing Tokenizer (Routing & Selection)

Maps input tokens to a large auxiliary vocabulary (~1M entries) derived from Wikidata/C4

Model or implementation: Hash-based lookup (online approximation)

Frequency Bucketer (Routing & Selection)

Routes tokens to buckets based on frequency to handle load balancing

Model or implementation: Static mapping

Expert Block Selector (Routing & Selection)

Routes each token to a specific hardware/tensor block within its bucket

Model or implementation: Static mapping

Word Expert FFN

Processes the token representation using knowledge-specific parameters

Model or implementation: Feed-Forward Network (FFN)

Novel Architectural Elements

Fixed Routing Function: Deterministic routing based on token ID in a massive auxiliary vocabulary (vs. learned gating)
Hierarchical Routing Strategy: Bucketing → Block → Expert hierarchy to manage 32K-1M experts efficiently on TPU hardware
Word-Specific Experts: Experts tied explicitly to vocabulary words, acting as discrete memory slots
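A minimal NumPy sketch of a layer with fixed routing follows. Each token's expert is read from a static table indexed by its routing ID, with no gating network, and the token is then passed through that expert's FFN. The function name `mowe_layer`, the tensor shapes, and the plain ReLU FFN are simplifying assumptions (T5.1.1 itself uses a gated-GELU feed-forward layer), and a real implementation would dispatch tokens to experts rather than gather weights per token.

```python
import numpy as np

def mowe_layer(x, routing_ids, expert_of_id, w_in, w_out):
    """Apply a word-expert FFN with fixed routing.

    x:            [tokens, d_model] token representations
    routing_ids:  [tokens] integer IDs in the routing vocabulary
    expert_of_id: [routing_vocab] static table mapping routing ID -> expert index
    w_in:         [experts, d_model, d_ff] first FFN matrix of each expert
    w_out:        [experts, d_ff, d_model] second FFN matrix of each expert
    """
    expert_ids = expert_of_id[routing_ids]               # static lookup, nothing learned
    hidden = np.einsum("td,tdf->tf", x, w_in[expert_ids])
    hidden = np.maximum(hidden, 0.0)                     # ReLU, a simplification
    return np.einsum("tf,tfd->td", hidden, w_out[expert_ids])

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
routing_ids = np.array([0, 5, 5, 2])
expert_of_id = np.array([0, 1, 2, 0, 1, 2])  # hypothetical table: 6 routing IDs, 3 experts
w_in = rng.normal(size=(3, 8, 16))
w_out = rng.normal(size=(3, 16, 8))
y = mowe_layer(x, routing_ids, expert_of_id, w_in, w_out)
print(y.shape)
```

Tokens 1 and 2 share routing ID 5, so both are processed by the same expert parameters even though their input vectors differ.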

Modeling

Base Model: T5.1.1 architecture backbone

Training Method: Span-masking pretraining (C4) followed by fine-tuning with frozen experts

Objective Functions:

Purpose: Standard language modeling.

Formally: Cross-entropy loss on target tokens.

Trainable Parameters: Only non-expert parameters (attention, layer norms, dense FFNs) are trained during fine-tuning; experts are frozen.
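Freezing the experts during fine-tuning amounts to skipping their gradient updates. The sketch below shows this with a plain SGD step over a flat parameter dictionary. The `expert/` naming convention, the `sgd_step` helper, and the use of plain SGD are assumptions for illustration; the summary does not state the optimizer or parameter layout.

```python
import numpy as np

def sgd_step(params, grads, lr, frozen_prefixes=("expert/",)):
    """Return updated parameters, leaving any parameter with a frozen prefix untouched."""
    updated = {}
    for name, value in params.items():
        if name.startswith(frozen_prefixes):
            updated[name] = value                      # frozen: gradient is ignored
        else:
            updated[name] = value - lr * grads[name]   # trainable: ordinary SGD update
    return updated

params = {
    "expert/w_in": np.ones(3),       # word-expert weights (frozen during fine-tuning)
    "attention/query": np.ones(3),   # non-expert weights (trained)
}
grads = {name: np.ones_like(value) for name, value in params.items()}
new_params = sgd_step(params, grads, lr=2e-4)
print(new_params["expert/w_in"], new_params["attention/query"])
```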

Training Data:

Pretraining: C4 dataset (~1 trillion tokens)
Fine-tuning: TriviaQA, WebQuestions, Natural Questions, FEVER, SuperGLUE

Key Hyperparameters:

batch_size: 2048
input_sequence_length: 512
training_steps: 1M (pretraining)
learning_rate_finetuning: 2e-4
dropout: 0.05
expert_count: 32,768 (standard config)
auxiliary_vocab_size: 1,000,000
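The listed hyperparameters are consistent with the ~1 trillion token figure quoted under Training Data, if that figure is read as input tokens processed during pretraining rather than unique tokens in the corpus. That reading is an assumption, as is treating `batch_size` as a count of sequences and every sequence as full length.

```python
# Values taken from the hyperparameter list above.
batch_size = 2048
input_sequence_length = 512
training_steps = 1_000_000

# Assumption: each step consumes batch_size full-length input sequences.
tokens_per_step = batch_size * input_sequence_length   # 1,048,576 tokens per step
total_input_tokens = tokens_per_step * training_steps

print(f"{total_input_tokens:.3e}")  # 1.049e+12
```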

Compute: Pretraining: 64 v3 TPUs. MoWE-Base is ~4.3x faster to train than T5-XL; MoWE-Large is ~6.6x faster than T5-XXL.

Comparison to Prior Work

vs. T5: MoWE uses sparse layers with massive expert counts to decouple capacity from FLOPs.
vs. GShard/Switch: MoWE uses fixed, vocabulary-based routing instead of learned gating; supports orders of magnitude more experts (32K+ vs 128).
vs. EaE/TOME: MoWE integrates memory directly into FFN layers via routing, avoiding external K-NN search or separate memory banks.
vs. Hash Layers [Roller et al.]: MoWE groups tokens semantically via knowledge vocabulary, whereas Hash Layers bucket random token IDs together [not cited in paper].
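For contrast with fixed routing, the sketch below shows the core of learned top-2 gating as used in GShard-style MoE layers: a trainable projection scores every expert for each token, and the two highest-scoring experts are selected with renormalized weights. This is a simplified illustration. Expert capacity limits, the auxiliary load-balancing loss, and any stochastic dispatch of the second expert are omitted, and `top2_gate` is a hypothetical name.

```python
import numpy as np

def top2_gate(x, w_gate):
    """Learned gating: pick the two best experts per token from a softmax over logits.

    x:      [tokens, d_model] token representations
    w_gate: [d_model, experts] trainable gating projection
    Returns (expert indices [tokens, 2], combination weights [tokens, 2]).
    """
    logits = x @ w_gate
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    top2 = np.argsort(-probs, axis=-1)[:, :2]              # indices of the two largest
    top2_probs = np.take_along_axis(probs, top2, axis=-1)
    weights = top2_probs / top2_probs.sum(axis=-1, keepdims=True)
    return top2, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_gate = rng.normal(size=(8, 4))
experts, weights = top2_gate(x, w_gate)
print(experts)
print(weights)
```

Here the chosen experts depend on the token's contextual representation and on trained gating weights, whereas a vocabulary-based router depends only on the token's identity.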

Limitations

Fixed routing does not adapt to context; polysemous words always route to the same expert regardless of meaning
Not tested on decoder-only architectures (e.g., Llama/GPT styles)
Requires constructing a large auxiliary vocabulary, which may be language/domain dependent
The current implementation uses a suboptimal look-ahead window for routing tokenization

Reproducibility

Code is not provided. Pretraining uses the public C4 dataset. Routing vocabulary construction (Wikidata + C4 frequency) is described, but the exact vocabulary file is not linked.

📊 Experiments & Results

Evaluation Setup

Pretraining on C4, then fine-tuning on downstream tasks. Experts frozen during fine-tuning.

Benchmarks:

TriviaQA (Closed-book Question Answering)
WebQuestions (Closed-book Question Answering)
Natural Questions (Closed-book Question Answering)
FEVER (Fact Verification)
SuperGLUE (General Language Understanding)

Metrics:

Exact Match (EM)
Accuracy
F1 score
Statistical methodology: Not explicitly reported in the paper

Key Results

MoWE significantly outperforms dense T5 baselines with comparable FLOPs on knowledge-intensive tasks.

Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA	Exact Match (EM)	24.2	39.4	+15.2
TriviaQA	Exact Match (EM)	36.0	39.4	+3.4
WebQuestions	Exact Match (EM)	29.5	38.8	+9.3

MoWE outperforms standard learned-routing MoE models on knowledge tasks while maintaining parity on general NLU.

Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA	Exact Match (EM)	36.2	39.4	+3.2
SuperGLUE	Avg Score	83.5	83.5	0.0

Experiment Figures

FLOPs vs. Performance (TriviaQA) for MoWE vs T5 models.

Ablation on Number of Experts vs Performance.

Main Takeaways

MoWE successfully decouples model capacity from compute, matching T5-XXL performance with MoWE-Large (similar FLOPs to T5-Large) on knowledge tasks.
Fixed, vocabulary-based routing is highly effective for knowledge retrieval, functioning as a sparse memory.
Skipping experts for knowledge-rich words (routing IDs > 32K) causes a large performance drop (TriviaQA EM falls from 35.1 to 25.6), suggesting that these experts store specific factual knowledge.
Increasing routing vocabulary size consistently improves performance (e.g., +2 F1 on TriviaQA once the vocabulary exceeds 262K entries), validating the benefit of fine-grained semantic routing.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (specifically Feed-Forward Networks)
Mixture-of-Experts (MoE) routing mechanisms
Sparse model training challenges (load balancing, communication overhead)
Knowledge-intensive NLP benchmarks (TriviaQA, NQ)

Key Terms

MoE: Mixture-of-Experts—a neural architecture where different parts of the network (experts) activate for different inputs to scale parameters without scaling compute

FFN: Feed-Forward Network—the dense layers within a Transformer block where factual knowledge is hypothesized to be stored

FLOPs: Floating Point Operations—a measure of computational cost

SPMD: Single Program, Multiple Data—a parallel programming technique used to train large models across many devices

Zipfian distribution: A distribution where a few items (words) occur very frequently while most occur rarely, creating load-balancing challenges for word-specific experts

routing vocabulary: A specialized auxiliary vocabulary (distinct from the tokenizer vocabulary) used solely to determine which expert handles a token

Exact Match (EM): A metric measuring the percentage of predictions that match the ground truth answer exactly
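As a rough sketch of how EM is typically computed for closed-book QA, the code below normalizes both prediction and references (lowercasing, stripping punctuation and English articles, collapsing whitespace) before comparing. This normalization follows the common SQuAD-style convention and is an assumption here; the paper's exact evaluation script is not described in this summary.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, references):
    """True if the normalized prediction equals any normalized reference answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in references)

def em_score(predictions, references_per_question):
    """Percentage of questions whose prediction exactly matches a reference."""
    hits = sum(
        exact_match(pred, refs)
        for pred, refs in zip(predictions, references_per_question)
    )
    return 100.0 * hits / len(predictions)

print(em_score(["Triton.", "Titan"], [["Triton"], ["Triton"]]))  # 50.0
```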

SuperGLUE: A benchmark suite of difficult language understanding tasks

T5: Text-to-Text Transfer Transformer—a widely used encoder-decoder language model

knowledge-rich vocabulary: A vocabulary constructed from Wikidata entities and relations, prioritized by frequency, to ensure experts specialize in semantic concepts

inference latency: The time it takes for a model to generate a response