Scaling Embeddings Outperforms Scaling Experts in Language Models

📝 Paper Summary

Sparse Large Language Models, Model Scaling Laws

LongCat-Flash-Lite demonstrates that scaling embedding parameters via N-gram lookup tables is a superior alternative to scaling Mixture-of-Experts (MoE) parameters for high-sparsity, wide architectures, enabling massive parameter counts without computational explosion.

Core Problem

Mixture-of-Experts (MoE) architectures face diminishing returns and system-level bottlenecks (communication overhead, memory bandwidth) as expert counts increase, eventually hitting an efficiency saturation point.

Why it matters:

Continued scaling of LLMs to trillions of parameters requires maintaining modest inference latency, which standard MoE scaling struggles to sustain due to routing overheads.
Existing methods overlook the embedding layer as a sparse scaling dimension, despite its O(1) lookup complexity and potential for parameter expansion without computation explosion.

Concrete Example: In a standard MoE model, adding more experts increases communication costs during distributed training. The paper shows that at high sparsity ratios (e.g., total-to-active parameter ratios above 20), adding more experts yields smaller loss reductions than allocating the same parameters to N-gram embeddings.

Key Novelty

Embedding Scaling as an Orthogonal Dimension to MoE

Allocates a massive portion of the parameter budget (>30B) to N-gram embeddings rather than Feed-Forward Network experts, utilizing a hash-based lookup that densifies information per token.
Identifies specific regimes (high sparsity, wide models) where embedding scaling achieves a better Pareto frontier than expert scaling.
Introduces 'Embedding Amplification' (scaling factors or LayerNorm) to prevent the massive embedding signal from being drowned out by attention outputs in deep networks.

Architecture

The structure of the N-gram Embedding layer.

Evaluation Highlights

LongCat-Flash-Lite (68.5B total params, ~3B activated) surpasses a parameter-equivalent MoE baseline on both training and validation losses.
Embedding scaling consistently outperforms expert scaling in wide models (1.3B activated parameters) even at sparsity ratios as high as 50:1.
Application of Embedding Amplification reduces training and validation loss by 0.02 consistently compared to vanilla initialization.

Breakthrough Assessment

8/10

Offers a distinct, validated alternative to the dominant MoE scaling paradigm. By showing empirically that embedding scaling is more efficient for wide, high-sparsity models, it opens a new avenue for efficient LLM design.

⚙️ Technical Details

Problem Definition

Setting: Pre-training Large Language Models (LLMs) under fixed activation parameter budgets while scaling total parameters.

Inputs: Token sequence t_i

Outputs: Augmented embedding vector e_i combining base and n-gram representations

Pipeline Flow

Input Tokenization
N-gram Embedding Lookup (Parallel Base + N-gram)
Transformer Layers (Attention + FFN/MoE)
Output Generation

System Modules

N-gram Embedding Layer

Generates dense vector representations by summing a base embedding and multiple hashed n-gram embeddings.

Model or implementation: Lookup tables with Polynomial Rolling Hash
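
The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the hash bases, table sizes, n-gram orders, and the choice of one hash base per sub-table are assumptions, and the names `poly_hash` and `NGramEmbedding` are hypothetical.

```python
import numpy as np

def poly_hash(ngram, base, modulus):
    """Polynomial rolling hash of a tuple of token ids, reduced modulo the table size."""
    h = 0
    for tok in ngram:
        # +1 so that token id 0 still contributes to the hash (an assumption of this sketch)
        h = (h * base + int(tok) + 1) % modulus
    return h

class NGramEmbedding:
    """Base embedding plus the sum of hashed n-gram embeddings (illustrative only)."""

    def __init__(self, vocab_size, dim, table_size,
                 orders=(2, 3), bases=(1000003, 998244353), seed=0):
        rng = np.random.default_rng(seed)
        self.base_table = rng.normal(0.0, 0.02, size=(vocab_size, dim))
        self.orders = orders
        self.bases = bases  # one hash base per sub-table, so K = len(bases)
        self.table_size = table_size
        self.tables = {
            (n, k): rng.normal(0.0, 0.02, size=(table_size, dim))
            for n in orders for k in range(len(bases))
        }

    def __call__(self, token_ids):
        out = self.base_table[np.asarray(token_ids)].copy()
        for i in range(len(token_ids)):
            for n in self.orders:
                if i + 1 < n:
                    continue  # not enough left context for an n-gram ending at i
                ngram = tuple(token_ids[i - n + 1 : i + 1])
                for k, base in enumerate(self.bases):
                    idx = poly_hash(ngram, base, self.table_size)
                    out[i] += self.tables[(n, k)][idx]
        return out

emb = NGramEmbedding(vocab_size=50, dim=8, table_size=101)
vectors = emb([1, 2, 3, 4])  # shape (4, 8); each row depends only on the current and preceding tokens
```

Each position costs a fixed number of table reads regardless of table size, which is the O(1) lookup property the summary refers to.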

N-gram Cache

Caches n-gram lookup results to reduce I/O overhead during inference.

Model or implementation: Custom cache implementation
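
The summary does not describe the cache design, so the following is only one plausible shape for it: a small LRU cache keyed on the n-gram, wrapping an arbitrary lookup function. The class name `NGramCache`, the eviction policy, and the `lookup_fn` parameter are all assumptions made for illustration.

```python
from collections import OrderedDict

class NGramCache:
    """LRU cache mapping an n-gram (tuple of token ids) to its looked-up value."""

    def __init__(self, capacity, lookup_fn):
        self.capacity = capacity
        self.lookup_fn = lookup_fn  # the expensive table read being cached
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, ngram):
        key = tuple(ngram)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self.lookup_fn(key)
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
        return value

# Usage with a stand-in lookup function (sum of ids instead of a real table read).
cache = NGramCache(capacity=2, lookup_fn=lambda key: sum(key))
cache.get((1, 2))
cache.get((1, 2))  # served from the cache
```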

Transformer Blocks

Standard Transformer layers (attention plus FFN/MoE) that process the augmented embeddings.

Model or implementation: LongCat-Flash architecture

Novel Architectural Elements

Integration of massive N-gram embedding tables (>30B parameters) as a primary scaling mechanism replacing expert expansion.
Use of 'Embedding Amplification' (scaling factor or LayerNorm) specifically to balance signal norms between embeddings and attention layers.
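
The two Embedding Amplification variants named above might look like this in isolation. This is a sketch under assumptions: D is taken to be the embedding width, the LayerNorm variant is shown without learnable gain or bias, and the function names are hypothetical.

```python
import numpy as np

def amplify_scale(emb):
    """Variant 1: multiply the embedding output by sqrt(D), with D the last-axis width."""
    d = emb.shape[-1]
    return emb * np.sqrt(d)

def amplify_layernorm(emb, eps=1e-6):
    """Variant 2: normalize each vector to zero mean and (near) unit variance."""
    mean = emb.mean(axis=-1, keepdims=True)
    var = emb.var(axis=-1, keepdims=True)
    return (emb - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.02, size=(3, 64))  # small-norm embeddings, as after a typical init
scaled = amplify_scale(x)
normed = amplify_layernorm(x)
```

Either variant raises the L2 norm of small-initialized embeddings so that they are less likely to be dominated by attention outputs in the residual stream.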

Modeling

Base Model: LongCat-Flash-Lite

Training Method: Pre-training from scratch

Objective Functions:

Purpose: Minimize prediction error.

Formally: Standard Cross-Entropy Loss.
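
For reference, the standard next-token cross-entropy objective is usually written as below. The notation is ours rather than the paper's: T is the sequence length, θ the model parameters, and t_i the tokens from the problem definition.

```latex
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{i=1}^{T} \log p_\theta\left(t_i \mid t_1, \dots, t_{i-1}\right)
```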

Trainable Parameters: 68.5B total parameters (with ~3B activated parameters)
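
As a quick arithmetic check, these counts imply a total-to-active ratio a little above 20. The function name is hypothetical, and because the activated count is only given as roughly 3B, the result is approximate.

```python
def sparsity_ratio(total_params, active_params):
    """Ratio of total parameters to parameters activated per token."""
    return total_params / active_params

ratio = sparsity_ratio(68.5e9, 3e9)
print(f"{ratio:.1f}")  # prints 22.8
```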

Training Data:

Pre-trained on a corpus of 300B tokens

Key Hyperparameters:

n_gram_order_N: 3 to 5 (empirically optimal)
sub_tables_K: >= 2
vocabulary_size_factor: Avoid integer multiples of base vocabulary to minimize collisions
embedding_scaling_factor: sqrt(D) (part of Embedding Amplification)
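
The hyperparameter guidance above could be captured in a small config object that flags settings outside the recommended ranges. This is a hypothetical sketch: the class, field names, and default values are assumptions, and the summary does not say whether N denotes a single order or a maximum order.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class NGramEmbeddingConfig:
    base_vocab_size: int
    ngram_table_size: int
    ngram_order: int = 4     # n_gram_order_N (reported optimum: 3 to 5)
    num_sub_tables: int = 2  # sub_tables_K (reported guidance: >= 2)
    hidden_dim: int = 1024   # D, used for the sqrt(D) scaling factor

    @property
    def embedding_scale(self) -> float:
        return math.sqrt(self.hidden_dim)

    def check(self) -> list[str]:
        """Return warnings for settings outside the ranges reported in the summary."""
        issues = []
        if not 3 <= self.ngram_order <= 5:
            issues.append("ngram_order outside the reported 3 to 5 range")
        if self.num_sub_tables < 2:
            issues.append("num_sub_tables below 2")
        if self.ngram_table_size % self.base_vocab_size == 0:
            issues.append("ngram_table_size is an integer multiple of base_vocab_size")
        return issues

cfg = NGramEmbeddingConfig(base_vocab_size=1000, ngram_table_size=3001)
problems = cfg.check()  # empty list for this configuration
```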

Compute: Not reported in the paper

Comparison to Prior Work

vs. MoE: LongCat-Flash-Lite scales embeddings instead of FFN experts, showing better efficiency at high sparsity/width.
vs. Engram: Engram identifies the U-shaped curve; this work trains a full-scale 68.5B model and introduces Embedding Amplification to address signal drowning. (The paper cites Engram as concurrent work rather than as a comparison baseline.)
vs. PLE: This work focuses on vocabulary expansion via n-grams rather than layer-wise structural expansion.

Limitations

Diminishing returns in very deep models (>20 layers) where residual connections dilute embedding signals.
Performance degrades if the ratio of embedding parameters to total parameters becomes excessive (U-shaped curve).
Requires careful vocabulary size selection to avoid hash collision spikes (avoiding integer multiples of base vocab).
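
One way to compare candidate table sizes before training is to measure the collision rate on a sample of n-grams. The sketch below assumes a simple polynomial hash with an arbitrary base; it does not reproduce the paper's collision analysis, and `collision_rate` is a hypothetical helper.

```python
from collections import Counter

def poly_hash(ngram, base, modulus):
    """Polynomial rolling hash of a tuple of token ids, reduced modulo the table size."""
    h = 0
    for tok in ngram:
        h = (h * base + int(tok) + 1) % modulus
    return h

def collision_rate(ngrams, base, table_size):
    """Fraction of distinct n-grams that share a table index with at least one other n-gram."""
    distinct = set(map(tuple, ngrams))
    if not distinct:
        return 0.0
    buckets = Counter(poly_hash(g, base, table_size) for g in distinct)
    collided = sum(count for count in buckets.values() if count > 1)
    return collided / len(distinct)

# Toy sample: every bigram over a 20-token vocabulary, checked against two table sizes.
sample = [(a, b) for a in range(20) for b in range(20)]
rates = {size: collision_rate(sample, base=31, table_size=size) for size in (400, 401)}
```

Running such a check over a held-out corpus sample for each candidate size would show whether a particular size produces the kind of spike the paper warns about.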

Reproducibility

Code: https://huggingface.co/meituan-longcat/LongCat-Flash-Lite

The LongCat-Flash-Lite model is open-sourced on Hugging Face. The paper details specific architectural choices (hash functions, collision analysis, N/K hyperparameters) but links only to the model weights, not to a training code repository.

📊 Experiments & Results

Evaluation Setup

Pre-training scaling laws analysis and validation on distinct datasets.

Benchmarks:

Chinese Validation Set (Language Modeling Loss) [New]
English Validation Set (Language Modeling Loss) [New]

Metrics:

Training Loss
Validation Loss
Statistical methodology: Not explicitly reported in the paper

Key Results

Scaling experiments comparing N-gram Embedding strategies against parameter-equivalent MoE baselines across different activation budgets.

Benchmark	Metric	Baseline	This Paper	Δ
Validation Loss (English)	Loss	Not explicitly reported in the paper	Not explicitly reported in the paper	Not explicitly reported in the paper
Validation Loss	Loss	Not explicitly reported in the paper	Not explicitly reported in the paper	-0.02

Experiment Figures

Scaling curves comparing MoE baseline vs. N-gram Embedding at different base parameter ratios.

Analysis of vocabulary hit rates and hash collisions.

L2 norms of attention outputs vs. embedding outputs (identity branch) across layers.

Main Takeaways

Embedding scaling beats MoE scaling at high sparsity levels (high total-to-active parameter ratios).
Wider models (larger hidden state size) extend the regime where embedding scaling is superior; deeper models (>20 layers) shrink it.
Optimal N-gram configuration is N=3 to 5 and K>=2 sub-tables to balance context capture and hash collisions.
N-gram vocabulary sizes that are integer multiples of the base vocabulary size cause collision spikes and must be avoided.

📚 Prerequisite Knowledge

Prerequisites

Mixture-of-Experts (MoE) architecture
Scaling Laws for LLMs
Hash functions and collisions
Residual connections in Transformers

Key Terms

N-gram Embedding: A method to augment token representations by looking up embeddings for multi-token sequences (n-grams) using hashing, without a fixed vocabulary.

MoE: Mixture-of-Experts, a model architecture that activates only a subset of network parameters (experts) for each input, decoupling total capacity from compute cost.

Hash Collision: When different n-grams map to the same index in the embedding table due to the modulo operation, causing semantic ambiguity.

Sparsity Level: The ratio of total parameters to activated parameters during inference.

Pareto Frontier: The set of optimal trade-offs; here, the best possible loss achievable for a given computational cost or parameter budget.
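
To make the definition concrete, a frontier can be extracted from a set of (cost, loss) measurements as sketched below. The data points are made up for illustration and are not results from the paper.

```python
def pareto_frontier(points):
    """Return the (cost, loss) points not dominated by any other point (lower is better on both)."""
    frontier = []
    best_loss = float("inf")
    for cost, loss in sorted(points):  # ascending cost, then ascending loss
        if loss < best_loss:           # strictly better than every cheaper-or-equal point
            frontier.append((cost, loss))
            best_loss = loss
    return frontier

# Hypothetical (cost, loss) measurements.
runs = [(1, 3.0), (2, 2.5), (2, 2.8), (3, 2.6), (4, 2.0)]
front = pareto_frontier(runs)
```

A scaling strategy has a better frontier than another when, at matched cost, its frontier sits at lower loss.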

Speculative Decoding: An inference technique where a smaller model drafts tokens that are verified by a larger model, speeding up generation.

Embedding Amplification: Techniques (scaling factors or LayerNorm) applied to embedding outputs to ensure their signal strength is comparable to attention outputs in the residual stream.

Polynomial Rolling Hash: A specific hash function used to map n-grams to indices efficiently.