Theoretical Expressivity of Transformers
Formal Language Theory for Neural Networks
Transformers using either hard or sparse attention can exactly represent any n-gram language model, with trade-offs between the number of heads, layers, and embedding dimensionality.
Core Problem
Most theoretical analyses of transformers focus on binary language acceptance (classifiers), which mismatches the definition of language models as probability distributions over strings.
Why it matters:
Analyzing LMs as recognizers introduces a category error, ignoring the probabilities essential to language modeling
Understanding probabilistic capacity is crucial for interpreting how LMs implement formal computation models
Existing results on binary recognition do not directly translate to the autoregressive probability generation used in modern LMs
Concrete Example: A standard analysis might ask whether a transformer can classify a string as a valid or invalid Dyck-k string. An LM, however, assigns a probability to every token sequence. This paper asks whether a transformer can exactly match the conditional distribution P(w_t | w_{t-n+1}...w_{t-1}) of a classical n-gram model.
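To make the target of the simulation concrete, here is a minimal sketch of what an n-gram LM's conditional distribution looks like. The alphabet, table entries, and probabilities are invented for illustration; only the structure (the next-symbol distribution depends on exactly the last n-1 symbols) comes from the paper's setting.

```python
# Hypothetical toy 3-gram LM over alphabet {a, b}: conditionals
# p(y_t | y_{t-2}, y_{t-1}) stored in a lookup table. All numbers
# here are illustrative, not from the paper.
TABLE = {
    ("a", "a"): {"a": 0.7, "b": 0.3},
    ("a", "b"): {"a": 0.4, "b": 0.6},
    ("b", "a"): {"a": 0.5, "b": 0.5},
    ("b", "b"): {"a": 0.2, "b": 0.8},
}

def ngram_conditional(context):
    """Return p(y_t | y_{<t}); it depends only on the last n-1 = 2 symbols."""
    key = tuple(context[-2:])
    return TABLE[key]

# A full-string probability is the product of such conditionals.
dist = ngram_conditional(["b", "a", "a", "b"])  # only ("a", "b") matters
```

A transformer weakly equivalent to this LM must reproduce every such conditional exactly, not merely accept or reject strings.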
Key Novelty
Exact Probabilistic Simulation of n-gram LMs by Transformers
Constructs explicit weights showing how a transformer can copy previous symbols to the current position to form a history representation
Demonstrates that multiple mechanisms exist for this simulation: using n-1 attention heads in parallel OR n-1 layers in sequence
Extends the proof from idealized hard attention to differentiable sparse attention (sparsemax), bridging theory and practice
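Sparsemax, the differentiable attention function used in the extension, is the Euclidean projection of the score vector onto the probability simplex. A minimal NumPy implementation (standard algorithm, not code from the paper):

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of scores z onto the probability
    simplex. Unlike softmax, it can assign exactly zero weight to
    low-scoring positions while remaining differentiable a.e."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]          # scores in descending order
    cssv = np.cumsum(z_sorted)           # cumulative sums
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cssv    # positions kept in the support
    k_max = k[support][-1]
    tau = (cssv[k_max - 1] - 1) / k_max  # threshold
    return np.maximum(z - tau, 0.0)

# With a large enough score gap, sparsemax is exactly one-hot — the
# behaviour the sparse-attention construction relies on.
print(sparsemax([3.0, 0.0, 0.0]))  # -> [1. 0. 0.]
```

This is why sparsemax can implement the same discrete "copy one position" logic as hard attention while keeping the model trainable by gradient descent.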
Architecture
A schematic of a transformer LM simulating a 4-gram LM using 3 attention heads in a single layer
Evaluation Highlights
Theorem 3.1: A single-layer hard attention transformer with n-1 heads can exactly represent any n-gram LM
Theorem 3.2: A single-head hard attention transformer with n-1 layers can exactly represent any n-gram LM
Theorem 4.1: A single-layer sparse attention transformer with n-1 heads can exactly represent any n-gram LM
Breakthrough Assessment
7/10
Solid theoretical foundation establishing a clear lower bound on transformer expressivity. While n-grams are simple, shifting the analysis from binary recognition to probabilistic modeling is a significant methodological improvement.
⚙️ Technical Details
Problem Definition
Setting: Probabilistic Language Modeling over an alphabet Σ
Inputs: Context string y_{<t} (prefix)
Outputs: Conditional probability distribution p(y_t | y_{<t}) over the next symbol
Pipeline Flow
Input Embedding (Symbol + Position)
Transformer Layer(s) (Attention + MLP)
Output Projection (Logits)
Softmax Normalization
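The pipeline above can be sketched as a toy forward pass. Shapes, the random weights, and the identity "layer" are illustrative stand-ins, not the paper's analytic construction:

```python
import numpy as np

V, D = 5, 8                        # vocabulary size, embedding width
rng = np.random.default_rng(0)
E = rng.normal(size=(V, D))        # symbol embedding table
W_out = rng.normal(size=(D, V))    # output projection to logits

def positional_encoding(T, D):
    # any deterministic, position-distinguishing function works here
    pos = np.arange(T)[:, None]
    dims = np.arange(D)[None, :]
    return np.sin(pos / (10.0 ** (dims / D)))

def forward(token_ids, layer=lambda X: X):
    X = E[token_ids] + positional_encoding(len(token_ids), D)  # embed
    H = layer(X)                   # transformer layer(s) (stand-in)
    logits = H[-1] @ W_out         # last position predicts next symbol
    z = logits - logits.max()      # numerically stable softmax
    return np.exp(z) / np.exp(z).sum()

p = forward([0, 1, 2])             # a distribution over V next symbols
```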
System Modules
Positional Encoding
Encodes position t and symbols to allow attention to target relative positions (e.g., t-1, t-2)
Model or implementation: Deterministic function (rational or integer based)
Attention Mechanism
Retrieves symbols from the history (previous n-1 positions)
Model or implementation: Hard or Sparse Attention
MLP / Final Projection
Maps the retrieved history representation to the correct conditional probability distribution (logits)
Model or implementation: Feed-forward network or linear projection
Novel Architectural Elements
Constructive proof design where attention heads are explicitly assigned to 'look back' at specific offsets (t-1, t-2, etc.)
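The look-back idea can be sketched as follows. This is illustrative, not the paper's exact weights: for a 3-gram LM, two hard-attention heads each place all attention weight on one fixed offset (t-1 and t-2) and copy that symbol's one-hot embedding; concatenating the heads' outputs yields the history representation.

```python
import numpy as np

ALPHABET = ["a", "b"]

def one_hot(sym):
    return np.eye(len(ALPHABET))[ALPHABET.index(sym)]

def hard_attention_history(tokens, t, n=3):
    """Head k's attention weights are one-hot on position t-k, so its
    output is exactly the symbol embedding at that offset."""
    heads = [one_hot(tokens[t - k]) for k in range(1, n)]  # offsets 1..n-1
    return np.concatenate(heads)

h = hard_attention_history(["b", "a", "a", "b"], t=3)
# h jointly encodes (y_{t-1}, y_{t-2}) = ("a", "a")
```

Downstream, an MLP (or table lookup) maps this concatenated history to the n-gram logits, which is how the construction realizes the exact conditional distribution.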
Modeling
Base Model: Transformer (Decoder-only)
Training Method: Theoretical construction (weights are set analytically, not trained)
Compute: Space complexity for context representation scales as O(n|Σ|). Representation of entries requires O(log |y|) bits.
Comparison to Prior Work
vs. Hahn (2020): Focuses on probabilistic equivalence (weak equivalence) rather than binary string acceptance
vs. Yao et al. (2021): Proves exact simulation of probabilistic n-gram models rather than bounded hierarchical languages
vs. Liu et al. (2023): Achieves simulation with fixed-depth transformers (independent of string length), whereas Liu et al. required depth scaling with input length for full FSA simulation
Limitations
Results are primarily for hard and sparse attention; soft attention (standard softmax) is discussed as an approximation but not proven for exact simulation
n-gram LMs are a very simple class of models; results establish a lower bound but do not characterize the full capacity of transformers
The constructions rely on specific positional encodings and may not reflect how trained transformers naturally organize representations
The paper provides theoretical proofs. A GitHub repository is linked for reference implementations/validations of the constructions.
📊 Experiments & Results
Evaluation Setup
Theoretical Proofs (Constructive)
Metrics:
Exact representation (Weak Equivalence)
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
Transformers are at least as expressive as n-gram Language Models in the probabilistic setting
Trade-off identified: heads can be traded for layers — simulation is possible with one layer and n-1 heads, or with n-1 layers and a single head
Even a single-head, single-layer transformer can simulate an n-gram LM if allowed complex non-linear transformations and positional encodings scaling linearly with string length (Theorem 3.3)
Sparse attention (sparsemax) is sufficient for this simulation, suggesting differentiability does not hinder the ability to implement discrete n-gram logic
Glossary
n-gram LM: A language model where the probability of the next token depends only on the previous n-1 tokens
hard attention: Attention mechanism where the weights are a one-hot vector (hardmax), selecting exactly one position to attend to
sparse attention: Attention mechanism using sparsemax, allowing weights to be zero for irrelevant items but differentiable
weak equivalence: Two LMs are weakly equivalent if they assign the exact same probability to every string in Σ*
positional encoding: Vectors added to token embeddings to provide information about the token's position in the sequence
contextual representation: The vector representation of a token after being processed by transformer layers, incorporating information from the context