Theoretical Expressivity of Transformers
Formal Language Theory for Neural Networks
Transformers using either hard or sparse attention can exactly represent any n-gram language model, with trade-offs between the number of heads, layers, and embedding dimensionality.
Core Problem
Most theoretical analyses of transformers focus on binary language acceptance (classifiers), which mismatches the definition of language models as probability distributions over strings.
Why it matters:
Analyzing LMs as recognizers introduces a category error, ignoring the probabilities essential to language modeling
Understanding probabilistic capacity is crucial for interpreting how LMs implement formal computation models
Existing results on binary recognition do not directly translate to the autoregressive probability generation used in modern LMs
Concrete Example: A standard analysis might ask whether a transformer can classify a string as a valid or invalid Dyck-k string. An LM, however, assigns a probability to every token sequence. This paper asks whether a transformer can exactly match the conditional distribution P(w_t | w_{t-n+1}...w_{t-1}) of a classical n-gram model.
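To make the target of the simulation concrete, here is a minimal sketch of what an n-gram LM's conditional distribution looks like. The alphabet, table entries, and probabilities are invented for illustration; only the structure (the next-symbol distribution depends on exactly the last n-1 symbols) comes from the paper's setting.

```python
# Hypothetical toy 3-gram LM over alphabet {a, b}: conditionals
# p(y_t | y_{t-2}, y_{t-1}) stored in a lookup table. All numbers
# here are illustrative, not from the paper.
TABLE = {
    ("a", "a"): {"a": 0.7, "b": 0.3},
    ("a", "b"): {"a": 0.4, "b": 0.6},
    ("b", "a"): {"a": 0.5, "b": 0.5},
    ("b", "b"): {"a": 0.2, "b": 0.8},
}

def ngram_conditional(context):
    """Return p(y_t | y_{<t}); it depends only on the last n-1 = 2 symbols."""
    key = tuple(context[-2:])
    return TABLE[key]

# A full-string probability is the product of such conditionals.
dist = ngram_conditional(["b", "a", "a", "b"])  # only ("a", "b") matters
```

A transformer weakly equivalent to this LM must reproduce every such conditional exactly, not merely accept or reject strings.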
Key Novelty
Exact Probabilistic Simulation of n-gram LMs by Transformers
Constructs explicit weights showing how a transformer can copy previous symbols to the current position to form a history representation
Demonstrates that multiple mechanisms exist for this simulation: using n-1 attention heads in parallel OR n-1 layers in sequence
Extends the proof from idealized hard attention to differentiable sparse attention (sparsemax), bridging theory and practice
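Sparsemax, the differentiable attention function used in the extension, is the Euclidean projection of the score vector onto the probability simplex. A minimal NumPy implementation (standard algorithm, not code from the paper):

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of scores z onto the probability
    simplex. Unlike softmax, it can assign exactly zero weight to
    low-scoring positions while remaining differentiable a.e."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]          # scores in descending order
    cssv = np.cumsum(z_sorted)           # cumulative sums
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cssv    # positions kept in the support
    k_max = k[support][-1]
    tau = (cssv[k_max - 1] - 1) / k_max  # threshold
    return np.maximum(z - tau, 0.0)

# With a large enough score gap, sparsemax is exactly one-hot — the
# behaviour the sparse-attention construction relies on.
print(sparsemax([3.0, 0.0, 0.0]))  # -> [1. 0. 0.]
```

This is why sparsemax can implement the same discrete "copy one position" logic as hard attention while keeping the model trainable by gradient descent.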
Architecture
A schematic of a transformer LM simulating a 4-gram LM using 3 attention heads in a single layer
Evaluation Highlights
Theorem 3.1: A single-layer hard attention transformer with n-1 heads can exactly represent any n-gram LM
Theorem 3.2: A single-head hard attention transformer with n-1 layers can exactly represent any n-gram LM
Theorem 4.1: A single-layer sparse attention transformer with n-1 heads can exactly represent any n-gram LM
Breakthrough Assessment
7/10
Solid theoretical foundation establishing a clear lower bound on transformer expressivity. While n-grams are simple, shifting the analysis from binary recognition to probabilistic modeling is a significant methodological improvement.
⚙️ Technical Details
Problem Definition
Setting: Probabilistic Language Modeling over an alphabet Σ
Inputs: Context string y_{<t} (prefix)
Outputs: Conditional probability distribution p(y_t | y_{<t}) over the next symbol
Pipeline Flow
Input Embedding (Symbol + Position)
Transformer Layer(s) (Attention + MLP)
Output Projection (Logits)
Softmax Normalization
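The pipeline above can be sketched as a toy forward pass. Shapes, the random weights, and the identity "layer" are illustrative stand-ins, not the paper's analytic construction:

```python
import numpy as np

V, D = 5, 8                        # vocabulary size, embedding width
rng = np.random.default_rng(0)
E = rng.normal(size=(V, D))        # symbol embedding table
W_out = rng.normal(size=(D, V))    # output projection to logits

def positional_encoding(T, D):
    # any deterministic, position-distinguishing function works here
    pos = np.arange(T)[:, None]
    dims = np.arange(D)[None, :]
    return np.sin(pos / (10.0 ** (dims / D)))

def forward(token_ids, layer=lambda X: X):
    X = E[token_ids] + positional_encoding(len(token_ids), D)  # embed
    H = layer(X)                   # transformer layer(s) (stand-in)
    logits = H[-1] @ W_out         # last position predicts next symbol
    z = logits - logits.max()      # numerically stable softmax
    return np.exp(z) / np.exp(z).sum()

p = forward([0, 1, 2])             # a distribution over V next symbols
```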
System Modules
Positional Encoding
Encodes position t and symbols to allow attention to target relative positions (e.g., t-1, t-2)
Model or implementation: Deterministic function (rational or integer based)
Attention Mechanism
Retrieves symbols from the history (previous n-1 positions)
Model or implementation: Hard or Sparse Attention
MLP / Final Projection
Maps the retrieved history representation to the correct conditional probability distribution (logits)
Model or implementation: Feed-forward network or linear projection
Novel Architectural Elements
Constructive proof design where attention heads are explicitly assigned to 'look back' at specific offsets (t-1, t-2, etc.)
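The look-back idea can be sketched as follows. This is illustrative, not the paper's exact weights: for a 3-gram LM, two hard-attention heads each place all attention weight on one fixed offset (t-1 and t-2) and copy that symbol's one-hot embedding; concatenating the heads' outputs yields the history representation.

```python
import numpy as np

ALPHABET = ["a", "b"]

def one_hot(sym):
    return np.eye(len(ALPHABET))[ALPHABET.index(sym)]

def hard_attention_history(tokens, t, n=3):
    """Head k's attention weights are one-hot on position t-k, so its
    output is exactly the symbol embedding at that offset."""
    heads = [one_hot(tokens[t - k]) for k in range(1, n)]  # offsets 1..n-1
    return np.concatenate(heads)

h = hard_attention_history(["b", "a", "a", "b"], t=3)
# h jointly encodes (y_{t-1}, y_{t-2}) = ("a", "a")
```

Downstream, an MLP (or table lookup) maps this concatenated history to the n-gram logits, which is how the construction realizes the exact conditional distribution.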
Modeling
Base Model: Transformer (Decoder-only)
Training Method: Theoretical construction (weights are set analytically, not trained)
Compute: Space complexity for context representation scales as O(n|Σ|). Representation of entries requires O(log |y|) bits.
Comparison to Prior Work
vs. Hahn (2020): Focuses on probabilistic equivalence (weak equivalence) rather than binary string acceptance
vs. Yao et al. (2021): Proves exact simulation of probabilistic n-gram models rather than bounded hierarchical languages
vs. Liu et al. (2023): Achieves simulation with fixed-depth transformers (independent of string length), whereas Liu et al. required depth scaling with input length for full FSA simulation
Limitations
Results are primarily for hard and sparse attention; soft attention (standard softmax) is discussed as an approximation but not proven for exact simulation
n-gram LMs are a very simple class of models; results establish a lower bound but do not characterize the full capacity of transformers
The constructions rely on specific positional encodings and may not reflect how trained transformers naturally organize representations
The paper provides theoretical proofs. A GitHub repository is linked for reference implementations/validations of the constructions.
📊 Experiments & Results
Evaluation Setup
Theoretical Proofs (Constructive)
Metrics:
Exact representation (Weak Equivalence)
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
Transformers are at least as expressive as n-gram Language Models in the probabilistic setting
Trade-off identified: heads can be traded for layers — simulation is possible with one layer and n-1 heads, or with n-1 layers and a single head
Even a single-head, single-layer transformer can simulate an n-gram LM if allowed complex non-linear transformations and positional encodings scaling linearly with string length (Theorem 3.3)
Sparse attention (sparsemax) is sufficient for this simulation, suggesting differentiability does not hinder the ability to implement discrete n-gram logic
Glossary
n-gram LM: A language model where the probability of the next token depends only on the previous n-1 tokens
hard attention: Attention mechanism where the weights are a one-hot vector (hardmax), selecting exactly one position to attend to
sparse attention: Attention mechanism using sparsemax, allowing weights to be zero for irrelevant items but differentiable
weak equivalence: Two LMs are weakly equivalent if they assign the exact same probability to every string in Σ*
positional encoding: Vectors added to token embeddings to provide information about the token's position in the sequence
contextual representation: The vector representation of a token after being processed by transformer layers, incorporating information from the context