
Characterizing the Expressivity of Fixed-Precision Transformer Language Models

Unknown authors · ETH Zurich · OpenReview · Reasoning Benchmark

📝 Paper Summary

Theoretical expressivity of neural networks · Formal language theory
Fixed-precision transformers with soft attention and no positional encodings are exactly as expressive as first-order logic with two variables and past operators, corresponding to the subclass of star-free languages recognized by partially ordered DFAs.
Core Problem
The theoretical expressive power of transformers is often analyzed under unrealistic idealizations (arbitrary precision, unique hard attention) that overestimate their capabilities compared to practical fixed-precision implementations.
Why it matters:
  • Current theory suggests transformers can recognize complex languages (e.g., Dyck languages) that practical models fail to learn, creating a gap between theory and practice
  • Understanding exact limits is crucial for knowing which tasks transformers fundamentally cannot solve regardless of data size or training time
Concrete Example: Theoretical models with arbitrary precision can recognize Dyck languages (balanced parentheses). In contrast, the paper shows that a fixed-precision transformer fundamentally cannot recognize 'unambiguous polynomials' like RDP-1, because it cannot distinguish automaton states that are not partially ordered.
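To make the automata-theoretic class concrete, here is a minimal illustrative sketch (not taken from the paper) of a partially ordered DFA: a DFA whose states can be ordered so that no transition ever returns to an earlier state. The example language, "strings over {a, b} containing at least one 'a'", is a simple star-free language chosen for illustration.

```python
# Illustrative PODFA sketch: states 0 <= 1, and every transition moves to
# an equal-or-later state, so once state 0 is left it is never revisited.
# The language and state names here are hypothetical examples, not the
# paper's RDP/LDP benchmark languages.

TRANSITIONS = {
    (0, 'a'): 1, (0, 'b'): 0,  # state 0: still waiting to see an 'a'
    (1, 'a'): 1, (1, 'b'): 1,  # state 1: an 'a' has been seen (absorbing)
}
ACCEPTING = {1}


def accepts(word):
    """Run the PODFA over the word and report acceptance."""
    state = 0
    for ch in word:
        state = TRANSITIONS[(state, ch)]
    return state in ACCEPTING


print(accepts("bba"))  # True: contains an 'a'
print(accepts("bbb"))  # False: no 'a' seen
```

A language like RDP-1 falls outside this class precisely because its minimal DFA needs transitions that cycle back to earlier states, which no such ordering can accommodate.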
Key Novelty
Exact logical characterization of fixed-precision soft-attention transformers
  • Establishes an equivalence between these transformers and LTL[P] (Linear Temporal Logic with only past operators)
  • Connects this logic to algebraic and automata-theoretic classes: R-trivial monoids and Partially Ordered DFAs (PODFAs)
  • Proves that soft attention under fixed precision has a bounded attention span, limiting it to 'local' decisions similar to moving a window over the input
Evaluation Highlights
  • Transformers achieve 100% accuracy on all languages within the LTL[P] class (e.g., LDP-1, LDP-2)
  • Transformers consistently fail (max 90.0% accuracy, mean 71.1%) on RDP-1, a language just outside the characterized class, despite it being star-free
  • Empirical probing confirms transformers learn linearly separable states for LTL[P] languages but fail to separate states for non-LTL[P] languages
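The probing result above rests on a standard technique: train a linear classifier on frozen hidden states and treat high accuracy as evidence of linear separability. The toy sketch below (synthetic data, a plain perceptron; none of it is the paper's actual setup) shows the idea.

```python
import numpy as np

# Toy linear probe (illustrative only): fake "hidden states" for two
# automaton states are drawn from two well-separated Gaussian clusters,
# then a perceptron is fit as the probe. High accuracy indicates the
# representations are (approximately) linearly separable.

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-1.0, 0.3, (50, 8)),  # hidden states for automaton state 0
    rng.normal(1.0, 0.3, (50, 8)),   # hidden states for automaton state 1
])
y = np.array([0] * 50 + [1] * 50)

# Perceptron training loop: update weights only on misclassified points.
w = np.zeros(8)
b = 0.0
for _ in range(20):
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        if pred != yi:
            step = 1 if yi == 1 else -1
            w += step * xi
            b += step

acc = np.mean([(1 if xi @ w + b > 0 else 0) == yi for xi, yi in zip(X, y)])
print(f"probe accuracy: {acc:.2f}")
```

For a non-LTL[P] language, the analogous probe would fail to separate the states, which is exactly the failure mode the paper reports.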
Breakthrough Assessment
9/10
Provides the first exact characterization of transformers under the realistic 'fixed-precision soft-attention' assumption, closing a major gap between theoretical upper bounds and empirical reality.