Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking

📝 Paper Summary

Mechanistic Interpretability, Chain-of-Thought (CoT) Reasoning

Mechanistic analysis reveals that Transformers with Chain-of-Thought learn robust state tracking algorithms by implementing Finite State Automata via late-layer MLP neurons, distinguishing them from standard Transformers and implicit CoT models.

Core Problem

Standard Transformers and state-space models fail to learn state tracking for complex groups (like A5) and fail to generalize to arbitrary sequence lengths, often learning statistical shortcuts rather than true world models.

Why it matters:

Theoretical expressiveness does not guarantee practical learnability; models often fail to converge on tasks they can theoretically represent
Understanding if CoT actually recovers a world model (FSA) or just learns shortcuts is crucial for trusting generative models in complex reasoning
State tracking is foundational for downstream tasks like entity tracking, navigation, and mathematical reasoning

Concrete Example: In the parity problem (Z2 group), a standard Transformer might try to count the number of 1s in a sequence to determine evenness. This shortcut fails when the sequence length exceeds the training distribution. A true FSA approach tracks the current state (Even/Odd) step-by-step, which standard Transformers fail to learn efficiently without CoT.
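
The contrast in the example above can be sketched in a few lines of Python. This is an illustration only, not code from the paper; the function names are made up for this sketch. Both functions compute the same answer, but the automaton keeps a single two-valued state and applies the same update at every step, while the counting version depends on a quantity that grows with sequence length.

```python
def parity_states(bits):
    """Run the two-state parity automaton; return the state after each input."""
    state = 0  # 0 = Even, 1 = Odd
    trace = []
    for b in bits:
        state ^= b  # transition rule: flip the state on 1, keep it on 0
        trace.append(state)
    return trace


def parity_by_counting(bits):
    """Shortcut-style computation: count the 1s, then reduce modulo 2."""
    return sum(bits) % 2


bits = [1, 0, 1, 1, 0, 1]
trace = parity_states(bits)
print(trace)                     # [1, 1, 0, 1, 1, 0]
print(parity_by_counting(bits))  # 0, equal to the last automaton state
```

The per-step trace is what a Chain-of-Thought model is asked to write out, so each new state only needs the previous state and the current input.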

Key Novelty

Mechanistic Confirmation of FSA Formulation in CoT

Identifies that late-layer MLP neurons are the specific circuit responsible for representing states in the Chain-of-Thought process
Proposes two interpretability metrics, 'compression' and 'distinction', to quantify how the model groups diverse input histories into discrete state representations
Demonstrates that CoT allows the model to learn 'non-solvable' groups (A5) that standard Transformers and State Space Models cannot handle

Architecture

Conceptual illustration of the Z2 (parity) State Tracking problem and its corresponding Finite State Automaton (FSA).

Evaluation Highlights

Transformer+CoT scores nearly 100% on the 'compression' and 'distinction' metrics, strong evidence that it internally represents discrete FSA states
Transformer+CoT is the only architecture (compared to Mamba, S4, RNN, Standard Transformer) to successfully learn state tracking for the non-solvable group A5 on arbitrary lengths
Demonstrates robustness by maintaining state tracking capabilities even when intermediate steps are skipped or noise is introduced

Breakthrough Assessment

8/10

Strong mechanistic evidence linking CoT to FSA formation. Bridges the gap between theoretical expressiveness and practical learnability, offering a clear explanation for why CoT works on sequential reasoning.

⚙️ Technical Details

Problem Definition

Setting: State tracking as a word problem on a finite monoid (group) M, computing the product of a sequence of elements

Inputs: Sequence of group elements m1, m2, ..., mn

Outputs: Sequence of states (cumulative products) q1, q2, ..., qn
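
As a minimal sketch of this input/output mapping, the snippet below computes the cumulative products for Z60, where the group operation is addition modulo 60. The helper names (`track_states`, `z60_add`) and the convention q_t = q_{t-1} · m_t starting from the identity are assumptions made for illustration, not details taken from the paper.

```python
def track_states(elements, op, identity):
    """Return the cumulative products q_1, ..., q_n of a sequence of group elements."""
    states = []
    q = identity
    for m in elements:
        q = op(q, m)  # q_t = q_{t-1} * m_t (assumed convention)
        states.append(q)
    return states


def z60_add(a, b):
    """Group operation of the cyclic group Z60: addition modulo 60."""
    return (a + b) % 60


inputs = [17, 50, 3, 59]
states = track_states(inputs, z60_add, identity=0)
print(states)  # [17, 7, 10, 9]
```

The same `track_states` loop works for any finite group once `op` and `identity` are swapped out.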

Pipeline Flow

Input Embedding
Transformer Layer (Self-Attention + MLP)
Token Unembedding (Logits)

System Modules

Input Embedding

Converts sequence of group elements into vector representations

Model or implementation: Learned Embedding

MLP (Late-Layer)

Updates the residual stream to represent the next state based on the current input and previous state

Model or implementation: Feed-forward network within Transformer block

Token Unembedding

Projects the final residual state into the vocabulary to predict the next state token

Model or implementation: Linear Projection (Logit Lens)

Novel Architectural Elements

No new architecture proposed; the novelty is the mechanistic analysis of the standard Transformer+CoT architecture

Modeling

Base Model: GPT-2 architecture (custom configuration)

Training Method: Supervised Learning on synthetic group operation sequences

Objective Functions:

Purpose: Minimize prediction error for the sequence of states.

Formally: Standard Cross-Entropy Loss on the generated state sequence.
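
One way to write this objective is shown below. It assumes the model emits the state tokens autoregressively after reading the full input sequence; the exact token layout is not given in this summary, and the symbols θ and p_θ are introduced here only for illustration.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{n} \log p_\theta\left(q_t \mid m_1, \dots, m_n,\; q_1, \dots, q_{t-1}\right)
```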

Training Data:

1,000,000 sampled sequences per setting
Groups: Z60 (Cyclic), A4xZ5 (Solvable), A5 (Non-solvable)
Sequences of length n, successively increasing
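
A rough sketch of how one training example for A5 could be produced with only the Python standard library is given below. The paper's data comes from the 'Abstract Algebra' package; this stand-in, including the names `is_even`, `compose`, and `sample_example` and the composition order, is an assumption for illustration.

```python
import random
from itertools import permutations


def is_even(perm):
    """A permutation is even when its number of inversions is even."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2 == 0


def compose(p, q):
    """Return the permutation 'apply p, then q' (convention chosen for this sketch)."""
    return tuple(q[p[i]] for i in range(len(p)))


# A5: the 60 even permutations of five symbols.
A5 = [p for p in permutations(range(5)) if is_even(p)]


def sample_example(length, rng):
    """Sample `length` random A5 elements and their cumulative products."""
    inputs = [rng.choice(A5) for _ in range(length)]
    states = []
    q = tuple(range(5))  # identity permutation
    for m in inputs:
        q = compose(q, m)
        states.append(q)
    return inputs, states


rng = random.Random(0)
inputs, states = sample_example(8, rng)
print(len(A5), len(states))  # 60 8
```

In practice each permutation would be mapped to a single token id, for example its index in `A5`, before being fed to the model.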

Key Hyperparameters:

layers: 1
model_dimension: 512
max_epochs: 500
early_stopping_accuracy: 99%

Compute: Not reported in the paper

Comparison to Prior Work

vs. Mamba/S4: Transformer+CoT generalizes to arbitrary lengths on A5 (non-solvable), while Mamba/S4 fail
vs. Implicit CoT (Pause): Transformer+CoT works on longer sequences where Pause fails, showing explicit tokens help retrieval of prior state
vs. RNN/LSTM: Transformer+CoT matches their expressiveness on regular languages but retains Transformer parallelization benefits
vs. Standard Transformer: Standard Transformer fails out-of-distribution on state tracking tasks such as Z2 (parity) and A5; CoT enables perfect generalization

Limitations

Evaluation is limited to synthetic algebraic group tasks (Z60, A5, etc.) rather than natural language reasoning
Focuses on a shallow (1-layer) Transformer setup for clearer mechanistic analysis
Does not explore how these circuits emerge in very deep Large Language Models (LLMs) with billions of parameters

Reproducibility

Code: https://github.com/IvanChangPKU/FSA

Datasets are synthetically generated using the 'Abstract Algebra' Python package.

📊 Experiments & Results

Evaluation Setup

Controlled algorithmic tasks (Group Operations) to test state tracking expressiveness

Benchmarks:

Z60 Group: state tracking, modulo-60 addition [New]
A4 x Z5 Group: state tracking, solvable non-abelian group [New]
A5 Group: state tracking, non-solvable alternating group [New]

Metrics:

Sequence Accuracy (Full sequence correctness)
Compression (Internal representation similarity for same state)
Distinction (Internal representation separation for different states)
Statistical methodology: Not explicitly reported in the paper
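
The summary does not give the exact formulas for Compression and Distinction, so the following is only a guess at their general shape, not the paper's definition. It assumes each activation vector is labeled with the state it corresponds to, uses cosine similarity, and applies an arbitrary threshold of 0.9; all names and parameters are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compression_and_distinction(activations, states, threshold=0.9):
    """Pairwise scores (assumed form, not the paper's formula).

    compression: fraction of same-state pairs with similarity >= threshold
    distinction: fraction of different-state pairs with similarity < threshold
    Assumes at least one pair of each kind is present.
    """
    same_hits = same_total = diff_hits = diff_total = 0
    for (a, s), (b, t) in combinations(list(zip(activations, states)), 2):
        similar = cosine(a, b) >= threshold
        if s == t:
            same_total += 1
            same_hits += int(similar)
        else:
            diff_total += 1
            diff_hits += int(not similar)
    return same_hits / same_total, diff_hits / diff_total


# Toy activations: two noisy copies per state, for two states.
acts = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0), (0.02, 1.0)]
labels = [0, 0, 1, 1]
print(compression_and_distinction(acts, labels))  # (1.0, 1.0)
```

A model that tracks states with an FSA-like mechanism would be expected to score high on both quantities, which matches the near-100% values reported above.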

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Z2 (Parity)	Compression Metric	0.0	100.0	+100.0
Z2 (Parity)	Distinction Metric	0.0	100.0	+100.0

Experiment Figures

Performance comparison of various models (Transformer, RNN, LSTM, Mamba, S4, Transformer+CoT) across different sequence lengths and group types (Z60, A4xZ5, A5).

Main Takeaways

Transformer+CoT is the only architecture among those tested (including Mamba, S4, and Pause) that efficiently learns state tracking for arbitrary lengths across all group types (Z60, A4xZ5, A5).
Models without CoT (Standard Transformer) or with Implicit CoT (Pause) fail to generalize to longer sequences for complex groups like A5, indicating they do not learn the recursive state tracking algorithm.
Interpretability analysis indicates the model learns a 'World Model': input histories that lead to the same state are compressed into the same representation cluster in the MLP layers, while different states remain clearly separated.
The learned algorithm is robust: it relies on the immediate previous state (embedded in the CoT scratchpad) rather than the full context history, effectively simulating an FSA.
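
The last takeaway describes the defining property of an automaton: the next state depends only on the current state and the current input. The sketch below makes that concrete with an explicit transition table for Z60; the table `delta` and the helper `run` are illustrative names, not part of the paper's code.

```python
N = 60

# Explicit transition table of the automaton: (state, input) -> next state.
delta = {(q, m): (q + m) % N for q in range(N) for m in range(N)}


def run(start, inputs):
    """Apply the transition table step by step from `start`."""
    q = start
    for m in inputs:
        q = delta[(q, m)]
    return q


history_a = [10, 20, 45]  # reaches state 15
history_b = [15]          # a different history that also reaches state 15
suffix = [7, 59, 30]

# Two histories that end in the same state behave identically afterwards.
print(run(0, history_a + suffix), run(0, history_b + suffix))  # 51 51
```

Because the continuation is determined by the last state alone, skipping earlier scratchpad steps does not change the result as long as the most recent state is available.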

📚 Prerequisite Knowledge

Prerequisites

Finite State Automata (FSA) theory
Transformer architecture (specifically Residual Streams and MLPs)
Group Theory (Cyclic groups, Alternating groups)
Mechanistic Interpretability (Activation Patching, Logit Lens)

Key Terms

FSA: Finite State Automaton, a computational model consisting of states and transitions, used to recognize regular languages

CoT: Chain-of-Thought, a technique in which the model generates intermediate reasoning steps before the final answer; in this setting, the intermediate steps are the states q1, ..., qn

State Tracking: The ability to maintain and update the status of a system (world state) as a sequence of events (inputs) occurs

MLP: Multilayer Perceptron, the feed-forward neural network block within a Transformer layer, shown here to be responsible for state updates

A5 Group: The alternating group of even permutations of 5 symbols (60 elements); the smallest 'non-solvable' group, which makes it a hard benchmark for neural networks

Solvable Group: A group that can be constructed from abelian (commutative) groups using extensions; easier for models to learn than non-solvable groups

Activation Patching: An interpretability technique where specific neuron activations are swapped between inputs to identify which components cause a model's behavior

Compression: A proposed metric measuring how similar the internal representations are for the same state reached via different input histories

Distinction: A proposed metric measuring how different the internal representations are for distinct states