Recurrence: A computational process in which the current state h_t is computed from the previous state h_{t-1} via a function g, so the hidden state preserves the full computational history.
Autoregression: A process where the current output is inferred from previous observed outputs (tokens) o_{t-1}, which may contain only partial information compared to the hidden state.
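The contrast between the two definitions can be sketched in a few lines. This is an illustrative toy (the update functions, dimensions, and the Fibonacci example are assumptions, not from the paper): the recurrent step carries a full hidden state forward, while the autoregressive step sees only previously emitted outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
W_h, W_x = rng.normal(size=(4, 4)), rng.normal(size=(4, 3))

def recurrent_step(h_prev, x_t):
    # Recurrence: h_t = g(h_{t-1}, x_t); the full hidden state is carried forward.
    return np.tanh(W_h @ h_prev + W_x @ x_t)

def autoregressive_step(outputs):
    # Autoregression: the next output depends only on previously *emitted*
    # outputs o_{t-1}, ..., which may encode less than a hidden state would.
    return outputs[-1] + outputs[-2]  # toy rule: sum of the last two outputs

h = np.zeros(4)
for x in rng.normal(size=(5, 3)):
    h = recurrent_step(h, x)  # h summarizes the entire input history

outs = [1, 1]
for _ in range(5):
    outs.append(autoregressive_step(outs))
print(outs)  # [1, 1, 2, 3, 5, 8, 13]
```

The autoregressive loop recovers the Fibonacci sequence because, for that task, the last two emitted tokens happen to contain all the state needed; in general, emitted tokens may discard information that a hidden state would have kept.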
Chain of Thought: A prompting strategy that encourages the model to generate intermediate reasoning steps, which this paper argues simulates recurrent memory.
Depth Complexity: The number of sequential steps required to process an input; Transformers process all positions in parallel and therefore have O(1) depth, while RNNs require O(n) sequential steps.
Finite Automata: A theoretical model of computation (state machine) defined by states and transitions, used here as a baseline for recurrent capability.
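A finite automaton is itself a minimal recurrence: each transition is an update state_t = delta(state_{t-1}, input_t). A small illustrative sketch (the parity automaton below is a standard example, not one taken from the paper):

```python
def parity_dfa(bits):
    # Two-state finite automaton tracking the parity of 1s seen so far.
    # Each step is a recurrent state update: state_t = delta(state_{t-1}, b_t).
    delta = {("even", 0): "even", ("even", 1): "odd",
             ("odd", 0): "odd",  ("odd", 1): "even"}
    state = "even"  # initial state
    for b in bits:
        state = delta[(state, b)]
    return state

print(parity_dfa([1, 0, 1, 1]))  # "odd" (three 1s)
```

Tracking parity requires remembering one bit of state across the whole input, which is why automata of this kind serve as a baseline for recurrent capability.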
RWKV: Receptance Weighted Key Value, an RNN-like Transformer architecture analyzed in this paper.
Linear Transformer: A Transformer variant with linear attention complexity, analyzed for its recurrent capabilities.
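Linear attention admits an exactly equivalent recurrent form, which is the basis for analyzing its recurrent capabilities. A minimal sketch of unnormalized causal linear attention (shapes and variable names are illustrative assumptions): the state S_t = S_{t-1} + k_t v_t^T is updated per step, and the output is y_t = q_t^T S_t.

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    # Unnormalized causal linear attention as a recurrence:
    #   S_t = S_{t-1} + k_t v_t^T,   y_t = q_t^T S_t
    # The constant-size state S is what gives linear Transformers an
    # RNN-like inference mode with O(1) memory per step.
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))
    out = np.empty((n, V.shape[1]))
    for t in range(n):
        S = S + np.outer(K[t], V[t])  # accumulate key-value state
        out[t] = Q[t] @ S             # read out against current query
    return out

rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(6, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 2))
y_rec = linear_attention_recurrent(Q, K, V)
# Equivalent parallel form with a causal mask: y = tril(Q K^T) V
y_par = np.tril(Q @ K.T) @ V
print(np.allclose(y_rec, y_par))  # True
```

The recurrent and parallel forms agree term by term, since y_t = sum over s <= t of (q_t . k_s) v_s in both cases.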