Chain of Thought Empowers Transformers to Solve Inherently Serial Problems

📝 Paper Summary

Theoretical expressiveness of Transformers
Chain of Thought (CoT) reasoning

Theoretically proves that while standard transformers are limited to parallelizable problems, adding Chain of Thought steps empowers them to solve inherently serial problems by acting as a general-purpose sequential computer.

Core Problem

Standard decoder-only transformers are theoretically limited to solving problems that can be computed in parallel (low circuit depth), making them incapable of solving inherently serial tasks regardless of size.

Why it matters:

Explains why LLMs struggle with math and logic puzzles without step-by-step prompting despite massive scale
Current theoretical bounds for transformers are loose, often assuming unrealistic precision or failing to explain the specific utility of intermediate steps
Understanding the mechanism of CoT is crucial for designing better architectures for complex reasoning

Concrete Example: A standard transformer cannot solve the 'permutation composition' problem (computing the result of applying several permutations in sequence, e.g., f(g(h(x)))) because each step depends on the state left by the previous one, and a constant-depth parallel circuit is not believed to be able to track that state. With CoT, the model can output h(x), then g(h(x)), and finally f(g(h(x))), one step at a time.
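The step-by-step trace in this example can be sketched in a few lines of Python. This is only an illustration of the serial structure of the task, not the paper's model or data format; the tuple encoding of permutations and the names `compose` and `cot_trace` are assumptions made here.

```python
from typing import List, Tuple

# Assumed encoding: perm[i] is the image of i under the permutation.
Perm = Tuple[int, ...]

def compose(outer: Perm, inner: Perm) -> Perm:
    """Return the permutation x -> outer[inner[x]]."""
    return tuple(outer[inner[x]] for x in range(len(inner)))

def cot_trace(perms: List[Perm]) -> List[Perm]:
    """Running products, applying perms[0] first, then perms[1], and so on.

    Each entry plays the role of one intermediate CoT output; the last
    entry is the final answer. Requires a non-empty list.
    """
    state: Perm = tuple(range(len(perms[0])))  # start from the identity
    trace = []
    for p in perms:
        state = compose(p, state)  # depends on the previous state
        trace.append(state)
    return trace

h = (1, 2, 0)
g = (0, 2, 1)
f = (2, 1, 0)
trace = cot_trace([h, g, f])  # [h, g∘h, f∘g∘h]
```

Each loop iteration needs the state produced by the previous one, which is the dependency the CoT tokens carry forward.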

Key Novelty

CoT as a Serial Computing Mechanism

Establishes a tighter upper bound for standard constant-precision transformers, proving they are limited to the complexity class AC0 (a smaller class than the previously known TC0 bound)
Proves that with T steps of CoT, these transformers can simulate any Boolean circuit of size T, effectively enabling general serial computation
Demonstrates that polynomial steps of CoT allow transformers to solve any problem in P/poly (polynomial size circuits), a massive jump in expressiveness

Architecture

A conceptual diagram contrasting Standard Transformers vs. CoT Transformers mapped to complexity classes

Evaluation Highlights

Transformers with CoT achieve >90% accuracy on 5-element permutation composition tasks where standard transformers fail completely (<10% accuracy) even with depth 16
Standard transformers solve modular addition (a parallelizable task) at 100% accuracy with depth 1, consistent with the theory that CoT is needed only for serial tasks
CoT enables solving the Circuit Value Problem (tracking logic gate values) with near 100% accuracy, whereas standard models remain at random guessing (~50%) regardless of depth

Breakthrough Assessment

9/10

Provides a fundamental theoretical justification for why CoT works. It bridges the gap between empirical success and circuit complexity theory, offering rigorous proofs for the serial vs. parallel nature of Transformers.

⚙️ Technical Details

Problem Definition

Setting: Binary classification or token prediction tasks modeled as Boolean functions or discrete mappings

Inputs: Sequence of tokens representing a problem instance (e.g., a math problem or logic circuit)

Outputs: Final answer token, optionally preceded by T intermediate 'thought' tokens

Pipeline Flow

Input Sequence
Transformer (Autoregressive Generation)
Output Sequence (CoT tokens + Answer)
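A minimal sketch of this pipeline, assuming a toy hand-written next-token rule in place of a trained transformer. The names `generate` and `make_parity_step` are hypothetical; the rule computes a running XOR so that the last generated token is the parity of the input bits.

```python
from typing import Callable, List

def generate(next_token: Callable[[List[int]], int],
             prompt: List[int], steps: int) -> List[int]:
    """Autoregressive loop: each new token is computed from the full prefix."""
    seq = list(prompt)
    for _ in range(steps):
        seq.append(next_token(seq))
    return seq

def make_parity_step(n: int) -> Callable[[List[int]], int]:
    """Toy next-token rule (stand-in for a model) over n input bits."""
    def step(seq: List[int]) -> int:
        k = len(seq) - n                 # CoT tokens emitted so far
        prev = seq[-1] if k > 0 else 0   # previous running parity
        return prev ^ seq[k]             # fold in input bit k
    return step

bits = [1, 0, 1, 1]
out = generate(make_parity_step(len(bits)), bits, steps=len(bits))
cot, answer = out[len(bits):-1], out[-1]  # intermediate tokens, final answer
```

Every step only combines one earlier output with one input bit, yet the final token is a function (parity) that the summary notes lies outside AC0.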

System Modules

Transformer Decoder

Next-token prediction engine acting as a logic gate evaluator at each step

Model or implementation: Decoder-only Transformer (GPT-style)

Novel Architectural Elements

Theoretical framework treating the autoregressive generation process as a sequential circuit evaluation, where each generated token represents the output of a specific gate in a simulated circuit
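The idea of one generated token per gate can be illustrated with a small gate-by-gate evaluator. This is a sketch under assumptions, not the paper's construction: the `(op, a, b)` gate encoding and the function name `eval_circuit_serially` are invented here, and gates are assumed to be listed in topological order.

```python
from typing import List, Tuple

# Assumed encoding: (op, index of first operand, index of second operand).
# Indices refer to the inputs followed by earlier gate outputs.
Gate = Tuple[str, int, int]

def eval_circuit_serially(inputs: List[int], gates: List[Gate]) -> List[int]:
    """Evaluate gates one at a time; each output is one 'CoT token'."""
    values = list(inputs)
    for op, a, b in gates:
        if op == "AND":
            v = values[a] & values[b]
        elif op == "OR":
            v = values[a] | values[b]
        elif op == "NOT":
            v = 1 - values[a]  # second operand is ignored
        else:
            raise ValueError(f"unknown gate type: {op}")
        values.append(v)       # later gates may read this value
    return values[len(inputs):]

# XOR(x0, x1) = (x0 OR x1) AND NOT (x0 AND x1)
xor_gates = [("OR", 0, 1), ("AND", 0, 1), ("NOT", 3, 3), ("AND", 2, 4)]
tokens = eval_circuit_serially([1, 0], xor_gates)
```

A circuit with T gates yields T tokens, matching the summary's statement that T CoT steps suffice to simulate a circuit of size T.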

Modeling

Base Model: GPT-style Decoder-only Transformer

Training Method: Supervised learning from scratch on synthetic datasets

Objective Functions:

Purpose: Minimize prediction error for next token.

Formally: Cross-entropy loss over the target sequence
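For reference, the standard next-token cross-entropy can be written as below. The notation is an assumption made here rather than the paper's: n input tokens, T CoT tokens, one answer token, and model distribution p_θ.

```latex
\mathcal{L}(\theta)
  = -\sum_{t=n+1}^{n+T+1} \log p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

In the setting without CoT, T = 0 and the sum reduces to the single answer-token term.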

Adaptation: Full training

Training Data:

Synthetic datasets generated for 4 tasks: Modular Addition, Permutation Composition, Iterated Squaring, Circuit Value Problem
Data includes input-only examples (Standard) and input+intermediate-steps examples (CoT)
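The two example formats might look like the following sketch for Modular Addition. The token layout, the choice of running partial sums as the intermediate steps, and the name `make_examples` are assumptions for illustration; the paper's exact formats are not given in this summary.

```python
from typing import Dict, List

def make_examples(xs: List[int], m: int) -> Dict[str, List[int]]:
    """Build a Standard and a CoT training sequence for sum(xs) mod m.

    Assumes xs is non-empty. The CoT steps are running partial sums,
    so the last step is also the final answer.
    """
    partial, steps = 0, []
    for x in xs:
        partial = (partial + x) % m
        steps.append(partial)
    answer = steps[-1]
    return {
        "standard": xs + [answer],  # input followed by the answer only
        "cot": xs + steps,          # input, intermediate sums, then answer
    }

ex = make_examples([3, 5, 6], m=7)
```

Both sequences end in the same answer token; only the CoT variant exposes the intermediate states to the loss.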

Key Hyperparameters:

learning_rate: 1e-4
batch_size: 64
optimizer: AdamW
embedding_dimension: 64 to 256 (varied by task)
layers: 1 to 16 (varied to test depth dependency)
heads: 4

Compute: Trained on synthetic data; specific GPU resources not reported in the paper

Comparison to Prior Work

vs. Merrill & Sabharwal (2023b): Provides a tighter bound (AC0) for constant-precision transformers and explicitly models the CoT process as circuit simulation
vs. Liu et al. (2022b): Extends analysis from parallel expressiveness to serial expressiveness via CoT steps
vs. Feng et al. (2024) [not cited in paper]: Feng et al. explore CoT in RNNs; this paper focuses strictly on the Transformer architecture's specific parallel limitations

Limitations

Theoretical bounds assume constant precision, which may not perfectly model floating-point implementations
Separation results rely on complexity-theoretic conjectures (e.g., that NC1 is not contained in TC0), which are widely believed but unproven
Experiments are on synthetic algorithmic tasks, not natural language reasoning benchmarks
Does not analyze the difficulty of *learning* the CoT strategy, only the *expressiveness* (capability) to represent it

Reproducibility

Theoretical proofs are fully contained in the paper. Synthetic data generation logic is described for the 4 tasks (Modular Addition, Permutation Composition, Iterated Squaring, CVP). Code availability is not explicitly mentioned.

📊 Experiments & Results

Evaluation Setup

Supervised training on synthetic algorithmic tasks requiring varying degrees of serial computation

Benchmarks:

Modular Addition (Parallelizable arithmetic (TC0)) [New]
Permutation Composition (Serial composition (NC1-complete for S5)) [New]
Iterated Squaring (Serial arithmetic) [New]
Circuit Value Problem (CVP) (P-complete serial logic simulation) [New]

Metrics:

Accuracy (Exact Match of final answer)
Statistical methodology: Not explicitly reported in the paper

Key Results

CoT enables transformers to solve inherently serial problems that standard transformers cannot solve regardless of depth.
Benchmark	Metric	Baseline (no CoT)	This Paper (CoT)	Δ
Permutation Composition (S_5)	Accuracy	0.10	0.95	+0.85
Circuit Value Problem (Width 16)	Accuracy	0.50	1.00	+0.50
Iterated Squaring	Accuracy	0.10	0.95	+0.85
Parallelizable problems do not require CoT, consistent with the theoretical distinction.
Benchmark	Metric	Baseline (no CoT)	This Paper (CoT)	Δ
Modular Addition	Accuracy	1.00	1.00	0.00

Experiment Figures

Accuracy plots vs. Transformer Depth for the 4 tasks (Modular Addition, Permutation, Squaring, CVP) comparing Standard vs. CoT

Main Takeaways

Constant-depth, constant-precision transformers without CoT are limited to AC0, so they cannot solve inherently serial tasks, or even parity
CoT transforms the transformer into a serial computer, allowing it to solve any problem solvable by polynomial-size circuits (P/poly) given polynomial steps
The 'form' of CoT is crucial: simply outputting intermediate tokens enables the serial computation mechanism, regardless of whether those tokens are 'natural language' explanations
In these experiments, depth 1 with CoT is sufficient to solve complex serial tasks, whereas depth 16 without CoT fails, suggesting that autoregressive steps are a more powerful scaling dimension than layer depth for these problems

📚 Prerequisite Knowledge

Prerequisites

Circuit complexity classes (AC0, TC0, NC1, P/poly)
Transformer architecture (Attention, MLP)
Boolean circuits and logic gates

Key Terms

CoT: Chain of Thought, i.e., having the model generate intermediate reasoning steps before the final answer

AC0: A complexity class of problems solvable by circuits with constant depth and polynomial size, using AND, OR, NOT gates with unbounded fan-in (cannot solve parity)

TC0: A complexity class like AC0 but extended with MAJORITY gates (can solve parity and simple arithmetic)

NC1: A complexity class of problems solvable by circuits with logarithmic depth and bounded fan-in (captures simple serial computations)

P/poly: The class of problems solvable by polynomial-size Boolean circuits; it contains P, the class of problems with efficient deterministic algorithms

Permutation Composition: The task of computing the product of a sequence of permutations; inherently serial because the state depends on the entire history

Circuit Value Problem (CVP): Given a Boolean circuit and inputs, compute the output; a canonical P-complete problem that is inherently serial

Iterated Squaring: Computing x^(2^n) mod m; requires sequential multiplications and is hard to parallelize
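A short sketch of the sequential structure, with each squaring as one intermediate step. The function name `iterated_squaring_trace` and the example numbers are illustrative choices, not taken from the paper.

```python
from typing import List

def iterated_squaring_trace(x: int, n: int, m: int) -> List[int]:
    """Return [x^2, x^4, ..., x^(2^n)] mod m, one squaring per step."""
    trace = []
    value = x % m
    for _ in range(n):
        value = (value * value) % m  # needs the previous step's value
        trace.append(value)
    return trace

trace = iterated_squaring_trace(3, 4, 1000)
# The last entry agrees with Python's built-in modular exponentiation.
matches_builtin = trace[-1] == pow(3, 2 ** 4, 1000)
```

The n squarings form a chain in which step k cannot begin until step k-1 has finished, which is why the task is considered hard to parallelize.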

Modular Addition: Computing the sum of numbers modulo m; theoretically parallelizable (in TC0)

Fan-in: The number of input wires feeding into a logic gate