Guhao Feng, Yuntian Gu, Bohang Zhang, Haotian Ye, Di He, Liwei Wang
School of EECS, Peking University
Neural Information Processing Systems (NeurIPS), 2023
Reasoning · Benchmark
📝 Paper Summary
LLM Interpretability · Theoretical Deep Learning · Reasoning
In theory, bounded-depth Transformers cannot directly solve basic math or dynamic-programming tasks due to circuit-complexity limitations, but Chain-of-Thought enables them to succeed by serving as a recursive loop that increases effective depth.
Core Problem
Autoregressive Transformers struggle to directly generate correct answers for complex tasks involving mathematics, logic, or reasoning, despite their success in other NLP areas.
Why it matters:
The underlying mechanism by which Chain-of-Thought (CoT) dramatically improves performance remains poorly understood
It is unclear whether LLMs are inherently limited in solving math problems directly or whether the failures are merely a training issue
Understanding these mechanisms is crucial for justifying the reliability of LLMs in tackling complex real-world decision-making tasks
Concrete Example: When asked to solve a system of linear equations directly, a standard Transformer might immediately output a wrong answer. With CoT, however, it generates the Gaussian elimination steps (e.g., 'Step 1: Eliminate x from eq 2...') before outputting the correct solution.
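The elimination procedure behind that example can be sketched directly. The function below is our own illustration (not the paper's code) of Gaussian elimination over the integers modulo a prime p, the setting of the paper's Equation task; each elimination step corresponds to one CoT derivation step.

```python
# Sketch (assumption: p is prime, so modular inverses exist via Fermat's
# little theorem). Solves A x = b over Z_p, or returns None if singular.
def solve_mod_p(A, b, p):
    n = len(A)
    # Work on an augmented matrix [A | b], all entries reduced mod p.
    M = [[A[i][j] % p for j in range(n)] + [b[i] % p] for i in range(n)]
    for col in range(n):
        # Find a pivot row with a nonzero entry in this column.
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return None  # singular system
        M[col], M[pivot] = M[pivot], M[col]
        # Normalize the pivot row using the modular inverse.
        inv = pow(M[col][col], p - 2, p)
        M[col] = [x * inv % p for x in M[col]]
        # Eliminate this variable from every other row -- one "CoT step".
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [(M[r][j] - f * M[col][j]) % p for j in range(n + 1)]
    return [M[i][n] for i in range(n)]
```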
Key Novelty
Circuit Complexity Analysis of CoT
Models the Transformer as a circuit with bounded depth and analyzes the complexity class of target problems (e.g., Arithmetic is NC1-complete)
Proves that bounded-depth Transformers (in TC0) cannot solve these problems directly unless their size grows super-polynomially with input length
Demonstrates by construction that generating CoT derivations effectively increases the circuit depth, allowing constant-size Transformers to solve these problems
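The depth argument in the last point can be made concrete with a toy sketch (our illustration, not the paper's construction): a single forward pass applies the depth-L circuit once, while T autoregressive CoT steps compose that same circuit T times, so effective depth grows with the length of the generated derivation.

```python
def run_direct(f, x):
    """One forward pass: a single application of the depth-L circuit f."""
    return f(x)

def run_with_cot(f, x, steps):
    """Autoregressive loop: each generated token re-enters the model,
    so the circuit f is composed `steps` times (effective depth L * steps)."""
    for _ in range(steps):
        x = f(x)
    return x
```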
Architecture
Illustration of the input/output format for Arithmetic and Equation tasks using Chain-of-Thought (CoT).
Evaluation Highlights
Proved theoretically: Bounded-depth log-precision Transformers cannot solve Arithmetic or Linear Equations directly (impossible unless size grows super-polynomially)
Proved theoretically: A Transformer with depth L=5 and constant size is sufficient to solve Arithmetic tasks using CoT
Proved theoretically: A Transformer with depth L=4 and constant size is sufficient to solve Linear Equation tasks using CoT
Breakthrough Assessment
9/10
Provides a rigorous theoretical foundation for why Chain-of-Thought is necessary, moving beyond empirical observation to mathematical proof using complexity theory.
⚙️ Technical Details
Problem Definition
Setting: Sequence generation tasks over a finite field of integers modulo p
Inputs: Sequences of tokens representing arithmetic expressions (numbers, operators) or linear equations
Outputs: The computed result (Direct) or a sequence of intermediate derivation steps followed by the result (CoT)
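A minimal sketch of the finite-field arithmetic these tasks involve (our illustration, with p assumed prime so that division can be implemented as multiplication by a Fermat inverse):

```python
# Field operations over Z_p (p prime, as in the paper's Arithmetic(n, p) task).
def add(a, b, p): return (a + b) % p
def mul(a, b, p): return (a * b) % p
def div(a, b, p): return (a * pow(b, p - 2, p)) % p  # Fermat inverse, b != 0

# Example: evaluate (2 + 3) * 4 / 3 over Z_7.
result = div(mul(add(2, 3, 7), 4, 7), 3, 7)
```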
Pipeline Flow
Input Embedding (Token + Positional)
Transformer Layers (Attention + FFN)
Autoregressive Loop (Output fed back as input)
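The pipeline above can be sketched as a single loop. The interfaces here (`embed`, `transformer`, `classify`) are hypothetical stand-ins for the three stages, not the paper's implementation:

```python
def generate(prompt_tokens, max_steps, embed, transformer, classify, eos):
    """Autoregressive generation: each predicted token is appended to the
    sequence and fed back through the bounded-depth model."""
    tokens = list(prompt_tokens)
    for _ in range(max_steps):
        h = transformer(embed(tokens))  # one depth-L forward pass
        next_tok = classify(h[-1])      # softmax classifier over vocabulary
        tokens.append(next_tok)         # output fed back as input
        if next_tok == eos:
            break
    return tokens

# Toy demo with trivial stand-ins: the "model" predicts last token + 1.
demo = generate([1], 10,
                embed=lambda t: t,
                transformer=lambda h: h,
                classify=lambda x: x + 1,
                eos=5)
```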
System Modules
Input Embedding
Encodes input tokens into vectors with positional information
Model or implementation: Standard Embedding Layer
Transformer Block
Processes representations via Self-Attention and FFN
Model or implementation: Multi-head Self-Attention + 2-layer FFN
Autoregressive Generator
Predicts next token and appends it to sequence
Model or implementation: Softmax Classifier + Loop
Novel Architectural Elements
Theoretical construction of specific 'programs' (conditional COPY, MEAN, lookup tables) within standard Transformer weights to implement math algorithms
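A toy illustration of two of these 'programs' (our sketch, not the paper's exact weight construction): with a sharp, near one-hot attention pattern a head behaves like a conditional COPY, retrieving the value at a selected position, while uniform scores make the same layer compute a MEAN over positions.

```python
import numpy as np

def attention(Q, K, V, temp=1.0):
    """Standard softmax attention; `temp` controls score sharpness."""
    scores = Q @ K.T / temp
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

K = np.eye(3)
V = np.array([[10.0], [20.0], [30.0]])

# COPY: a low temperature makes the softmax near one-hot.
Q = np.array([[0.0, 0.0, 1.0]])          # "attend to position 2"
copied = attention(Q, K, V, temp=1e-3)   # retrieves V[2]

# MEAN: identical scores give uniform attention weights.
mean = attention(np.zeros((1, 3)), K, V)  # averages all of V
```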
Modeling
Base Model: Autoregressive Transformer (Theoretical Construct)
Training Method: Theoretical Construction (Existence Proof)
Trainable Parameters: Weights W_Q, W_K, W_V, W_O, W_1, W_2 constructed to perform specific logic
vs. MLPs: Shows that bounded-depth Transformers face a parallel-complexity limitation (TC0) that sufficiently deep MLPs, as universal approximators, might not face in the same depth-vs.-width trade-off
vs. RNNs: Notes that constant-size RNNs cannot solve these math tasks even with the same CoT format (Section F.2 discussion)
Limitations
Analysis is restricted to log-precision Transformers (finite precision)
Uses finite fields (integers modulo p) rather than real numbers for proofs
Theoretical impossibility results assume TC0 != NC1 (a standard but unproven complexity assumption)
Focuses on existence proofs (constructive) rather than learnability (optimization dynamics)
Reproducibility
This is primarily a theoretical paper providing existence proofs; no code or trained model weights are provided.
📊 Experiments & Results
Evaluation Setup
Theoretical proofs of expressivity paired with empirical validation on math/DP tasks
Benchmarks:
Arithmetic(n, p) (Evaluate arithmetic expressions in finite field) [New]
Equation(m, p) (Solve system of linear equations in finite field) [New]
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
Direct Generation Failure: Empirical results show that directly predicting answers for Math and DP tasks fails (accuracy mostly below 60%), aligning with the theoretical impossibility result.
CoT Success: Transformers utilizing Chain-of-Thought can consistently learn to generate correct solutions, verifying the constructive proofs that constant-size models are sufficient when depth is effectively extended via autoregression.
Generalization: Models with CoT generalize well to longer input sequences than seen during training, suggesting they learn the underlying algorithm rather than memorizing distributions.
Theoretical Hierarchy: The paper establishes that while bounded-depth Transformers (TC0) cannot solve these problems, the autoregressive CoT process allows them to tackle problems in harder complexity classes (like NC1 and P-complete DP tasks).
📚 Prerequisite Knowledge
Prerequisites
Transformer Architecture (Self-Attention, FFN)
Circuit Complexity Theory
Autoregressive Generation
Key Terms
CoT: Chain-of-Thought—a prompting strategy that induces the model to generate intermediate reasoning steps before the final answer
TC0: A complexity class of problems solvable by constant-depth circuits with unbounded fan-in and majority gates; Transformers are often mapped to this class
NC1: A complexity class of problems solvable by logarithmic-depth circuits; problems complete for NC1 are believed to be harder than those in TC0 and not parallelizable to constant depth (assuming TC0 ≠ NC1)
Log-precision Transformer: A theoretical model where internal neurons store floating-point numbers with O(log n) bit precision relative to input length n
Dynamic Programming: An algorithmic technique for solving complex problems by breaking them down into simpler subproblems in a recursive manner (e.g., Edit Distance)
Finite Field: A mathematical field containing a finite number of elements (integers modulo p), used here to simplify precision issues in proofs
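The Dynamic Programming term above can be illustrated with the standard Edit Distance algorithm (a textbook implementation, included here only to show the recursive subproblem structure that CoT can emit step by step; it is not code from the paper):

```python
def edit_distance(a, b):
    """Minimum insertions/deletions/substitutions turning string a into b."""
    m, n = len(a), len(b)
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            # Each cell depends only on already-solved subproblems.
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]
```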