Guhao Feng, Yuntian Gu, Bohang Zhang, Haotian Ye, Di He, Liwei Wang
School of EECS, Peking University
Neural Information Processing Systems (NeurIPS), 2023
Reasoning · Benchmark
📝 Paper Summary
LLM Interpretability · Theoretical Deep Learning · Reasoning
In theory, bounded-depth Transformers cannot directly solve basic math or dynamic-programming tasks due to circuit-complexity limitations, but Chain-of-Thought enables them to succeed by serving as a recursive loop that increases effective depth.
Core Problem
Autoregressive Transformers struggle to directly generate correct answers for complex tasks involving mathematics, logic, or reasoning, despite their success in other NLP areas.
Why it matters:
The underlying mechanism by which Chain-of-Thought (CoT) dramatically improves performance remains poorly understood
It is unclear whether LLMs are inherently limited in solving math problems directly or whether the failures are merely a training issue
Understanding these mechanisms is crucial for justifying the reliability of LLMs in tackling complex real-world decision-making tasks
Concrete Example: When asked to solve a system of linear equations directly, a standard Transformer might immediately output a wrong answer. With CoT, however, it generates the Gaussian elimination steps (e.g., 'Step 1: Eliminate x from eq 2...') before outputting the correct solution.
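The elimination procedure behind that example can be sketched directly. The function below is our own illustration (not the paper's code) of Gaussian elimination over the integers modulo a prime p, the setting of the paper's Equation task; each elimination step corresponds to one CoT derivation step.

```python
# Sketch (assumption: p is prime, so modular inverses exist via Fermat's
# little theorem). Solves A x = b over Z_p, or returns None if singular.
def solve_mod_p(A, b, p):
    n = len(A)
    # Work on an augmented matrix [A | b], all entries reduced mod p.
    M = [[A[i][j] % p for j in range(n)] + [b[i] % p] for i in range(n)]
    for col in range(n):
        # Find a pivot row with a nonzero entry in this column.
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return None  # singular system
        M[col], M[pivot] = M[pivot], M[col]
        # Normalize the pivot row using the modular inverse.
        inv = pow(M[col][col], p - 2, p)
        M[col] = [x * inv % p for x in M[col]]
        # Eliminate this variable from every other row -- one "CoT step".
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [(M[r][j] - f * M[col][j]) % p for j in range(n + 1)]
    return [M[i][n] for i in range(n)]
```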
Key Novelty
Circuit Complexity Analysis of CoT
Models the Transformer as a circuit with bounded depth and analyzes the complexity class of target problems (e.g., Arithmetic is NC1-complete)
Proves that bounded-depth Transformers (in TC0) cannot solve these problems directly unless their size grows super-polynomially with input length
Demonstrates by construction that generating CoT derivations effectively increases the circuit depth, allowing constant-size Transformers to solve these problems
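The depth argument in the last point can be made concrete with a toy sketch (our illustration, not the paper's construction): a single forward pass applies the depth-L circuit once, while T autoregressive CoT steps compose that same circuit T times, so effective depth grows with the length of the generated derivation.

```python
def run_direct(f, x):
    """One forward pass: a single application of the depth-L circuit f."""
    return f(x)

def run_with_cot(f, x, steps):
    """Autoregressive loop: each generated token re-enters the model,
    so the circuit f is composed `steps` times (effective depth L * steps)."""
    for _ in range(steps):
        x = f(x)
    return x
```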
Architecture
Illustration of the input/output format for Arithmetic and Equation tasks using Chain-of-Thought (CoT).
Evaluation Highlights
Proved theoretically: Bounded-depth log-precision Transformers cannot solve Arithmetic or Linear Equations directly (impossible unless size grows super-polynomially)
Proved theoretically: A Transformer with depth L=5 and constant size is sufficient to solve Arithmetic tasks using CoT
Proved theoretically: A Transformer with depth L=4 and constant size is sufficient to solve Linear Equation tasks using CoT
Breakthrough Assessment
9/10
Provides a rigorous theoretical foundation for why Chain-of-Thought is necessary, moving beyond empirical observation to mathematical proof using complexity theory.
⚙️ Technical Details
Problem Definition
Setting: Sequence generation tasks over a finite field of integers modulo p
Inputs: Sequences of tokens representing arithmetic expressions (numbers, operators) or linear equations
Outputs: The computed result (Direct) or a sequence of intermediate derivation steps followed by the result (CoT)
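A minimal sketch of the finite-field arithmetic these tasks involve (our illustration, with p assumed prime so that division can be implemented as multiplication by a Fermat inverse):

```python
# Field operations over Z_p (p prime, as in the paper's Arithmetic(n, p) task).
def add(a, b, p): return (a + b) % p
def mul(a, b, p): return (a * b) % p
def div(a, b, p): return (a * pow(b, p - 2, p)) % p  # Fermat inverse, b != 0

# Example: evaluate (2 + 3) * 4 / 3 over Z_7.
result = div(mul(add(2, 3, 7), 4, 7), 3, 7)
```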
Pipeline Flow
Input Embedding (Token + Positional)
Transformer Layers (Attention + FFN)
Autoregressive Loop (Output fed back as input)
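The pipeline above can be sketched as a single loop. The interfaces here (`embed`, `transformer`, `classify`) are hypothetical stand-ins for the three stages, not the paper's implementation:

```python
def generate(prompt_tokens, max_steps, embed, transformer, classify, eos):
    """Autoregressive generation: each predicted token is appended to the
    sequence and fed back through the bounded-depth model."""
    tokens = list(prompt_tokens)
    for _ in range(max_steps):
        h = transformer(embed(tokens))  # one depth-L forward pass
        next_tok = classify(h[-1])      # softmax classifier over vocabulary
        tokens.append(next_tok)         # output fed back as input
        if next_tok == eos:
            break
    return tokens

# Toy demo with trivial stand-ins: the "model" predicts last token + 1.
demo = generate([1], 10,
                embed=lambda t: t,
                transformer=lambda h: h,
                classify=lambda x: x + 1,
                eos=5)
```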
System Modules
Input Embedding
Encodes input tokens into vectors with positional information
Model or implementation: Standard Embedding Layer
Transformer Block
Processes representations via Self-Attention and FFN
Model or implementation: Multi-head Self-Attention + 2-layer FFN
Autoregressive Generator
Predicts next token and appends it to sequence
Model or implementation: Softmax Classifier + Loop
Novel Architectural Elements
Theoretical construction of specific 'programs' (conditional COPY, MEAN, lookup tables) within standard Transformer weights to implement math algorithms
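A toy illustration of two of these 'programs' (our sketch, not the paper's exact weight construction): with a sharp, near one-hot attention pattern a head behaves like a conditional COPY, retrieving the value at a selected position, while uniform scores make the same layer compute a MEAN over positions.

```python
import numpy as np

def attention(Q, K, V, temp=1.0):
    """Standard softmax attention; `temp` controls score sharpness."""
    scores = Q @ K.T / temp
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

K = np.eye(3)
V = np.array([[10.0], [20.0], [30.0]])

# COPY: a low temperature makes the softmax near one-hot.
Q = np.array([[0.0, 0.0, 1.0]])          # "attend to position 2"
copied = attention(Q, K, V, temp=1e-3)   # retrieves V[2]

# MEAN: identical scores give uniform attention weights.
mean = attention(np.zeros((1, 3)), K, V)  # averages all of V
```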
Modeling
Base Model: Autoregressive Transformer (Theoretical Construct)
Training Method: Theoretical Construction (Existence Proof)
Trainable Parameters: Weights W_Q, W_K, W_V, W_O, W_1, W_2 constructed to perform specific logic
vs. MLPs: Shows that bounded-depth Transformers face a parallel-complexity limitation (TC0) that sufficiently deep MLPs, as universal approximators, might not face in the same depth-vs.-width trade-off
vs. RNNs: Notes that constant-size RNNs cannot solve these math tasks even with the same CoT format (Section F.2 discussion)
Limitations
Analysis is restricted to log-precision Transformers (finite precision)
Uses finite fields (integers modulo p) rather than real numbers for proofs
Theoretical impossibility results assume TC0 != NC1 (a standard but unproven complexity assumption)
Focuses on existence proofs (constructive) rather than learnability (optimization dynamics)
Reproducibility
This is primarily a theoretical paper providing existence proofs; no code or trained model weights are provided.
📊 Experiments & Results
Evaluation Setup
Theoretical proofs of expressivity paired with empirical validation on math/DP tasks
Benchmarks:
Arithmetic(n, p) (Evaluate arithmetic expressions in finite field) [New]
Equation(m, p) (Solve system of linear equations in finite field) [New]
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
Direct Generation Failure: Empirical results show that directly predicting answers for Math and DP tasks fails (accuracy mostly below 60%), aligning with the theoretical impossibility result.
CoT Success: Transformers utilizing Chain-of-Thought can consistently learn to generate correct solutions, verifying the constructive proofs that constant-size models are sufficient when depth is effectively extended via autoregression.
Generalization: Models with CoT generalize well to longer input sequences than seen during training, suggesting they learn the underlying algorithm rather than memorizing distributions.
Theoretical Hierarchy: The paper establishes that while bounded-depth Transformers (TC0) cannot solve these problems, the autoregressive CoT process allows them to tackle problems in harder complexity classes (like NC1 and P-complete DP tasks).
📚 Prerequisite Knowledge
Prerequisites
Transformer Architecture (Self-Attention, FFN)
Circuit Complexity Theory
Autoregressive Generation
Key Terms
CoT: Chain-of-Thought—a prompting strategy that induces the model to generate intermediate reasoning steps before the final answer
TC0: A complexity class of problems solvable by constant-depth circuits with unbounded fan-in and majority gates; Transformers are often mapped to this class
NC1: A complexity class of problems solvable by logarithmic-depth circuits; problems complete for NC1 are believed to be harder than those in TC0 and not parallelizable to constant depth (assuming TC0 ≠ NC1)
Log-precision Transformer: A theoretical model where internal neurons store floating-point numbers with O(log n) bit precision relative to input length n
Dynamic Programming: An algorithmic technique for solving complex problems by breaking them down into simpler subproblems in a recursive manner (e.g., Edit Distance)
Finite Field: A mathematical field containing a finite number of elements (integers modulo p), used here to simplify precision issues in proofs
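The Dynamic Programming term above can be illustrated with the standard Edit Distance algorithm (a textbook implementation, included here only to show the recursive subproblem structure that CoT can emit step by step; it is not code from the paper):

```python
def edit_distance(a, b):
    """Minimum insertions/deletions/substitutions turning string a into b."""
    m, n = len(a), len(b)
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            # Each cell depends only on already-solved subproblems.
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]
```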