
Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization

Yu Huang, Zixin Wen, Aarti Singh, Yuejie Chi, Yuxin Chen
Department of Statistics and Data Science, Wharton School, University of Pennsylvania; Machine Learning Department, Carnegie Mellon University; Department of Statistics and Data Science, Yale University
arXiv.org (2025)
Reasoning RL Benchmark

📝 Paper Summary

Keywords: Theoretical analysis · Transformers · Chain-of-Thought (CoT) reasoning · Length generalization
Theoretical analysis proving that gradient descent trains one-layer transformers to solve state-tracking tasks via Chain-of-Thought, with algebraic structure dictating whether length generalization happens automatically or requires recursive self-training.
Core Problem
It is unknown whether transformers trained via gradient descent can actually learn to solve inherently sequential reasoning problems (beyond simple TC0 tasks) and whether they can generalize to longer reasoning chains than seen during training.
Why it matters:
  • Current theoretical understanding is limited to expressiveness (what models *can* represent) or simple parallelizable tasks (TC0), leaving a gap in explaining how models *learn* sequential reasoning (NC1)
  • Length generalization is critical for LLMs to solve harder problems via longer CoT, but empirical results are mixed and mechanisms like 'context rot' are poorly understood
  • The distinction between problems that generalize automatically versus those needing specific curricula (like self-training) is not theoretically established
Concrete Example: Consider a 'symmetry' state-tracking task where multiple group elements map state A to state B (e.g., permutations). A model trained on short chains might learn to attend to 'distractor' clauses that happen to work for short lengths but fail for longer ones due to attention dilution. In contrast, 'cyclic' group actions have unique mappings, leading to robust attention concentration.
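The contrast above can be made concrete with a small sketch. This is an illustrative toy, not the paper's code: `cyclic_action` models the simply transitive case (mod-n addition, where each state pair identifies a unique group element) and `symmetry_action` models the permutation case (where many elements map state A to state B, creating distractors).

```python
# Toy state-tracking task: a word of group elements acts on a start
# state, and the ground-truth CoT writes every intermediate state.

def cyclic_action(state, g, n=6):
    """Simply transitive action of C_n: exactly one g maps any
    state a to any state b (namely g = b - a mod n)."""
    return (state + g) % n

def symmetry_action(state, perm):
    """Action of S_n on points {0..n-1}: many permutations send
    state a to state b, so the pair (a, b) does not identify g."""
    return perm[state]

def chain_of_thought(start, word, action):
    """Ground-truth CoT: the list of intermediate states."""
    states = [start]
    for g in word:
        states.append(action(states[-1], g))
    return states

# C6 example: start at 0, apply 3, 5, 4 -> states 0, 3, 2, 0.
print(chain_of_thought(0, [3, 5, 4], cyclic_action))  # [0, 3, 2, 0]

# S5 example: two different permutations both send state 0 to 2,
# illustrating the 'distractor clause' ambiguity on short chains.
p1 = (2, 1, 0, 3, 4)
p2 = (2, 0, 1, 4, 3)
print(symmetry_action(0, p1), symmetry_action(0, p2))  # 2 2
```

The ambiguity in the second example is exactly what lets a short-chain model latch onto spurious clauses: several distinct permutations are consistent with the observed state transition.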
Key Novelty
Algebraic Structure Dictates Length Generalization
  • Proves that for 'simply transitive' group actions (e.g., modular addition), training on short chains automatically leads to strong attention concentration, enabling generalization to much longer sequences.
  • Shows that for 'symmetry' group actions (e.g., permutations), standard training fails to generalize due to attention distractors; however, a recursive self-training curriculum can bootstrap the model to solve maximal lengths.
  • Provides the first optimization guarantee that constant-depth transformers can learn NC1-complete problems (inherently serial tasks) via CoT, surpassing prior limits of TC0.
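The 'simply transitive' property at the heart of these results has a direct brute-force check, sketched below under the standard definition (for every ordered pair of states, exactly one group element maps the first to the second); the function name is illustrative, not from the paper.

```python
from itertools import permutations

def is_simply_transitive(states, elements, action):
    """True iff for every ordered pair of states (a, b) exactly one
    element g satisfies action(a, g) == b."""
    for a in states:
        for b in states:
            movers = [g for g in elements if action(a, g) == b]
            if len(movers) != 1:
                return False
    return True

# C6 acting on Z/6 by addition: simply transitive, so the paper's
# analysis predicts automatic length generalization.
c6 = list(range(6))
assert is_simply_transitive(c6, c6, lambda s, g: (s + g) % 6)

# S3 acting on {0, 1, 2}: transitive but NOT simply transitive
# (6 permutations vs. 3 states, so some pairs have multiple movers),
# mirroring the distractor problem for symmetry tasks like S5.
s3_states = [0, 1, 2]
s3 = list(permutations(s3_states))
assert not is_simply_transitive(s3_states, s3, lambda s, p: p[s])
```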
Evaluation Highlights
  • For simply transitive tasks (Cyclic C6), models trained on length L=10 achieve near 100% accuracy on lengths up to L=100.
  • For symmetry tasks (S5), models trained on length L=10 fail rapidly (accuracy drops to ~0) on lengths >20.
  • Recursive self-training on S5 enables the model to bridge this gap, extending solvable length from L=10 to L=160 with near-perfect accuracy.
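The bootstrapping curriculum in the last bullet can be sketched as a loop: the model labels somewhat longer chains with its own CoT, and those self-labeled examples become the next round's training set. All names here (`train_on`, `self_label`, `sample_words`) are hypothetical stand-ins, and the doubling schedule is an assumption chosen to match the reported L=10 → L=160 progression, not the paper's exact recipe.

```python
def recursive_self_training(model, train_on, self_label, sample_words,
                            start_len=10, max_len=160, growth=2):
    """Bootstrap from chains of length start_len to max_len by
    repeatedly self-labeling chains up to `growth` times longer."""
    length = start_len
    while length < max_len:
        next_len = min(growth * length, max_len)
        # The model generates CoT labels for longer chains it can
        # reach by composing steps it has already mastered.
        pseudo = [(w, self_label(model, w)) for w in sample_words(next_len)]
        model = train_on(model, pseudo)
        length = next_len
    return model

# Toy instantiation: the "model" is just the set of chain lengths it
# has mastered, and training adds the lengths of the pseudo-labeled data.
toy = recursive_self_training(
    model={10},
    train_on=lambda m, data: m | {len(w) for w, _ in data},
    self_label=lambda m, w: ["state"] * len(w),
    sample_words=lambda n: [list(range(n))],
)
print(sorted(toy))  # [10, 20, 40, 80, 160]
```

The toy run reproduces the curriculum's shape: each round extends the solvable length multiplicatively until the maximal length is reached.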
Breakthrough Assessment
9/10
Significant theoretical advance: first optimization proof for learning NC1 tasks (beyond TC0) and a mechanistic explanation of length generalization linked to algebraic structure, validated by experiments.