CoT (Chain-of-Thought): A prompting technique where the model generates intermediate reasoning steps before producing the final answer
Mechanistic Interpretability: A research field aiming to reverse-engineer neural networks to understand the specific algorithms and circuits they implement
Activation Patching: A technique to localize model function by swapping internal activations between a clean run and a corrupted run to see if the output is restored
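A minimal sketch of activation patching using PyTorch forward hooks on a toy two-layer MLP (the model, inputs, and layer choice are all illustrative, not from the paper). The clean run's hidden activation is cached, then patched into a corrupted run; if this restores the clean output, the patched layer is implicated in the behavior.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model standing in for a transformer (illustrative only).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))

clean_input = torch.randn(1, 4)
corrupted_input = torch.randn(1, 4)

# 1. Clean run: cache the hidden activation at the chosen layer.
cache = {}
def save_hook(module, inp, out):
    cache["clean"] = out.detach()

handle = model[0].register_forward_hook(save_hook)
clean_logits = model(clean_input)
handle.remove()

# 2. Corrupted run: swap in the cached clean activation.
def patch_hook(module, inp, out):
    return cache["clean"]  # returning a value replaces the layer's output

handle = model[0].register_forward_hook(patch_hook)
patched_logits = model(corrupted_input)
handle.remove()

# 3. Measure restoration: here everything downstream of the patch sees
# the clean activation, so the clean output is fully restored.
restored = torch.allclose(patched_logits, clean_logits)
print(restored)  # True
```

In real experiments the patch targets a single head or layer at one token position, and restoration is measured on the logit difference between the correct and incorrect answer rather than on exact equality.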
Functional Rift: The paper's term for the observation that early model layers and late model layers perform distinct, almost disjoint types of processing (ontology mapping vs. answer generation)
Induction Heads: Attention heads that copy patterns from the context (e.g., 'A followed B before, so predict B after A'), crucial for in-context learning
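The copying algorithm attributed to induction heads can be sketched in plain Python (a conceptual analogy only, not how attention computes it): remember each bigram seen so far, then predict the token that previously followed the current one.

```python
def induction_predict(tokens):
    """Predict the next token via the induction pattern:
    'A was followed by B before, so predict B after A'."""
    followed_by = {}
    for a, b in zip(tokens, tokens[1:]):
        followed_by[a] = b          # record that b followed a
    return followed_by.get(tokens[-1])  # what followed the last token earlier?

seq = ["cat", "sat", "mat", "cat"]
print(induction_predict(seq))  # -> sat
```

In a transformer this behavior emerges from a two-head circuit (a previous-token head composing with the induction head itself) rather than an explicit lookup table.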
Logit Lens: A method to interpret intermediate layer activations by projecting them into the vocabulary space to see what token they would predict if the model stopped there
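The logit lens reduces to one matrix product: project an intermediate residual-stream vector through the unembedding matrix and read off the top token. A minimal NumPy sketch with made-up dimensions (hidden size 4, vocabulary of 5 tokens); real models would also apply the final layer norm before unembedding.

```python
import numpy as np

rng = np.random.default_rng(0)
W_U = rng.normal(size=(4, 5))  # unembedding matrix: hidden dim -> vocab logits

def logit_lens(hidden_state, unembed):
    """Project an intermediate activation into vocabulary space and
    return the token id the model 'would predict' if it stopped here."""
    logits = hidden_state @ unembed
    return int(np.argmax(logits))

h_layer3 = rng.normal(size=4)  # pretend residual-stream vector from layer 3
token_id = logit_lens(h_layer3, W_U)
print(token_id)
```

Applying this at every layer shows how the model's running "best guess" evolves with depth, which is what makes the lens useful for localizing where an answer forms.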
Hydra Effect: The phenomenon where removing one component of a model causes other components to compensate, making it difficult to isolate specific functions
PrOntoQA: A dataset of synthetic reasoning problems based on fictional ontologies (e.g., 'Numpuses are rompuses') used to test logical capacity without interference from real-world facts