Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning

📝 Paper Summary

Interpretability of Large Reasoning Models, Mechanistic Interpretability, Chain-of-Thought Reasoning Dynamics

Large Reasoning Models exhibit sparse 'Mutual Information peaks' at specific 'thinking tokens' (e.g., 'Wait', 'Therefore') where internal representations become highly informative of the correct answer, enabling methods to enhance reasoning accuracy.

Core Problem

The internal reasoning mechanisms of Large Reasoning Models (LRMs) like DeepSeek-R1 remain a 'black box', making it unclear which intermediate steps critically influence the final correct answer.

Why it matters:

Understanding internal dynamics is crucial for interpreting how LRMs achieve complex problem-solving capabilities
Identifying critical steps can differentiate between meaningful reasoning and mere token generation
Current methods lack visibility into how information about the ground truth evolves during the step-by-step generation process

Concrete Example: When an LRM solves a math problem, it generates hundreds of tokens. It is unknown whether all tokens contribute equally or if specific moments, like a self-correction ('Wait, that's wrong'), carry the bulk of the information required to reach the correct solution.

Key Novelty

MI Peaks Phenomenon & Thinking Tokens

Discovers that Mutual Information (MI) between hidden states and the gold answer spikes suddenly at specific steps (MI Peaks), rather than increasing smoothly
Identifies that these peaks correspond to semantic 'thinking tokens' (e.g., 'Hmm', 'So') that trigger reflection or transition
Proposes reusing these high-information states (Representation Recycling) to improve model performance without retraining

Architecture

Illustration of the MI Peaks phenomenon in an LRM's reasoning trajectory.

Evaluation Highlights

DeepSeek-R1-Distill-Qwen-7B exhibits extremely sparse MI peaks, accounting for only 0.51% of total reasoning steps
Representation Recycling (RR) yields a 20% relative accuracy improvement for DeepSeek-R1-Distill-LLaMA-8B on the AIME24 benchmark
Fully suppressing 'thinking tokens' significantly harms reasoning performance, whereas random token suppression has minimal impact

Breakthrough Assessment

8/10

Provides a novel information-theoretic explanation for LRM reasoning capabilities and successfully links theoretical MI peaks to specific semantic tokens, leading to practical inference-time improvements.

⚙️ Technical Details

Problem Definition

Setting: Analyzing the auto-regressive generation process of LRMs using Information Theory

Inputs: Input query x and the corresponding gold answer y

Outputs: Sequence of Mutual Information values I[h_t; h_y], one per generated token t, where h_t is the hidden state at step t and h_y is the representation of the gold answer

Pipeline Flow

Representation Extraction
HSIC Estimation
Peak Detection

System Modules

Representation Extractor (Analysis Pipeline)

Extracts hidden states for generated tokens and the gold answer

Model or implementation: Target LRM (e.g., DeepSeek-R1-Distill)

HSIC Estimator (Analysis Pipeline)

Estimates the Mutual Information between token representations and the gold answer

Model or implementation: Kernel-based Estimator
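
As a rough illustration of what a kernel-based estimator of this kind computes, the sketch below implements the standard biased empirical HSIC statistic, trace(K H L H) / (n - 1)^2, with Gaussian kernels. The function names, the median-heuristic bandwidth, and the convention that each row is one sample are assumptions of this sketch, not details taken from the paper or its repository.

```python
import numpy as np

def gaussian_kernel(x, sigma=None):
    """Gaussian (RBF) Gram matrix for the rows of x, shape (n, d)."""
    sq_norms = np.sum(x * x, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    if sigma is None:
        # Median heuristic for the bandwidth (an assumption of this sketch).
        positive = sq_dists[sq_dists > 0]
        sigma = np.sqrt(0.5 * np.median(positive)) if positive.size else 1.0
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def hsic(x, y):
    """Biased empirical HSIC between paired samples x (n, dx) and y (n, dy)."""
    n = x.shape[0]
    K = gaussian_kernel(x)
    L = gaussian_kernel(y)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2

# Synthetic stand-ins for hidden states: one dependent pair, one independent pair.
rng = np.random.default_rng(0)
h_t = rng.normal(size=(200, 4))
h_y_dependent = h_t + 0.1 * rng.normal(size=(200, 4))
h_y_independent = rng.normal(size=(200, 4))
score_dependent = hsic(h_t, h_y_dependent)
score_independent = hsic(h_t, h_y_independent)
```

On this synthetic data the dependent pair should score well above the independent pair, which is the property the analysis relies on when it tracks the score across generation steps.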

Peak Detector (Analysis Pipeline)

Identifies steps where MI spikes significantly above the local baseline

Model or implementation: Statistical Thresholding
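
One plausible reading of IQR-based thresholding with a factor tau is the usual upper-fence outlier rule: flag a step as a peak when its MI exceeds Q3 + tau * IQR. The sketch below assumes that rule; the function name and the toy trajectory are made up for illustration and do not come from the paper.

```python
import statistics

def find_mi_peaks(mi_values, tau=1.5):
    """Return indices of steps whose MI exceeds Q3 + tau * IQR."""
    q1, _, q3 = statistics.quantiles(mi_values, n=4, method="inclusive")
    threshold = q3 + tau * (q3 - q1)
    return [t for t, mi in enumerate(mi_values) if mi > threshold]

# Toy trajectory: mostly flat with two spikes (made-up numbers).
trajectory = [0.10, 0.12, 0.11, 0.09, 0.95, 0.10, 0.13, 0.11, 0.88, 0.12]
peaks = find_mi_peaks(trajectory)
print(peaks)  # [4, 8]
```

Because the fence is computed from quartiles of the same trajectory, a few large spikes barely move the threshold, which suits a signal that is flat almost everywhere.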

Novel Architectural Elements

Inference-time intervention based on MI Peaks: Recycling the specific hidden states of 'thinking tokens' to boost reasoning accuracy (Representation Recycling)
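
The idea of recycling a representation can be pictured with a toy example. The sketch below is not the paper's implementation: `toy_layer` is a hypothetical stand-in for a transformer block that acts on each position independently (real attention mixes positions), and the peak positions are assumed to be known in advance.

```python
import numpy as np

def toy_layer(h, weight):
    """Hypothetical stand-in for one block: residual update with tanh."""
    return h + np.tanh(h @ weight)

def forward_with_recycling(hidden, weight, peak_positions, extra_passes=1):
    """Apply the layer once everywhere, then re-apply it at peak positions."""
    out = toy_layer(hidden, weight)
    for t in peak_positions:
        for _ in range(extra_passes):
            # Feed the peak-position state through the same layer again.
            out[t] = toy_layer(out[t], weight)
    return out

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 3))   # 5 positions, hidden size 3
weight = rng.normal(size=(3, 3))
baseline = toy_layer(hidden, weight)
recycled = forward_with_recycling(hidden, weight, peak_positions=[2])
```

Only position 2 differs between `baseline` and `recycled`; every other position gets a single pass, so the extra compute is confined to the sparse peak steps.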

Modeling

Base Model: DeepSeek-R1-Distill-LLaMA-8B, DeepSeek-R1-Distill-Qwen-7B/14B/32B, QwQ-32B

Training Method: Inference-time Analysis and Intervention (Training-free)

Key Hyperparameters:

tau: 1.5 (Threshold factor for defining MI peaks via IQR)

Compute: Not reported in the paper

Comparison to Prior Work

vs. LLaMA-3.1-8B: Base models show lower overall MI and lack the distinct 'MI Peaks' pattern observed in LRMs
vs. Standard Chain-of-Thought: This work identifies that not all CoT steps are equal; specific 'thinking tokens' carry disproportionate information [not cited in paper]

Limitations

Analysis relies on having the ground truth answer (y) to calculate Mutual Information, limiting real-time application of the detection method
Computational cost of HSIC calculation during inference is not explicitly analyzed
The 'thinking tokens' are identified post-hoc in the analysis phase

Reproducibility

Code: https://github.com/ChnQ/MI-Peaks

The paper uses public models (DeepSeek-R1 series, QwQ) and the MATH dataset. Specific hyperparameters for the Representation Recycling method (e.g., number of recycling iterations) are not detailed in the text summarized here.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks with step-by-step solutions

Benchmarks:

MATH (Competition-level mathematics problems)
AIME24 (Advanced mathematics competition)

Metrics:

Mutual Information (estimated via HSIC)
Prediction Error Probability
Accuracy (implied for AIME24 results)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MATH	MI Peak Ratio (%)	100.00	0.51	-99.49

Experiment Figures

Comparison of MI trajectories between an LRM and its corresponding non-reasoning base model.

Impact of suppressing thinking tokens vs. random tokens on reasoning performance.

Main Takeaways

MI Peaks are a distinct signature of Large Reasoning Models (LRMs) like DeepSeek-R1 and are largely absent in standard base models (e.g., LLaMA-3.1), suggesting they emerge from reasoning-intensive training.
These peaks are sparse (<5% of tokens) and non-uniform, typically aligning with reflective tokens like 'Wait', 'Hmm', and 'Therefore'.
Suppressing these thinking tokens degrades performance significantly, whereas suppressing random tokens has minimal impact, which supports their causal role in reasoning.
Representation Recycling (RR) at these peak locations yields substantial performance gains (e.g., 20% relative improvement on AIME24), validating that these states are information-rich.
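
The suppression experiment mentioned in the takeaways can be sketched as logit masking at decoding time: the logits of the banned tokens are set to negative infinity so they can never be sampled. This is a minimal sketch under that assumption. The five-token vocabulary and the ids standing in for 'Wait' and 'Hmm' are hypothetical, and the paper's exact suppression mechanism may differ.

```python
import math

def suppress_tokens(logits, banned_ids):
    """Return a copy of logits with banned token ids set to -inf."""
    masked = list(logits)
    for token_id in banned_ids:
        masked[token_id] = -math.inf
    return masked

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 5-token vocabulary; ids 1 and 3 stand in for 'Wait' and 'Hmm'.
logits = [1.0, 3.0, 0.5, 2.5, 0.0]
probs = softmax(suppress_tokens(logits, banned_ids=[1, 3]))
```

A random-token control would call the same function with `banned_ids` drawn at random from the vocabulary, keeping the number of suppressed tokens equal.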

📚 Prerequisite Knowledge

Prerequisites

Information Theory (Mutual Information, Entropy)
Transformer Architecture (Hidden Representations)
Kernel Methods (RKHS)

Key Terms

LRM: Large Reasoning Model—LLMs trained specifically for complex reasoning (e.g., DeepSeek-R1, o1)

MI: Mutual Information—a measure of the mutual dependence between two random variables (here, the model's hidden state and the correct answer)

HSIC: Hilbert-Schmidt Independence Criterion, a kernel-based measure of statistical dependence, used here to estimate Mutual Information by embedding distributions into a Reproducing Kernel Hilbert Space

Thinking Tokens: Specific tokens (e.g., 'Wait', 'Hmm', 'Therefore') identified by the paper as having high Mutual Information with the correct answer

MI Peaks: Sudden, significant increases in the Mutual Information trajectory during the generation process

Representation Recycling: A proposed method where the informative hidden states at MI peaks are iterated multiple times through the model to refine reasoning

RKHS: Reproducing Kernel Hilbert Space—a space of functions used in kernel methods (like HSIC) to measure dependencies between complex, high-dimensional data

TTTS: Thinking Token based Test-time Scaling—a method that forces the model to generate thinking tokens when extra compute budget is available
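
One way to picture the TTTS idea is a loop that appends a thinking token whenever the model stops while token budget remains, then lets it continue. The sketch below is an assumption-laden toy: `generate_fn` is a hypothetical callable (not a real library API) that returns newly generated tokens, and the stopping logic is a guess at the general shape of such a method, not the paper's procedure.

```python
def generate_with_thinking_budget(generate_fn, prompt, budget, thinking_token="Wait"):
    """Keep generating until the token budget is spent.

    generate_fn(text, max_new_tokens) -> list[str] is a hypothetical
    callable that returns the newly generated tokens.
    """
    tokens = []
    while len(tokens) < budget:
        remaining = budget - len(tokens)
        context = (prompt + " " + " ".join(tokens)) if tokens else prompt
        new_tokens = generate_fn(context, remaining)[:remaining]
        if not new_tokens:
            break
        tokens.extend(new_tokens)
        if budget - len(tokens) > 1:
            # Budget remains: force a thinking token so the model continues.
            tokens.append(thinking_token)
        else:
            break
    return tokens

def fake_model(context, max_new_tokens):
    """Stub model that always emits the same two tokens, then stops."""
    return ["step", "answer"][:max_new_tokens]

output = generate_with_thinking_budget(fake_model, "Q:", budget=7)
```

The check `budget - len(tokens) > 1` avoids ending the output on a dangling thinking token with no room left for the model to follow it.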