I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

Andrey V. Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan V. Oseledets
Artificial Intelligence Research Institute, Moscow Technical University of Communications and Informatics, Skolkovo Institute of Science and Technology, Sberbank, Higher School of Economics
arXiv.org (2025)

📝 Paper Summary

Tags: Mechanistic Interpretability · Sparse Autoencoders (SAEs) · LLM Reasoning
This paper identifies specific internal features in reasoning LLMs that correspond to human-like reasoning behaviors (uncertainty, exploration, reflection) using Sparse Autoencoders and a new metric called ReasonScore.
Core Problem
Reasoning LLMs such as DeepSeek-R1 exhibit complex thinking processes, but their internal mechanisms remain a black box: we observe the models emitting 'thinking' words, but do not know whether specific internal components causally drive this behavior.
Why it matters:
  • Understanding internal reasoning mechanisms is crucial for trust and safety in advanced AI systems
  • Current interpretability methods often fail to isolate high-level abstract behaviors like 'reflection' or 'uncertainty' from general language modeling
  • Identifying these features allows for steering models to potentially improve reasoning performance or trace length
Concrete Example: When a model solves a math problem, it might output 'Wait, let me double-check'. Without interpretability tools, we cannot tell whether this is mere surface-level text generation or whether a specific internal 'reflection' mechanism fired and caused the model to re-evaluate its previous steps.
Key Novelty
ReasonScore-guided Sparse Autoencoder Analysis
  • Constructs a 'Reasoning Vocabulary' by analyzing words that appear more frequently in model thinking traces than in final solutions (e.g., 'maybe', 'alternatively')
  • Introduces ReasonScore, a metric that identifies SAE features which activate specifically during these reasoning moments and their context windows
  • Validates features through 'Model Diffing', showing they emerge only after reasoning fine-tuning, and through steering experiments that enhance benchmark performance
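The 'Reasoning Vocabulary' step above can be sketched as a simple frequency comparison: words that are over-represented in thinking traces relative to final solutions. This is a minimal illustrative reconstruction; the paper's exact tokenization, counting, and thresholds (here `min_count` and `ratio_threshold` are assumed names) may differ.

```python
from collections import Counter

def reasoning_vocabulary(thinking_traces, final_solutions,
                         min_count=5, ratio_threshold=4.0):
    """Words over-represented in thinking traces vs. final solutions.

    Hypothetical sketch of the paper's 'reasoning vocabulary' construction;
    thresholds and smoothing are illustrative assumptions, not the paper's.
    """
    think_counts = Counter(w for t in thinking_traces for w in t.lower().split())
    sol_counts = Counter(w for s in final_solutions for w in s.lower().split())
    think_total = sum(think_counts.values()) or 1
    sol_total = sum(sol_counts.values()) or 1

    vocab = {}
    for word, count in think_counts.items():
        if count < min_count:
            continue  # skip rare words
        p_think = count / think_total
        # add-one smoothing so words absent from solutions don't divide by zero
        p_sol = (sol_counts[word] + 1) / (sol_total + len(sol_counts))
        ratio = p_think / p_sol
        if ratio >= ratio_threshold:
            vocab[word] = ratio
    return vocab
```

On real DeepSeek-R1 traces, hedging words like 'maybe' and 'alternatively' would be expected to surface with high ratios, matching the examples the paper cites; ReasonScore then ranks SAE features by how specifically they activate on these words and their surrounding context.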
Evaluation Highlights
  • +2.2 percentage points accuracy improvement on MATH-500 by steering Feature #4395 (DeepSeek-R1-Llama-8B)
  • +4.0 percentage points accuracy improvement on GPQA Diamond by steering Feature #16778
  • Increases reasoning trace length by +20.5% on MATH-500 when steering Feature #16778, confirming a causal link to reasoning depth
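The steering experiments behind these numbers amount to adding a scaled SAE decoder direction to the residual stream at a chosen layer. The sketch below shows the core arithmetic only; the paper's actual steering strength, layer choice, and hook implementation are not specified here, and `alpha` is an assumed parameter name.

```python
import numpy as np

def steer_hidden_states(hidden, decoder, feature_id, alpha=2.0):
    """Add a scaled SAE decoder direction to every token's hidden state.

    Illustrative sketch of activation steering, assuming:
      hidden:  (seq_len, d_model) activations at the hooked layer
      decoder: (n_features, d_model) SAE decoder weight matrix
      alpha:   steering strength (a free hyperparameter here)
    """
    direction = decoder[feature_id]
    direction = direction / np.linalg.norm(direction)  # unit-norm direction
    return hidden + alpha * direction  # broadcast over the sequence dimension
```

In practice this would run inside a forward hook on a mid-layer of the model (e.g. DeepSeek-R1-Llama-8B), steering a feature such as #4395 or #16778 during generation.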
Breakthrough Assessment
8/10
Provides compelling mechanistic evidence linking specific sparse features to high-level reasoning behaviors and demonstrates that these features can be steered to improve performance.