I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

Andrey V. Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan V. Oseledets
Artificial Intelligence Research Institute, Moscow Technical University of Communications and Informatics, Skolkovo Institute of Science and Technology, Sberbank, Higher School of Economics
arXiv.org (2025)

📝 Paper Summary

Tags: Mechanistic Interpretability · Sparse Autoencoders (SAEs) · LLM Reasoning
This paper identifies specific internal features in reasoning LLMs that correspond to human-like reasoning behaviors (uncertainty, exploration, reflection) using Sparse Autoencoders and a new metric called ReasonScore.
Core Problem
Reasoning LLMs such as DeepSeek-R1 exhibit complex thinking processes, but their internal mechanisms remain a black box: we observe the models emitting 'thinking' words, but do not know whether specific internal components causally drive this behavior.
Why it matters:
  • Understanding internal reasoning mechanisms is crucial for trust and safety in advanced AI systems
  • Current interpretability methods often fail to isolate high-level abstract behaviors like 'reflection' or 'uncertainty' from general language modeling
  • Identifying these features allows for steering models to potentially improve reasoning performance or trace length
Concrete Example: When a model solves a math problem, it might output 'Wait, let me double-check'. Without interpretability tools, we cannot tell whether this is mere surface-level text generation or whether a specific internal 'reflection' mechanism fired and caused the model to re-evaluate its previous steps.
Key Novelty
ReasonScore-guided Sparse Autoencoder Analysis
  • Constructs a 'Reasoning Vocabulary' by analyzing words that appear more frequently in model thinking traces than in final solutions (e.g., 'maybe', 'alternatively')
  • Introduces ReasonScore, a metric that identifies SAE features which activate specifically during these reasoning moments and their context windows
  • Validates features through 'Model Diffing', showing they emerge only after reasoning fine-tuning, and through steering experiments that enhance benchmark performance
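The 'Reasoning Vocabulary' step above can be sketched as a simple frequency comparison: words that are over-represented in thinking traces relative to final solutions. This is a minimal illustrative reconstruction; the paper's exact tokenization, counting, and thresholds (here `min_count` and `ratio_threshold` are assumed names) may differ.

```python
from collections import Counter

def reasoning_vocabulary(thinking_traces, final_solutions,
                         min_count=5, ratio_threshold=4.0):
    """Words over-represented in thinking traces vs. final solutions.

    Hypothetical sketch of the paper's 'reasoning vocabulary' construction;
    thresholds and smoothing are illustrative assumptions, not the paper's.
    """
    think_counts = Counter(w for t in thinking_traces for w in t.lower().split())
    sol_counts = Counter(w for s in final_solutions for w in s.lower().split())
    think_total = sum(think_counts.values()) or 1
    sol_total = sum(sol_counts.values()) or 1

    vocab = {}
    for word, count in think_counts.items():
        if count < min_count:
            continue  # skip rare words
        p_think = count / think_total
        # add-one smoothing so words absent from solutions don't divide by zero
        p_sol = (sol_counts[word] + 1) / (sol_total + len(sol_counts))
        ratio = p_think / p_sol
        if ratio >= ratio_threshold:
            vocab[word] = ratio
    return vocab
```

On real DeepSeek-R1 traces, hedging words like 'maybe' and 'alternatively' would be expected to surface with high ratios, matching the examples the paper cites; ReasonScore then ranks SAE features by how specifically they activate on these words and their surrounding context.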
Evaluation Highlights
  • +2.2 percentage points accuracy improvement on MATH-500 by steering Feature #4395 (DeepSeek-R1-Llama-8B)
  • +4.0 percentage points accuracy improvement on GPQA Diamond by steering Feature #16778
  • Increases reasoning trace length by +20.5% on MATH-500 when steering Feature #16778, confirming a causal link to reasoning depth
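The steering experiments behind these numbers amount to adding a scaled SAE decoder direction to the residual stream at a chosen layer. The sketch below shows the core arithmetic only; the paper's actual steering strength, layer choice, and hook implementation are not specified here, and `alpha` is an assumed parameter name.

```python
import numpy as np

def steer_hidden_states(hidden, decoder, feature_id, alpha=2.0):
    """Add a scaled SAE decoder direction to every token's hidden state.

    Illustrative sketch of activation steering, assuming:
      hidden:  (seq_len, d_model) activations at the hooked layer
      decoder: (n_features, d_model) SAE decoder weight matrix
      alpha:   steering strength (a free hyperparameter here)
    """
    direction = decoder[feature_id]
    direction = direction / np.linalg.norm(direction)  # unit-norm direction
    return hidden + alpha * direction  # broadcast over the sequence dimension
```

In practice this would run inside a forward hook on a mid-layer of the model (e.g. DeepSeek-R1-Llama-8B), steering a feature such as #4395 or #16778 during generation.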
Breakthrough Assessment
8/10
Provides compelling mechanistic evidence linking specific sparse features to high-level reasoning behaviors and demonstrates that these features can be steered to improve performance.