Sparse Autoencoder (SAE): A neural network trained to decompose dense model activations into a sparse set of interpretable features (directions)
ReasonScore: A metric proposed in this paper to quantify how strongly and specifically a feature activates on reasoning-related vocabulary within a context window
Feature Steering: Intervening on the model's internal state by clamping or amplifying specific feature activations during inference to modify behavior
Model Diffing: A technique that compares features across versions of a model (e.g., base vs. fine-tuned) to pinpoint when specific capabilities emerge
Reasoning Vocabulary: A set of words (e.g., 'however', 'perhaps') identified as statistically over-represented in the model's 'thinking' process compared to final solutions
Chain-of-Thought (CoT): A prompting or generation style where the model produces intermediate reasoning steps before the final answer
L0 norm: A measure of sparsity, counting the number of non-zero elements in a vector
DeepSeek-R1: A family of reasoning-specialized LLMs trained via reinforcement learning to generate long chains of thought
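To make the SAE and L0 entries concrete, here is a minimal numpy sketch of a ReLU sparse autoencoder's encode/decode pass. All sizes and parameters are illustrative (random, not trained); real SAEs learn these weights and use far larger feature dictionaries.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 8, 32  # toy sizes; real SAEs use thousands of features

# Hypothetical SAE parameters (random here; learned during training).
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = rng.normal(size=d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = rng.normal(size=d_model)

def sae_encode(x):
    # ReLU encoder: most feature activations are exactly zero (sparsity).
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f):
    # Reconstruct the dense activation from the sparse feature vector.
    return f @ W_dec + b_dec

x = rng.normal(size=d_model)   # a dense model activation
f = sae_encode(x)              # sparse, (ideally) interpretable features
x_hat = sae_decode(f)          # reconstruction of the activation

l0 = int((f > 0).sum())        # L0 norm: number of non-zero (active) features
```

The L0 norm here is simply the count of active features per input; training trades reconstruction error of `x_hat` against keeping this count low.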
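Feature steering, as defined above, can be sketched as clamping one entry of a sparse feature vector before it is decoded back into the model's residual stream. The feature vector and index below are purely illustrative, not actual features from the paper.

```python
import numpy as np

# Hypothetical sparse feature activations from an SAE encoder.
features = np.array([0.0, 1.3, 0.0, 0.2, 0.0])

def steer(features, feature_idx, clamp_value):
    """Clamp one feature to a fixed value. Decoding the steered vector
    and substituting it into the forward pass biases generation toward
    (or away from) the behavior that feature represents."""
    steered = features.copy()
    steered[feature_idx] = clamp_value
    return steered

amplified = steer(features, feature_idx=1, clamp_value=5.0)  # amplify feature 1
ablated = steer(features, feature_idx=1, clamp_value=0.0)    # ablate feature 1
```

Setting a feature well above its natural activation range amplifies the associated behavior, while clamping it to zero ablates it; the original vector is left untouched so the intervention is easy to toggle.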