Schoenfeld's Anatomy of Mathematical Reasoning by Language Models

📝 Paper Summary

Interpretability of Chain-of-Thought, Mathematical Reasoning Analysis, Cognitive Architectures for LLMs

ThinkARM applies Schoenfeld's Episode Theory to automatically segment LLM reasoning traces into functional stages such as Explore and Verify, revealing that reasoning models, unlike standard models, exhibit iterative exploration-verification loops.

Core Problem

Current evaluations of reasoning models rely on outcome-oriented metrics (accuracy, length) or token-level statistics, which fail to capture the functional structure and intermediate dynamics of how models actually think.

Why it matters:

Longer reasoning chains ('overthinking') do not always yield more correct answers, yet we lack tools to diagnose why specific long traces fail
Functional behaviors like 'exploration' vs 'execution' are discussed intuitively but lack rigorous quantification, making it hard to compare reasoning styles across models
Understanding the internal structure of reasoning is crucial for diagnosing failures and distinguishing genuine reasoning from mere pattern matching or memorization

Concrete Example: When solving a math problem, a standard model might jump straight to 'Implement' (calculating values). A reasoning model might first 'Analyze' the problem structure, then 'Explore' a hypothesis, then 'Verify' it. Current metrics only see that the second model generated more tokens, missing the structural difference in cognitive control.

Key Novelty

ThinkARM (Anatomy of Reasoning in Models)

Adapts Schoenfeld's cognitive science framework (originally for human problem solving) to abstract LLM token sequences into eight functional episodes (e.g., Read, Analyze, Explore, Implement, Verify)
Uses a strong LLM as an automated annotator to label sentences in reasoning traces at scale, enabling quantitative analysis of 'cognitive heartbeats' across different model families

Architecture

Overview of the ThinkARM framework, illustrating the mapping from raw reasoning tokens to functional episode labels

Evaluation Highlights

Reasoning models (e.g., DeepSeek-R1) allocate a much larger share of their token budget to 'Analyze' and 'Explore' episodes than non-reasoning models, which are dominated by 'Implement'
Correct solutions are strongly associated with 'Explore → Monitor' and 'Explore → Analyze' transitions, while incorrect solutions often show 'Explore' leading directly to 'Verify' or premature termination
Distilled reasoning models (e.g., R1-Distill-Qwen-1.5B) preserve the episode allocation structure of their teacher models despite being much smaller

Breakthrough Assessment

8/10

Provides a novel, theory-grounded lens for analyzing CoT that moves beyond token length. The findings on 'cognitive heartbeats' and the distinct structural signatures of reasoning models are empirically strong and insightful.

⚙️ Technical Details

Problem Definition

Setting: Mathematical problem solving with explicit reasoning traces

Inputs: Math problem statement $P$

Outputs: Reasoning trace consisting of a sequence of sentences $S = (s_1, ..., s_T)$ and a final answer

Pipeline Flow

Problem Sampling (from Omni-MATH)
Trace Generation (15 diverse LLMs)
Sentence Segmentation
Episode Annotation (Automated via GPT-5)
Pattern Analysis (Aggregation & Transitions)
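
The middle stages of this flow (segmentation, annotation, aggregation) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the regex splitter and the keyword-based `toy_annotator` are hypothetical stand-ins (the paper uses an LLM annotator with the prompt in its Appendix F), and only the eight episode names come from the summary.

```python
import re
from collections import Counter
from typing import Callable, List

# Schoenfeld's seven episodes plus the added 'Answer' episode.
EPISODES = ["Read", "Analyze", "Plan", "Implement",
            "Explore", "Verify", "Monitor", "Answer"]

def split_sentences(trace: str) -> List[str]:
    """Naive segmentation on ., ! or ? followed by whitespace (an assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", trace.strip())
    return [p for p in parts if p]

def toy_annotator(sentence: str) -> str:
    """Hypothetical keyword rules standing in for the LLM annotator."""
    s = sentence.lower()
    if s.startswith("the answer is"):
        return "Answer"
    if "maybe" in s or "what if" in s:
        return "Explore"
    if "check" in s or "verify" in s:
        return "Verify"
    if "wait" in s:
        return "Monitor"
    return "Implement"

def annotate_trace(trace: str, annotate: Callable[[str], str]) -> List[str]:
    """Label every sentence and reject labels outside the episode set."""
    labels = [annotate(s) for s in split_sentences(trace)]
    unknown = set(labels) - set(EPISODES)
    if unknown:
        raise ValueError(f"unknown episode labels: {sorted(unknown)}")
    return labels

trace = ("Maybe try small cases first. Compute 2 + 3 = 5. "
         "Let me check that sum. The answer is 5.")
labels = annotate_trace(trace, toy_annotator)
counts = Counter(labels)  # aggregation step: sentences per episode
```

Passing the annotator in as a callable keeps the segmentation and aggregation logic independent of whichever model does the labeling.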

System Modules

Annotator

Classify each sentence in a reasoning trace into one of 8 episode categories

Model or implementation: GPT-5 (selected via gold-set evaluation)
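
A gold-set evaluation of this kind compares a candidate annotator's labels with human labels on the same sentences. The sketch below computes raw accuracy and a chance-corrected kappa; the summary does not say which kappa variant the paper uses, so Cohen's kappa is an assumption here, and the four-sentence example is made up for illustration.

```python
from collections import Counter
from typing import Sequence

def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of sentences where the annotator matches the human label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), agreement corrected for chance."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if both raters labeled independently
    # according to their own label frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)

# Illustrative labels only, not data from the paper.
gold = ["Explore", "Explore", "Implement", "Implement"]
pred = ["Explore", "Implement", "Implement", "Implement"]
acc = accuracy(gold, pred)
kappa = cohen_kappa(gold, pred)
```

In this toy case three of four labels agree (accuracy 0.75), but half of that agreement is expected by chance, so kappa drops to 0.5.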

Novel Architectural Elements

Adoption of Schoenfeld's 7-episode taxonomy + 'Answer' episode as a discrete state space for LLM reasoning analysis
Sentence-level automated annotation pipeline replacing manual or hierarchical annotation schemes

Modeling

Base Model: GPT-5 (as the annotation engine)

Training Method: In-context learning / Prompting (System does not train the models being analyzed)

Adaptation: None (Prompt-based)

Trainable Parameters: None

Compute: Not reported in the paper

Comparison to Prior Work

vs. Li et al. (2025): Scales to 15 models and 410k sentences; uses strictly sentence-level annotation for uniform comparison; introduces 'Answer' episode
vs. Outcome metrics: Explains *how* models reason (process) rather than just *how well* they answer (outcome)
vs. Surface statistics: Abstracts tokens into functional cognitive steps, enabling analysis of strategy (e.g., 'Explore -> Monitor' loops)

Limitations

Relies on GPT-5 for annotation, which may have its own biases or errors despite high agreement with humans
Analysis is restricted to mathematical reasoning; generalization to coding or commonsense reasoning is not tested
The 'Answer' episode is an artificial addition to Schoenfeld's theory to accommodate LLM output formats
Correlational analysis does not imply causation (e.g., forcing a model to 'Monitor' more might not improve correctness)

Reproducibility

Code: https://github.com/MingLiiii/ThinkARM

The code and data are publicly available at https://github.com/MingLiiii/ThinkARM. The 7,067 human-annotated gold-standard sentences are included. The prompt for automated annotation is in Appendix F.

📊 Experiments & Results

Evaluation Setup

Large-scale analysis of reasoning traces from 15 LLMs on 100 Omni-MATH problems

Benchmarks:

Omni-MATH (subset) (Mathematical reasoning)

Metrics:

Episode distribution (token allocation %)
Transition probability (N-gram Mutual Information)
Annotation Accuracy / Kappa score (for annotator selection)
Coefficient magnitude in correctness prediction (Logistic Regression)
Statistical methodology: Mutual Information (MI) for transition patterns; Lasso-regularized logistic regression for feature importance
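
To make the transition metric concrete, the sketch below extracts episode n-grams from a label sequence and computes the mutual information between "trace contains a given n-gram" and a binary trace-level attribute. The summary does not spell out the paper's exact MI formulation, so pairing n-gram presence with correctness is an assumption (a reasoning vs. non-reasoning indicator would work the same way), and the four traces are invented.

```python
import math
from typing import List, Sequence, Tuple

def episode_ngrams(labels: Sequence[str], n: int = 2) -> List[Tuple[str, ...]]:
    """All runs of n consecutive episode labels in a trace."""
    return [tuple(labels[i:i + n]) for i in range(len(labels) - n + 1)]

def mutual_information(xs: Sequence[bool], ys: Sequence[bool]) -> float:
    """Empirical mutual information (in bits) between two binary variables."""
    n = len(xs)
    mi = 0.0
    for x in (False, True):
        for y in (False, True):
            joint = sum(1 for a, b in zip(xs, ys) if a == x and b == y) / n
            p_x = sum(1 for a in xs if a == x) / n
            p_y = sum(1 for b in ys if b == y) / n
            if joint > 0:
                mi += joint * math.log2(joint / (p_x * p_y))
    return mi

# Illustrative (labels, is_correct) pairs, not data from the paper.
traces = [
    (["Explore", "Monitor", "Implement"], True),
    (["Analyze", "Explore", "Monitor"], True),
    (["Explore", "Verify", "Answer"], False),
    (["Implement", "Answer"], False),
]
target = ("Explore", "Monitor")
has_ngram = [target in episode_ngrams(labels) for labels, _ in traces]
correct = [ok for _, ok in traces]
mi = mutual_information(has_ngram, correct)
```

Here the bigram appears in exactly the correct traces, so it carries one full bit of information about correctness; real data would give values much closer to zero.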

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Annotation model selection shows GPT-5 aligns best with human judgment.
Gold Set (Reasoning Models)	Accuracy	59.3	73.2	+13.9
Gold Set (Reasoning Models)	Kappa	47.2	64.5	+17.3
Allocations reveal distinct profiles: Reasoning models invest in 'Analyze' and 'Explore', while baselines dominate in 'Implement'.
Reasoning Traces	Analyze Token %	2.8	11.1	+8.3
Reasoning Traces	Explore Token %	0.4	9.3	+8.9
Reasoning Traces	Implement Token %	93.4	49.6	-43.8
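
The token-percentage rows above are per-episode shares of a trace's total tokens. A minimal sketch of that computation follows, assuming whitespace-delimited tokens (the paper presumably counts model tokens) and using made-up annotated sentences.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def token_allocation(annotated: List[Tuple[str, str]]) -> Dict[str, float]:
    """Percent of whitespace tokens spent in each episode.

    `annotated` holds (sentence, episode_label) pairs.
    """
    totals: Dict[str, int] = defaultdict(int)
    for sentence, label in annotated:
        totals[label] += len(sentence.split())
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: 100.0 * n / grand_total for label, n in totals.items()}

# Illustrative sentences only, not data from the paper.
annotated = [
    ("Let me restate the problem", "Read"),          # 5 tokens
    ("Try a small case first", "Explore"),           # 5 tokens
    ("Compute the sum of both terms", "Implement"),  # 6 tokens
    ("The total is 12", "Implement"),                # 4 tokens
]
allocation = token_allocation(annotated)
```

Weighting by tokens rather than by sentence count matters because a single long 'Implement' sentence can outweigh several short 'Monitor' remarks.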

Experiment Figures

Word clouds of the most frequent tokens associated with each episode

Temporal evolution of episode frequency over normalized progress (0-100%)
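
A temporal profile like the one in this figure can be approximated by binning each sentence by its relative position in the trace and computing the share of each episode per bin. The sketch below uses sentence index as the progress measure and four bins; both choices are assumptions, since the summary only says progress is normalized to 0-100%.

```python
from typing import Dict, List

def episode_profile(labels: List[str], n_bins: int = 4) -> Dict[str, List[float]]:
    """Share of each episode within each normalized-progress bin."""
    if not labels:
        return {}
    counts = {episode: [0] * n_bins for episode in set(labels)}
    bin_sizes = [0] * n_bins
    for i, label in enumerate(labels):
        # Map sentence position i in [0, len) onto a bin in [0, n_bins).
        b = min(i * n_bins // len(labels), n_bins - 1)
        counts[label][b] += 1
        bin_sizes[b] += 1
    return {
        episode: [c / size if size else 0.0 for c, size in zip(row, bin_sizes)]
        for episode, row in counts.items()
    }

# Illustrative trace: analysis early, implementation mid-trace, verification late.
labels = ["Analyze", "Analyze", "Implement", "Implement",
          "Implement", "Implement", "Verify", "Verify"]
profile = episode_profile(labels, n_bins=4)
```

Averaging such profiles over many traces per model would give curves comparable across models regardless of trace length.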

Main Takeaways

A 'Cognitive Heartbeat' exists across models: Analysis/Planning decay early, Implementation peaks in the middle, and Verification/Monitoring rise towards the end
Reasoning models are distinguished by iterative loops (Explore-Monitor, Verify-Explore) rather than linear feed-forward execution
Correctness is positively predicted by transitions from exploration to monitoring (Explore -> Monitor), suggesting successful error detection or strategy adjustment
Efficiency-oriented models selectively suppress evaluative steps (Verify/Monitor) rather than uniformly shortening the trace
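
For reference, the correctness-prediction analysis behind the third takeaway corresponds to the standard L1-penalized logistic regression objective. The notation is assumed here rather than taken from the paper: $x_i$ is a feature vector for trace $i$ (for example, counts of episode transitions), $y_i \in \{0, 1\}$ marks whether the final answer is correct, and $\lambda$ is the regularization strength.

```latex
\begin{aligned}
z_i &= \beta_0 + x_i^{\top}\beta, \qquad
\sigma(z) = \frac{1}{1 + e^{-z}}, \\
(\hat{\beta}_0, \hat{\beta}) &= \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}
  \Bigl[\, y_i \log \sigma(z_i) + (1 - y_i)\log\bigl(1 - \sigma(z_i)\bigr) \Bigr]
  + \lambda \lVert \beta \rVert_1
\end{aligned}
```

The $\lambda \lVert \beta \rVert_1$ term drives many coefficients exactly to zero, so the transitions that keep a nonzero $\hat{\beta}_j$ are the ones the model selects as predictive, and the sign of $\hat{\beta}_j$ indicates whether the transition is associated with correct or incorrect answers. As the Limitations note, this is an association, not a causal effect.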

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Basic understanding of LLM generation (tokens, sampling)
Familiarity with reasoning heuristics (exploration, verification)

Key Terms

Schoenfeld's Episode Theory: A cognitive science framework that segments problem-solving into functional episodes like Read, Analyze, Plan, Implement, Explore, Verify, and Monitor

ThinkARM: The proposed framework (Anatomy of Reasoning in Models) for automatically annotating LLM reasoning traces with episode labels

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer

Reasoning Models: Models specifically trained (often via RL) to generate long, detailed reasoning traces (e.g., OpenAI o1, DeepSeek-R1)

Non-Reasoning Models: Standard instruction-following models that typically jump to execution without extended exploration or verification loops

Episode N-grams: Sequences of $N$ consecutive episode labels used to analyze transition patterns (e.g., Explore → Monitor)

Cognitive Heartbeat: The characteristic temporal distribution of episodes over a reasoning trace (e.g., Analyze decaying early, Implement peaking in the middle, Verify rising at the end)

Omni-MATH: A challenging mathematical reasoning benchmark used as the source for problem statements in this study

Lasso-regularized logistic regression: A regression analysis method that uses L1 regularization to select a sparse set of most important features for prediction

Kappa score: A statistical measure of inter-rater reliability (agreement) between annotators, correcting for chance agreement