Schoenfeld's Episode Theory: A cognitive science framework that segments problem-solving into functional episodes like Read, Analyze, Plan, Implement, Explore, Verify, and Monitor
ThinkARM: The proposed framework (Anatomy of Reasoning in Models) for automatically annotating LLM reasoning traces with episode labels
CoT: Chain-of-Thought, a prompting technique where models generate intermediate reasoning steps before the final answer
Reasoning Models: Models specifically trained (often via RL) to generate long, detailed reasoning traces (e.g., OpenAI o1, DeepSeek-R1)
Non-Reasoning Models: Standard instruction-following models that typically jump to execution without extended exploration or verification loops
Episode N-grams: Sequences of $N$ consecutive episode labels used to analyze transition patterns (e.g., Explore → Monitor)
Cognitive Heartbeat: The characteristic temporal distribution of episodes over a reasoning trace (e.g., Analyze decaying early, Implement peaking in the middle, Verify rising at the end)
Omni-MATH: A challenging mathematical reasoning benchmark used as the source for problem statements in this study
Lasso-regularized logistic regression: Logistic regression with an L1 (lasso) penalty, which shrinks many coefficients to exactly zero and so selects a sparse set of the most predictive features
Kappa score: A statistical measure of inter-rater reliability (agreement) between annotators, correcting for chance agreement
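
Two of the quantitative terms above, episode N-grams and the kappa score, can be made concrete with a short sketch. This is illustrative only and is not the study's code: the function names `episode_ngrams` and `cohen_kappa` are hypothetical, the label sequences are invented, and the two-annotator Cohen's kappa is an assumption, since the glossary does not say which kappa variant is used.

```python
from collections import Counter


def episode_ngrams(labels, n):
    """Count every run of n consecutive episode labels in one trace."""
    return Counter(tuple(labels[i:i + n]) for i in range(len(labels) - n + 1))


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two annotators labelling the same segments.

    Assumption: two raters and nominal labels; the source does not
    specify the kappa variant.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    total = len(rater_a)
    # Observed agreement: fraction of segments given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / total
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / total**2
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical episode trace, invented for illustration.
trace = ["Read", "Analyze", "Explore", "Monitor", "Explore", "Monitor", "Verify"]
bigrams = episode_ngrams(trace, 2)
print(bigrams[("Explore", "Monitor")])  # 2

# Hypothetical annotations of the same six segments by two annotators.
human = ["Read", "Analyze", "Plan", "Implement", "Verify", "Verify"]
model = ["Read", "Analyze", "Plan", "Implement", "Implement", "Verify"]
print(round(cohen_kappa(human, model), 3))  # 0.793
```

In the second example the annotators agree on 5 of 6 segments (observed agreement 5/6) while chance agreement from the marginals is 7/36, giving a kappa of 23/29, which is lower than the raw agreement rate.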