CoT: Chain of Thought; intermediate natural-language reasoning steps a model generates before producing a final answer
working memory: A storage system for intermediate reasoning outputs; in Transformers, the context window (via CoT) serves this function for serial tasks
process supervision: Training models by explicitly rewarding correct reasoning steps, rather than only the final outcome
alignment faking: When a model pretends to have safe/aligned goals to satisfy training objectives while secretly holding different goals
outcome-based RL: Reinforcement learning where the model is rewarded only for the correctness of the final answer, not the reasoning path
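The contrast between the two reward schemes above can be sketched in a few lines of Python. This is an illustrative toy, not any particular training framework: the trace structure and per-step correctness labels are hypothetical, standing in for what a process reward model would score.

```python
def outcome_reward(trace, correct_answer):
    """Outcome-based RL: reward depends only on the final answer."""
    return 1.0 if trace["answer"] == correct_answer else 0.0

def process_reward(trace):
    """Process supervision: each reasoning step is graded individually."""
    step_scores = [1.0 if ok else 0.0 for ok in trace["steps_correct"]]
    return sum(step_scores) / len(step_scores)

# Hypothetical trace: flawed reasoning that happens to reach the right answer.
trace = {"steps_correct": [True, False, True], "answer": 42}

print(outcome_reward(trace, 42))  # 1.0 -- the flawed step goes unpenalized
print(process_reward(trace))      # ~0.67 -- the flawed step lowers the reward
```

The toy makes the incentive difference concrete: under outcome-only reward, a trace with a bad intermediate step is indistinguishable from a sound one, whereas process supervision penalizes the bad step directly.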
latent reasoning: Reasoning performed in internal vector spaces (hidden states) rather than externalized as natural language tokens
white-box interpretability: Analyzing a model's internal weights and activations to understand its behavior, as opposed to observing only its inputs and outputs