RLVR: Reinforcement Learning with Verifiable Rewards—a training method that optimizes models based on the correctness of the final answer (e.g., math problems) rather than human preference labels
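As a rough illustration of what makes a reward "verifiable" (the function name and exact-match rule here are hypothetical simplifications; real setups often normalize or parse the answer first), the reward is computed by a program rather than a learned preference model:

```python
def verifiable_reward(model_answer: str, gold_answer: str) -> float:
    # RLVR replaces human preference labels with a programmatic check:
    # reward 1.0 iff the model's final answer matches the verified solution.
    return 1.0 if model_answer.strip() == gold_answer.strip() else 0.0
```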
self-reflection: The ability of a model to revisit, evaluate, and revise its own reasoning process, often marked by tokens like 'wait'
difference-of-means: A method to find a direction in activation space by subtracting the average hidden state of one class (e.g., non-reflective responses) from that of another (e.g., reflective responses)
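A minimal sketch of the computation (function name and normalization choice are illustrative assumptions; inputs are stacks of hidden states collected from each class):

```python
import numpy as np

def difference_of_means(reflective: np.ndarray, non_reflective: np.ndarray) -> np.ndarray:
    # Each input has shape (n_samples, d_model).
    # The candidate direction is the mean reflective activation
    # minus the mean non-reflective activation.
    direction = reflective.mean(axis=0) - non_reflective.mean(axis=0)
    # Normalize so steering strength is set solely by a later coefficient.
    return direction / np.linalg.norm(direction)
```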
activation steering: Modifying the internal hidden states of a model during inference to influence its behavior without changing weights
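The intervention itself is typically a single vector addition at a chosen layer during the forward pass, as in this sketch (the function name and the scaling coefficient `alpha` are hypothetical; in practice this runs inside a forward hook):

```python
import numpy as np

def apply_steering(hidden_state: np.ndarray, direction: np.ndarray, alpha: float = 8.0) -> np.ndarray:
    # Add a scaled steering vector to the activation at inference time.
    # Model weights are never modified; alpha controls steering strength,
    # and a negative alpha suppresses the behavior instead.
    return hidden_state + alpha * direction
```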
residual stream: The primary vector pathway in a Transformer where information is added by attention and MLP layers
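The additive structure can be sketched in a toy block (function names are placeholders; layer norms and multi-head details are omitted): each sublayer reads the residual stream and writes its output back by addition, which is what makes vector-level interventions like steering possible.

```python
import numpy as np

def transformer_block(residual: np.ndarray, attn_fn, mlp_fn) -> np.ndarray:
    # Attention reads the stream and adds its contribution.
    residual = residual + attn_fn(residual)
    # The MLP then reads the updated stream and adds its contribution.
    residual = residual + mlp_fn(residual)
    return residual
```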
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer