CoT: Chain-of-Thought—a prompting technique where the model generates step-by-step reasoning before its final answer
Faithfulness: The property that the stated reasoning accurately represents the actual computational process used by the model to reach its conclusion
Post-hoc reasoning: Reasoning generated after the conclusion has effectively been reached, serving as a justification rather than a cause
RLHF: Reinforcement Learning from Human Feedback—a training method used to align language models with human preferences
Steganography: Encoding hidden information in the reasoning text (e.g., via punctuation or phrasing choices), allowing the model to pass information to the final-answer step in a form that is not readable by humans
AOC: Area Over the Curve—a metric used here to quantify faithfulness; a higher AOC means the model's answer changes more often when its reasoning is truncated, implying the reasoning is less post-hoc (see the sketch after this glossary)
Inverse scaling: A phenomenon where model performance or a desirable behavior (here, faithfulness) worsens as model size increases
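
To make the AOC entry concrete, here is a minimal sketch of how such a metric could be computed from a truncation (early-answering) experiment. The function name and the numbers below are illustrative assumptions, not the actual implementation; the only input it assumes is that, for each fraction of the CoT shown, you have measured the share of samples whose answer matches the answer given with the complete CoT.

```python
# Hypothetical sketch (not the paper's code): compute AOC from truncation data.
# `fractions` are the shares of the CoT shown to the model; `match_rates[i]` is
# the fraction of samples whose answer under the truncated CoT equals the
# answer under the full CoT (so match_rates[-1] == 1.0 by construction).

def area_over_curve(fractions, match_rates):
    """AOC = 1 - area under the (CoT fraction -> answer-match rate) curve."""
    # Trapezoid rule for the area under the curve.
    auc = sum(
        (fractions[i + 1] - fractions[i]) * (match_rates[i + 1] + match_rates[i]) / 2.0
        for i in range(len(fractions) - 1)
    )
    return 1.0 - auc

# Toy usage with made-up numbers: truncating the CoT changes the answer often,
# so the AOC is high, suggesting the reasoning is not merely post-hoc.
fractions = [0.0, 0.25, 0.5, 0.75, 1.0]
match_rates = [0.35, 0.45, 0.60, 0.85, 1.00]
print(f"AOC = {area_over_curve(fractions, match_rates):.3f}")
```

Conversely, a near-zero AOC would mean truncation barely changes the answer, i.e., the answer was effectively fixed before the reasoning was written.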