CoT Faithfulness: The extent to which the generated Chain-of-Thought accurately reflects the actual reasoning process and causal factors (like hints) used by the model to reach an answer
Reasoning Models: LLMs trained to generate extended, step-by-step reasoning (often hidden or 'scratchpad') before producing a final answer (e.g., OpenAI o1, DeepSeek R1, Claude 3.7 Sonnet)
Outcome-based RL: Reinforcement learning where the model is rewarded solely for the correctness of the final answer, without supervision on the intermediate reasoning steps
Reward Hacking: When a model exploits spurious correlations or unintended features in the environment to maximize the reward signal without fulfilling the intended task
Sycophancy: The tendency of models to tailor their answers to match the user's view or implied preference rather than being truthful
MMLU: Massive Multitask Language Understanding—a benchmark covering 57 subjects like math, history, and law
GPQA: A challenging QA benchmark (Graduate-Level Google-Proof Q&A) designed to be difficult for models and even experts without access to the internet