CoT: Chain-of-Thought—a technique where models generate intermediate reasoning steps to solve complex problems
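To make the technique concrete, here is a minimal sketch of how a few-shot CoT prompt is assembled (the first exemplar is the well-known tennis-ball example from the original CoT literature; the trailing "Let's think step by step" cue comes from zero-shot CoT, combined here purely for illustration):

```python
# Few-shot chain-of-thought prompt: the exemplar demonstrates intermediate
# reasoning steps, nudging the model to reason before answering.
exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)
question = (
    "Q: A cafeteria had 23 apples. They used 20 for lunch and bought 6 more. "
    "How many apples do they have?\n"
)
# The cue below invites step-by-step reasoning even without exemplars.
prompt = exemplar + question + "A: Let's think step by step."
print(prompt)
```

The prompt string would then be sent to any LLM completion API; the model's intermediate steps appear in its continuation before the final answer.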
LRMs: Large Reasoning Models—LLMs specifically optimized for reasoning tasks, often via reinforcement learning (e.g., OpenAI o1, DeepSeek-R1)
Overthinking Phenomenon: The tendency of reasoning models to generate excessively detailed or redundant steps for simple problems, wasting compute
System-1 vs System-2: System-1 refers to fast, intuitive thinking; System-2 refers to slow, deliberate, step-by-step reasoning
PPO: Proximal Policy Optimization—a policy-gradient reinforcement learning algorithm that clips the policy-update ratio for training stability; widely used to fine-tune LLMs (e.g., in RLHF)
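The core of PPO is its clipped surrogate objective. A minimal NumPy sketch (function name and toy values are illustrative, not from any specific implementation):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate loss (to be minimized).

    Takes the pessimistic minimum of the unclipped and clipped
    importance-weighted advantages, then negates it to form a loss.
    """
    ratio = np.exp(logp_new - logp_old)               # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))

# Toy check: identical policies give ratio = 1 everywhere,
# so the loss reduces to -mean(advantages).
logp = np.log(np.array([0.5, 0.25, 0.25]))
adv = np.array([1.0, -1.0, 0.5])
print(ppo_clip_loss(logp, logp, adv))  # ≈ -0.1667
```

The clipping keeps each update close to the old policy, which is why PPO is preferred over vanilla policy gradients for fine-tuning large models.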
SFT: Supervised Fine-Tuning—training a model on labeled examples (input-output pairs)
Process Reward Model (PRM): A reward model that evaluates the correctness of intermediate reasoning steps, not just the final answer
Monte Carlo Tree Search (MCTS): A search algorithm used to explore reasoning paths by simulating future outcomes, speculated to be used in models like OpenAI o1
KV Cache: Key-Value Cache—memory used during LLM inference to store past tokens' key/value attention states so they are not recomputed at every decoding step; compressing it reduces memory use and speeds up long-sequence generation
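A toy single-head decoder step shows why the cache helps: each new token computes only its own key/value projection and reuses the cached rest (all names and shapes here are illustrative, not from any real inference engine):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                # model/head dimension (toy size)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

K_cache, V_cache = [], []            # grows by one entry per generated token

def attend(x_t):
    """Attention for one new token, reusing cached keys/values."""
    q = x_t @ Wq
    K_cache.append(x_t @ Wk)         # only the NEW token's K/V are computed
    V_cache.append(x_t @ Wv)
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()         # softmax over all past positions
    return weights @ V

for _ in range(5):                   # 5 decoding steps: each is O(t), not O(t^2)
    out = attend(rng.normal(size=d))
print(len(K_cache))  # → 5
```

Without the cache, every step would reproject the entire prefix; KV-cache compression methods shrink these stored tensors to cut the memory cost of long reasoning traces.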
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank update matrices injected into selected layers
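A minimal NumPy sketch of a LoRA-adapted linear layer (dimensions, scaling, and initialization follow the usual convention, but this is an illustration, not a library implementation; in practice one would use an adapter library such as PEFT):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 16, 4           # r << d_in: the low-rank bottleneck

W = rng.normal(size=(d_out, d_in))   # frozen pretrained weight (not trained)
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))             # trainable up-projection; zero at init,
                                     # so the adapted layer starts identical
alpha = 8.0                          # LoRA scaling hyperparameter

def lora_forward(x):
    # Frozen path plus low-rank update path, scaled by alpha / r.
    return x @ W.T + (x @ A.T @ B.T) * (alpha / r)

x = rng.normal(size=(2, d_in))
# With B = 0 the adapter contributes nothing: output equals the frozen layer.
assert np.allclose(lora_forward(x), x @ W.T)
```

Only A and B (2 * r * d parameters per layer instead of d * d) receive gradients, which is what makes the method parameter-efficient.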