LRM: Large Reasoning Model—an evolution of LLMs designed specifically for complex multi-step reasoning (e.g., OpenAI o1).
PRM: Process Reward Model—a model trained to score the correctness of intermediate reasoning steps rather than just the final answer.
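A common use of a PRM is reranking candidate solutions by their step scores. This is a minimal sketch, not any specific system's implementation: the per-step scores are assumed to come from some PRM, and aggregating by the minimum (the weakest step) is one common heuristic, since a single incorrect intermediate step typically invalidates the whole chain.

```python
def prm_rerank(candidates):
    """Pick the best candidate solution using per-step PRM scores.

    candidates: list of (answer, step_scores) pairs, where step_scores
    are correctness scores (0..1) for each intermediate reasoning step,
    assumed to come from a trained process reward model.
    """
    # Score each whole solution by its weakest step, then take the best.
    return max(candidates, key=lambda c: min(c[1]))[0]
```

For example, a solution whose final answer looks right but contains one low-scoring step loses to a solution with uniformly solid steps.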
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer.
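A CoT prompt can be sketched as a template: one worked exemplar whose answer spells out intermediate steps, followed by the new question. The exemplar text below is illustrative, not from any benchmark.

```python
def cot_prompt(question):
    """Build a few-shot chain-of-thought prompt for a new question."""
    # One worked exemplar whose answer shows explicit intermediate steps,
    # nudging the model to produce similar reasoning before its answer.
    exemplar = (
        "Q: A pen costs 2 dollars and a notebook costs 3 dollars. "
        "How much do 2 pens and 1 notebook cost?\n"
        "A: 2 pens cost 2 * 2 = 4 dollars. Adding 1 notebook at 3 dollars "
        "gives 4 + 3 = 7 dollars. The answer is 7."
    )
    return f"{exemplar}\nQ: {question}\nA: Let's think step by step."
```

The trailing "Let's think step by step." cue is the zero-shot CoT trigger phrase; combining it with a worked exemplar is one common variant.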
RLHF: Reinforcement Learning from Human Feedback—aligning models using rewards derived from human preferences.
DPO: Direct Preference Optimization—an algorithm that aligns a language model to preference data directly, without training a separate reward model or running a reinforcement learning loop.
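The DPO loss on a single preference pair can be sketched in plain Python. The inputs are sequence log-probabilities under the policy being trained and under the frozen reference policy; the values and the choice of β here are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is a full-sequence log-probability; `ref_*` values come
    from the frozen reference policy (typically the SFT model).
    """
    # Implicit rewards: beta-scaled log-ratios against the reference policy.
    chosen_margin = beta * (logp_chosen - ref_logp_chosen)
    rejected_margin = beta * (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the margin difference; minimized when the
    # policy favors the chosen response more than the reference does.
    return -math.log(sigmoid(chosen_margin - rejected_margin))
```

When policy and reference agree exactly, both margins are zero and the loss is log 2; raising the chosen response's probability relative to the reference lowers it.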
SFT: Supervised Fine-Tuning—training a pre-trained model on labeled instruction-response pairs.
In-context Learning: The ability of an LLM to adapt to a task given a few examples in the prompt without parameter updates.
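In-context learning is typically elicited with a few-shot prompt: input→output demonstrations followed by the query. A minimal sketch of such a prompt builder (the formatting convention is an assumption, not a standard):

```python
def few_shot_prompt(examples, query):
    """Format (input, output) demonstrations plus a query for an LLM.

    The model is expected to infer the task pattern from the examples
    alone—no gradient updates or fine-tuning are involved.
    """
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    # Leave the final Output slot empty for the model to complete.
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)
```

For instance, two uppercasing demonstrations followed by `fox` would typically lead the model to complete `FOX`, despite never being trained on that task explicitly.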
Test-time scaling: Improving model performance by allocating more computation during inference—for example, generating longer reasoning traces or sampling multiple candidate answers.
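One simple form of test-time scaling is self-consistency: sample many answers and take the majority vote, trading extra inference compute for accuracy. The `noisy_solver` below is a hypothetical stand-in for a stochastic model, not a real one.

```python
import random
from collections import Counter

def noisy_solver(x, rng):
    # Hypothetical stochastic "model": returns the right answer (2 * x)
    # 70% of the time, otherwise an off-by-one mistake.
    return 2 * x if rng.random() < 0.7 else 2 * x + 1

def majority_vote(x, n_samples, seed=0):
    """Test-time scaling via self-consistency: sample n answers, return the mode."""
    rng = random.Random(seed)
    answers = [noisy_solver(x, rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

With a single sample the result is a coin-weighted guess; with 25 samples the majority answer is almost always correct, illustrating how extra inference compute buys reliability.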
System 2 thinking: Deliberate, slow, and logical reasoning, as opposed to fast, intuitive System 1 thinking.