LRM: Large Reasoning Model—an LLM specialized in complex reasoning via long Chain-of-Thought generation (e.g., OpenAI o1, DeepSeek R1).
Reasoning Economy: The balance struck between reasoning performance (benefits) and computational cost (budgets).
System 1 vs System 2: System 1 is fast, intuitive, and efficient; System 2 is slow, deep, analytical, and computationally expensive.
CoT: Chain-of-Thought—prompting models to generate intermediate reasoning steps before the final answer.
PRM: Process Reward Model—an RL reward model that evaluates intermediate reasoning steps rather than just the final outcome.
ORM: Outcome Reward Model—an RL reward model that evaluates only the final result (e.g., correct/incorrect answer).
Self-Consistency: A parallel test-time method where the model samples multiple reasoning paths and selects the most frequent answer (majority voting).
SFT: Supervised Fine-Tuning—training a model on labeled examples (input-output pairs) to follow instructions.
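The Self-Consistency entry above describes a concrete algorithm: sample several reasoning paths, keep only each path's final answer, and return the majority vote. A minimal sketch, assuming a hypothetical `sample_answer` callable that runs one CoT sample and returns its final answer string:

```python
from collections import Counter

def self_consistency(sample_answer, n_samples=5):
    """Majority voting over independently sampled reasoning paths.

    `sample_answer` is a hypothetical stand-in for one model call
    that generates a full CoT and returns only the final answer.
    """
    answers = [sample_answer() for _ in range(n_samples)]
    # The most frequent final answer wins the vote.
    return Counter(answers).most_common(1)[0][0]

# Toy usage: a canned sampler standing in for real model calls.
canned = iter(["42", "41", "42", "42", "40"])
print(self_consistency(lambda: next(canned), n_samples=5))  # → 42
```

Note that voting happens over final answers, not over the reasoning text itself, so paths that disagree in their intermediate steps can still reinforce the same answer.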