Inference Scaling: Allocating more computational resources at test time (e.g., generating many candidate solutions, running tree search) to improve performance without retraining the model
Learning-to-Reason: Enhancing reasoning capabilities through dedicated training (SFT or RL) so the model internalizes the thinking process, reducing reliance on costly inference compute
Agentic Systems: AI systems that exhibit interactivity and autonomy, using tools, interacting with an environment, or communicating with other agents to refine their reasoning
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
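The CoT entry above can be illustrated with a minimal zero-shot prompt; the trigger phrase and the sample question are illustrative choices, not a fixed standard.

```python
# Illustrative zero-shot CoT prompt. The "Let's think step by step." trigger
# is one common phrasing; the question is a made-up example.
question = "If a train travels 60 km in 40 minutes, what is its speed in km/h?"
prompt = f"Q: {question}\nA: Let's think step by step."

# The model is then expected to emit intermediate steps
# (e.g., "40 minutes is 2/3 of an hour; 60 / (2/3) = 90 km/h")
# before stating the final answer.
```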
PPO: Proximal Policy Optimization—a reinforcement learning algorithm commonly used to align LLMs with human preferences or reasoning objectives
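PPO's core idea, the clipped surrogate objective, can be sketched for a single action; this assumes scalar log-probabilities and advantage, with `epsilon` as the usual clip-range hyperparameter.

```python
import math

# Minimal sketch of PPO's clipped surrogate objective for one action.
def ppo_clip_objective(logp_new, logp_old, advantage, epsilon=0.2):
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(min(ratio, 1.0 + epsilon), 1.0 - epsilon)
    # Pessimistic minimum: a large policy shift earns no extra credit,
    # which keeps each update close to the old policy.
    return min(ratio * advantage, clipped_ratio * advantage)

# A big probability increase on a positive-advantage action is capped
# at the clipped ratio (1 + epsilon) times the advantage.
obj = ppo_clip_objective(logp_new=-1.0, logp_old=-1.5, advantage=1.0)
```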
GRPO: Group Relative Policy Optimization—a recent RL algorithm (used in DeepSeek-R1) that optimizes reasoning without a separate critic model by comparing a group of outputs
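The "comparing a group of outputs" part of GRPO can be sketched as follows: each sampled output's reward is normalized against its group's mean and standard deviation, which is what removes the need for a learned critic. The epsilon term and exact normalization are the assumed standard form.

```python
# Sketch of GRPO's group-relative advantage: normalize each output's
# reward within its sampled group, so no separate value model is needed.
def group_relative_advantages(rewards):
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    # Small epsilon guards against a zero-variance group.
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Example: four sampled answers to one question, reward 1 if correct else 0.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct answers get positive advantage, incorrect ones negative.
```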
ORM: Outcome Reward Model—a verifier that evaluates the final answer of a reasoning chain
PRM: Process Reward Model—a verifier that evaluates the correctness of each intermediate step in a reasoning chain
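The ORM/PRM contrast can be made concrete with a toy sketch; the scores here are hand-set stand-ins for learned verifier outputs, and aggregating PRM step scores with `min` is one common choice, assumed for illustration.

```python
# Hypothetical verifier scores for a reasoning chain; in practice these
# come from trained reward models, not hand-set numbers.
def orm_score(final_answer_score):
    # ORM: judge only the outcome (the final answer).
    return final_answer_score

def prm_score(step_scores):
    # PRM: judge every intermediate step; min-aggregation means the chain
    # is only as trustworthy as its weakest step.
    return min(step_scores)

# A chain with a plausible final answer but a shaky middle step:
orm = orm_score(0.9)              # looks fine to an ORM
prm = prm_score([0.9, 0.2, 0.9])  # the PRM flags the weak step
```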
SFT: Supervised Fine-Tuning—training a model on labeled examples (e.g., question + correct reasoning trace)
AGI: Artificial General Intelligence—AI systems with broad, human-like cognitive abilities
DPO: Direct Preference Optimization—a method to align models to preferences without an explicit reward model
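The DPO entry can be grounded with its per-pair loss: a negative log-sigmoid of the scaled margin between the policy's and the reference model's log-probability gaps on the chosen vs. rejected response. The inputs here are assumed scalar log-probabilities, and `beta` is the usual temperature hyperparameter.

```python
import math

# Sketch of the DPO loss for a single preference pair
# (w = chosen/"winning" response, l = rejected/"losing" response).
def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the frozen reference model does.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin; no explicit reward model needed.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy prefers the chosen response more than the reference does,
# so the loss drops below log(2) (the zero-margin value).
loss = dpo_loss(logp_w=-2.0, logp_l=-3.0,
                ref_logp_w=-2.5, ref_logp_l=-2.5, beta=0.1)
```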