RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using rewards derived from human preferences
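A common first stage of RLHF is training the reward model on human preference pairs with a Bradley–Terry loss. A minimal sketch of that loss (the function name is illustrative, not from a specific library):

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry preference loss for one human-labeled pair:
    maximize sigmoid(r_chosen - r_rejected), i.e. push the reward
    model to score the preferred response higher."""
    margin = r_chosen - r_rejected
    # Negative log-sigmoid of the margin; zero margin gives log(2).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The trained reward model then supplies scalar rewards for the RL fine-tuning stage.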
RLAIF: Reinforcement Learning from AI Feedback—using AI models to generate preferences or rewards for training other models
DPO: Direct Preference Optimization—a stable method that optimizes a policy directly on preference pairs with a classification-style loss, avoiding both a separate reward model and PPO
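The DPO loss for a single preference pair can be sketched in a few lines, assuming you already have summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.
    pi_* / ref_*: log-probs of the chosen and rejected responses
    under the policy and the frozen reference model."""
    # Implicit reward margin: how much more the policy (relative to
    # the reference) prefers the chosen response over the rejected one.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Negative log-sigmoid: loss shrinks as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree exactly, the margin is zero and the loss is log(2); increasing the chosen response's relative likelihood drives it toward zero.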
GRPO: Group Relative Policy Optimization—an RL method used in models like DeepSeek R1 that samples a group of outputs per prompt and scores each one relative to the group's average reward, removing the need for a learned value model
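The core of GRPO's advantage computation is normalizing rewards within each group of sampled outputs, which can be sketched as:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled output relative to its group:
    a_i = (r_i - mean(r)) / (std(r) + eps).
    Outputs above the group average get positive advantage."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

These advantages then weight the policy-gradient update in place of a critic's value estimates.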
PPO: Proximal Policy Optimization—an RL algorithm that updates policies in small, constrained steps to ensure stability
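The "small, constrained steps" come from PPO's clipped surrogate objective, sketched here for a single action:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one action.
    ratio: pi_new(a|s) / pi_old(a|s); eps: clip range.
    Returns min(ratio * A, clip(ratio, 1-eps, 1+eps) * A), which
    removes the incentive to move the ratio outside [1-eps, 1+eps]."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, pushing the ratio above 1 + eps yields no extra objective; with a negative advantage, the clipped term bounds how hard the action is suppressed.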
CoT: Chain-of-Thought—a prompting or training technique where models generate intermediate reasoning steps before the final answer
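In its simplest zero-shot form, CoT prompting just appends a reasoning cue to the question; a minimal sketch (the wording of the cue is one common choice, not the only one):

```python
def cot_prompt(question):
    """Wrap a question with a zero-shot chain-of-thought cue so the
    model emits intermediate reasoning before its final answer."""
    return f"Q: {question}\nA: Let's think step by step."
```

Few-shot variants instead prepend worked examples whose answers include the reasoning steps.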
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that updates low-rank matrices instead of full weights
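The LoRA update keeps the pretrained weight W frozen and learns a low-rank correction B @ A; merging them back gives W' = W + (alpha / r) * B @ A. A tiny sketch using nested lists (helper names are illustrative):

```python
def matmul(X, Y):
    """Plain matrix multiply on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def merge_lora(W, A, B, alpha, r):
    """Merged weight W' = W + (alpha / r) * (B @ A).
    W: d_out x d_in frozen weight; B: d_out x r; A: r x d_in.
    Only A and B (r * (d_out + d_in) parameters) are trained,
    versus d_out * d_in for full fine-tuning."""
    BA = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]
```

Because the correction has rank at most r, the trainable parameter count grows linearly rather than quadratically in the layer width.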
RAG: Retrieval-Augmented Generation—enhancing model outputs by retrieving relevant external documents during inference
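The RAG pattern is: retrieve relevant documents for the query, then stuff them into the prompt as context. A toy sketch using word overlap as a stand-in for real embedding similarity search (function names and prompt format are illustrative):

```python
def retrieve(query, docs, k=2):
    """Rank documents by word overlap with the query (a crude proxy
    for embedding similarity) and return the top k."""
    q_words = set(query.lower().split())
    return sorted(docs,
                  key=lambda d: len(q_words & set(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(query, docs, k=2):
    """Prepend retrieved passages as context for the generator."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

Production systems replace the overlap score with dense embeddings and an approximate nearest-neighbor index, but the retrieve-then-generate flow is the same.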
SCST: Self-Critical Sequence Training—an RL method where the model's own greedy decoding serves as the baseline for policy updates
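The self-critical idea reduces to REINFORCE with the greedy decode's reward as the baseline; per sampled sequence (function name is illustrative):

```python
def scst_loss(sample_logprob, sample_reward, greedy_reward):
    """Self-critical sequence training loss for one sample:
    -(r_sample - r_greedy) * log p(sampled sequence).
    Samples that beat the model's own greedy decode are reinforced;
    worse samples are suppressed. No learned baseline is needed."""
    return -(sample_reward - greedy_reward) * sample_logprob
```

Since the baseline comes from the model itself, the gradient vanishes exactly when sampling does no better than greedy decoding.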
Test-time Scaling: Techniques applied during inference (like increasing search depth or width) to improve performance without retraining parameters
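One of the simplest test-time scaling strategies is best-of-n sampling: spend more inference compute drawing candidates and keep the best one under some scorer. A generic sketch (the callables are placeholders for a sampler and a verifier/reward model):

```python
def best_of_n(generate, score, prompt, n=8):
    """Draw n candidate outputs for the prompt and return the one
    with the highest score. More compute at inference time, no
    change to the model's parameters."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

Increasing n trades latency and cost for output quality; deeper search or longer reasoning chains are the sequential analogues of the same idea.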
MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker
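A finite MDP can be solved exactly by value iteration, repeatedly applying the Bellman optimality backup; a compact sketch over dict-based transition and reward tables:

```python
def value_iteration(states, actions, P, R, gamma=0.9, iters=100):
    """Value iteration on a finite MDP.
    P[s][a]: list of (probability, next_state) pairs.
    R[s][a]: expected immediate reward for taking a in s.
    Returns the approximate optimal state-value function V*."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        # Bellman optimality backup for every state.
        V = {s: max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                    for a in actions)
             for s in states}
    return V
```

For gamma < 1 the backup is a contraction, so the iterates converge geometrically to the unique optimal values.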