PBRFT: Preference-Based Reinforcement Fine-Tuning—optimizing LLMs on static preference datasets to align with human preferences (e.g., RLHF, DPO)
Agentic RL: Reinforcement Learning applied to LLMs acting as autonomous agents in dynamic environments, optimizing for long-term task completion rather than just single-turn text quality
POMDP: Partially Observable Markov Decision Process—a mathematical framework where an agent makes decisions based on incomplete observations of the world state
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs to each other, eliminating the need for a separate value network (critic)
DPO: Direct Preference Optimization—a method optimizing the policy directly on preference data without an explicit reward model
PPO: Proximal Policy Optimization—an on-policy RL algorithm that constrains each policy update (typically by clipping the probability ratio between the new and old policy) to keep training stable
SFT: Supervised Fine-Tuning—training models on labeled examples
RAG: Retrieval-Augmented Generation—enhancing LLM inputs with external data
degenerate MDP: An MDP where the time horizon T=1, effectively reducing the problem to a contextual bandit or single-step supervised learning task
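
As an illustration of the GRPO entry above, the group-relative advantage can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `group_relative_advantages`, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for the example, and the rewards are made-up values for one prompt with four sampled outputs.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each output relative to the other outputs sampled for the same prompt.

    Hypothetical helper: each reward is centered on the group mean and scaled
    by the group standard deviation, so no learned value network (critic) is
    needed to provide a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    # eps (an assumed stabilizer) avoids division by zero when all rewards match
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four outputs sampled from one prompt
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Outputs that score above the group mean receive a positive advantage and those below receive a negative one; if every output in the group earns the same reward, all advantages are zero and the group contributes no learning signal.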