PPO: Proximal Policy Optimization—an RL algorithm that limits each policy update by clipping the probability ratio between the new and old policies, ensuring stable training
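A minimal sketch of PPO's clipped surrogate loss in pure Python (the function name is hypothetical; log-probabilities and advantages are assumed to be precomputed per sample):

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss averaged over a batch of samples."""
    losses = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)              # pi_new(a|s) / pi_old(a|s)
        clipped = max(1 - clip_eps, min(1 + clip_eps, ratio))
        # Pessimistic bound: take the smaller of the two surrogate objectives
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)
```

When the new and old policies agree (ratio = 1), the loss reduces to the plain policy-gradient surrogate; once the ratio leaves the [1 − ε, 1 + ε] band, the gradient through it is cut off.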
Credit Assignment: The problem of determining which past actions are responsible for a final outcome (reward)
Value Network: A neural network (Critic) trained to predict the expected future reward from a given state
Monte Carlo (MC) Rollout: Simulating a trajectory from a specific state to the end of the episode to observe the actual reward
Vine: A method from TRPO where the environment is reset to a specific state to perform multiple rollouts, creating a 'vine' of trajectories for variance reduction
GRPO: Group Relative Policy Optimization—an RL method that normalizes rewards within a group of samples to reduce variance without a value network
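The group normalization behind GRPO can be sketched in a few lines (a simplified illustration, assuming one scalar reward per sampled completion; the zero-variance guard is a common practical choice, not part of any specific spec):

```python
import statistics

def grpo_advantages(rewards):
    """Advantage of each sample = (reward - group mean) / group std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: identical rewards give std 0
    return [(r - mean) / std for r in rewards]
```

Because the baseline is the group's own mean, the advantages always sum to zero—no separate value network is needed to center the rewards.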
DPO: Direct Preference Optimization—a method optimizing the policy directly from preference data without explicit reward modeling
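The DPO loss for a single preference pair can be written directly from the policy and reference log-probabilities (a minimal sketch; argument names are illustrative, and beta is the usual temperature hyperparameter):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_w) - (logp_l - ref_l)])."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy is identical to the reference, the margin is zero and the loss is log 2; raising the chosen completion's log-probability relative to the reference drives the loss down.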
RLOO: REINFORCE Leave-One-Out—a policy gradient method that, for each sampled completion, uses the average reward of the other samples in the batch as its baseline
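The leave-one-out baseline is simple to compute (a sketch assuming one scalar reward per sample and at least two samples per prompt):

```python
def rloo_advantages(rewards):
    """Advantage of sample i = r_i minus the mean reward of all OTHER samples."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```

Excluding each sample from its own baseline keeps the gradient estimator unbiased, unlike subtracting the full-batch mean.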
SFT: Supervised Fine-Tuning—initial training of the model on labeled data
Chain-of-Thought: A reasoning strategy where the model generates intermediate steps before the final answer