PPO: Proximal Policy Optimization—an on-policy reinforcement learning algorithm that keeps each policy update close to the current policy (in practice by clipping the probability ratio between new and old policies) to ensure stable learning.
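A minimal sketch of PPO's clipped surrogate objective, using NumPy and toy ratio/advantage values (the function name and numbers are illustrative, not from a specific library):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the elementwise minimum removes the incentive to push the
    # ratio far outside [1-eps, 1+eps], keeping updates near the old policy.
    return np.minimum(unclipped, clipped).mean()

# Toy values: a ratio of 1.5 with positive advantage gets clipped to 1.2,
# so the gradient through that sample vanishes.
ratios = np.array([1.5, 0.9])
advantages = np.array([1.0, -1.0])
print(ppo_clip_objective(ratios, advantages))  # (1.2 + -0.9) / 2 = 0.15
```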
DDQN: Double Deep Q-Network—an off-policy RL algorithm that uses the online network to select the next action and a separate target network to evaluate it, reducing the overestimation bias in action-value estimation that standard DQN suffers from.
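The decoupling can be shown in a small sketch of the Double DQN bootstrap target for a single transition (toy Q-values; the function name is illustrative):

```python
import numpy as np

def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Online net selects the argmax action; target net evaluates that action."""
    a_star = int(np.argmax(next_q_online))          # selection: online network
    bootstrap = 0.0 if done else gamma * next_q_target[a_star]  # evaluation: target network
    return reward + bootstrap

# Toy Q-values: the online net picks action 1, which the target net scores 2.0,
# even though the target net's own max (5.0) sits at action 0.
q_online = np.array([1.0, 3.0])
q_target = np.array([5.0, 2.0])
print(double_dqn_target(1.0, q_online, q_target))  # 1.0 + 0.99 * 2.0 = 2.98
```

A plain DQN target would instead take `max(q_target)`, which systematically picks the most overestimated value.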
Importance Sampling: A statistical technique used to estimate properties of a target distribution while sampling from a different proposal distribution, often used in RL to reuse old data.
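A small self-contained sketch of (self-normalized) importance sampling: estimating the mean of a target Gaussian while drawing samples only from a different proposal Gaussian (distributions and sample count chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Goal: estimate E_p[x] for target p = N(1, 1), sampling from proposal q = N(0, 1).
samples = rng.normal(0.0, 1.0, size=200_000)  # draws from q, not p

def logpdf(x, mu):
    """Log density of a unit-variance Gaussian centered at mu."""
    return -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)

# Importance weights w(x) = p(x) / q(x), computed in log space for stability.
weights = np.exp(logpdf(samples, 1.0) - logpdf(samples, 0.0))
# Self-normalized estimator: weighted average of the samples.
estimate = np.sum(weights * samples) / np.sum(weights)
print(estimate)  # close to the true mean of p, which is 1.0
```

The same ratio idea appears in off-policy RL, where the weight is the probability of a trajectory (or action) under the current policy divided by its probability under the behavior policy that generated the data.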
Replay Buffer: A memory structure that stores past agent transitions (state, action, reward, next state) to be reused for training, typically used in off-policy methods.
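A minimal replay buffer can be sketched with a fixed-size deque and uniform sampling (class and method names are illustrative, not a specific library's API):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # deque with maxlen silently evicts the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(5):
    buf.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buf.sample(batch_size=3)
print(len(buf), len(batch))  # 5 3
```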
Sparse Rewards: Environments where the agent receives non-zero feedback very infrequently (e.g., only upon solving a maze), making learning difficult.
Distribution Shift: The phenomenon where the data distribution in the replay buffer differs from the data distribution generated by the current policy.