PPO: Proximal Policy Optimization—an RL algorithm that improves stability by clipping the probability ratio between new and old policies to prevent dangerously large updates
Trust Region: A constraint on how much a policy is allowed to change in a single update step to ensure stability
Entropy: A measure of the randomness or uncertainty in the policy's action distribution; high entropy indicates exploration, low entropy indicates near-deterministic behavior
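As a small illustration (the function name is hypothetical, not from any particular library), the entropy of a discrete action distribution can be computed directly from its probabilities:

```python
import numpy as np

def policy_entropy(probs):
    # H(pi) = -sum_a p(a) * log p(a); higher means more exploratory.
    # A tiny constant guards against log(0) for zero-probability actions.
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + 1e-12))

# A uniform policy over 4 actions is maximally exploratory: H = log(4).
# A policy that always picks one action has entropy near 0.
```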
Clipping Threshold (ε): The PPO hyperparameter that defines the clipping interval [1-ε, 1+ε] for the probability ratio, effectively setting the width of the trust region; a common default is ε = 0.2
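A minimal sketch of how the clipping threshold enters PPO's surrogate objective (function and argument names are illustrative, not tied to a specific library):

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s) for each sampled action
    unclipped = ratio * advantage
    # Clipping the ratio to [1 - eps, 1 + eps] caps how much a single
    # update can push the policy away from the old one.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # PPO maximizes the pessimistic (minimum) surrogate, so the loss
    # to minimize is its negation, averaged over the batch.
    return -np.minimum(unclipped, clipped).mean()
```

For a ratio of 1.5 with ε = 0.2, the clipped term uses 1.2 instead, so the gradient stops rewarding further movement in that direction.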
GAE: Generalized Advantage Estimation—a method to estimate the 'advantage' of an action (how much better it is than the average action in that state) with a tunable trade-off between bias and variance, controlled by the discount γ and the mixing parameter λ
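The standard GAE recursion can be sketched as follows (a simplified single-episode version; real implementations also handle episode-termination masks):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    # values must have len(rewards) + 1 entries: the extra entry is the
    # bootstrap value estimate for the state after the last step.
    advantages = np.zeros(len(rewards))
    last = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors:
        # A_t = delta_t + gamma * lam * A_{t+1}
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages
```

Setting λ = 0 recovers the one-step TD error (low variance, more bias), while λ = 1 recovers the full Monte Carlo advantage (high variance, low bias).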
Ablation Study: An experiment where parts of the model are removed to test their individual contributions
Sparse Reward: Environments where the agent receives feedback (rewards) very rarely, making learning difficult