PPO: Proximal Policy Optimization—a reinforcement learning algorithm that improves training stability by limiting how much the policy can change in each update using a clipped surrogate objective
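A minimal NumPy sketch of the clipped surrogate term, assuming the common ε = 0.2 default (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Per-sample clipped PPO loss (to minimize).

    ratio: pi_new(a|s) / pi_old(a|s) for each sample.
    Clipping the ratio to [1 - eps, 1 + eps] removes the incentive
    to move the policy far from the one that collected the data.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # PPO maximizes the elementwise minimum; negate it to get a loss.
    return -np.minimum(unclipped, clipped)

# A ratio already past 1 + eps earns no extra credit for a positive advantage:
loss = ppo_clip_loss(np.array([1.5]), np.array([1.0]))  # clipped at 1.2
```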
GAE: Generalized Advantage Estimation—a method to estimate the advantage function (how much better an action is than the policy's average) by taking an exponentially weighted average of k-step advantage estimates, controlled by a parameter λ, to trade off bias against variance
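A short sketch of the standard λ-weighted backward recursion, assuming a values array with one extra bootstrap entry for the final state (names are illustrative):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward GAE recursion over one trajectory.

    values has len(rewards) + 1 entries; the last one is the
    bootstrap value of the final state.
    """
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: the k = 1 building block.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Longer horizons are exponentially down-weighted by gamma * lam.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```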
TRPO: Trust Region Policy Optimization—a precursor to PPO that strictly enforces a constraint on the policy change (KL divergence) rather than using a clipped objective
A2C: Advantage Actor-Critic—a synchronous, deterministic variant of the A3C algorithm that updates the policy using the advantage function
Random Reshuffling: An optimization technique where data samples are permuted (shuffled) at the start of each epoch and used exactly once per epoch, often converging faster than random sampling
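A sketch of one reshuffled epoch with NumPy (the helper name and batch layout are illustrative assumptions):

```python
import numpy as np

def epoch_batches(n_samples, batch_size, rng):
    # Permute the indices once per epoch; each index is then used
    # exactly once, unlike i.i.d. sampling with replacement.
    perm = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield perm[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(epoch_batches(8, 4, rng))
# The two batches partition the indices 0..7 with no repeats.
```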
Surrogate Objective: A substitute objective function used in PPO (involving probability ratios) whose gradient matches the true policy gradient at the current policy, making it a valid local approximation
Tail-mass collapse: The phenomenon identified in this paper where truncated GAE weights fail to sum to unity at the end of a trajectory, losing information
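A numeric illustration of the effect, assuming the standard GAE weighting (1 − λ)λ^(k−1) on the k-step return (function name is illustrative):

```python
def truncated_gae_mass(lam, horizon):
    # Total weight that truncated GAE places on the first `horizon`
    # k-step returns; the untruncated infinite series sums to exactly 1.
    return sum((1 - lam) * lam ** (k - 1) for k in range(1, horizon + 1))

# Five steps before the trajectory end with lam = 0.95, the weights
# sum to 1 - 0.95**5, i.e. about 0.23 rather than 1.
```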
Score function: The gradient of the log-probability of the policy, ∇ log π(a|s), central to the Policy Gradient Theorem
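For a softmax policy over discrete actions, the score has a closed form; a small NumPy sketch under that assumed parameterization (names are illustrative):

```python
import numpy as np

def score_softmax(logits, action):
    # For pi(a|s) = softmax(logits), the gradient of log pi(a|s)
    # with respect to the logits is one_hot(action) - softmax(logits).
    probs = np.exp(logits - logits.max())  # stabilized softmax
    probs /= probs.sum()
    grad = -probs
    grad[action] += 1.0
    return grad
```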