LPG: Learned Policy Gradient—a meta-RL method that learns a critic producing 'bootstrap vectors' to supervise the actor
LPO: Learned Policy Optimization—a meta-RL method that learns a 'drift function' to constrain policy updates, generalizing PPO
Meta-gradients: Optimizing meta-parameters by differentiating through the inner-loop learning process (Backpropagation Through Time)
Evolution Strategies (ES): A black-box optimization method that estimates gradients by perturbing parameters and measuring fitness, avoiding the need for differentiable inner loops
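As a sketch of the idea (a hypothetical NumPy helper, not code from any of the methods above): a vanilla ES estimator perturbs the parameters with Gaussian noise and weights each perturbation by its fitness, never differentiating the fitness function itself; subtracting the unperturbed fitness as a baseline reduces variance.

```python
import numpy as np

def es_gradient(f, theta, sigma=0.05, n=5000, seed=0):
    """Estimate the gradient of the smoothed fitness E[f(theta + sigma*eps)]
    without differentiating f. The baseline f(theta) is subtracted purely
    for variance reduction (illustrative helper, not a library API)."""
    rng = np.random.default_rng(seed)
    baseline = f(theta)
    grad = np.zeros_like(theta)
    for _ in range(n):
        eps = rng.standard_normal(theta.shape)
        grad += (f(theta + sigma * eps) - baseline) * eps
    return grad / (n * sigma)

# Quadratic fitness f(x) = -|x|^2 has true gradient -2x.
theta = np.array([1.0, -2.0])
g = es_gradient(lambda x: -np.sum(x ** 2), theta)
```

Because only fitness evaluations are needed, the inner loop can contain non-differentiable operations, which is the appeal over meta-gradients.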
Drift function: In Mirror Learning, a non-negative function, zero when the new and old policies coincide, that penalizes divergence between them; LPO parameterizes this with a neural network
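For intuition, PPO's clip objective can itself be read as a hand-designed drift function. The sketch below assumes the ReLU-based formulation used in the Mirror Learning and LPO papers; names are illustrative.

```python
import numpy as np

def ppo_drift(ratio, advantage, eps=0.2):
    """PPO's clipping expressed as a drift penalty: zero when the new and
    old policies agree (ratio == 1) and inside the clip region, non-negative
    outside it. LPO replaces a hand-designed function like this with a
    learned network (illustrative sketch)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.maximum((ratio - clipped) * advantage, 0.0)
```

The drift vanishes for small policy changes, so they go unpenalized; updates that would exploit the objective beyond the clip range incur a positive penalty.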
Myopic: Short-sighted; in this context, optimization that only considers immediate improvement rather than final performance at the end of training
Truncated Backpropagation: Stopping the backward pass after a fixed number of steps to bound memory and compute; the truncation biases the gradient and prevents learning dependencies longer than the truncation window
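A minimal sketch on a toy linear recurrence h_t = a*h_{t-1} + 1, with reverse mode written out by hand (purely illustrative): stopping the backward pass after K steps drops the contribution of earlier timesteps to the gradient.

```python
def unroll_grad(a, T, K=None):
    """Gradient of h_T w.r.t. a for h_t = a*h_{t-1} + 1, h_0 = 0.
    K=None backpropagates through all T steps; otherwise the backward
    pass stops after K steps, treating the earlier state as a constant
    (hand-rolled reverse mode, for illustration only)."""
    h = [0.0]
    for _ in range(T):
        h.append(a * h[-1] + 1.0)
    dh, da = 1.0, 0.0
    steps = T if K is None else min(K, T)
    for t in range(T, T - steps, -1):
        da += dh * h[t - 1]   # local derivative of h_t = a*h_{t-1} + 1
        dh *= a               # propagate sensitivity one step further back
    return da

full = unroll_grad(0.9, 20)        # exact gradient through all 20 steps
trunc = unroll_grad(0.9, 20, K=3)  # truncated: long-term terms are missing
```

The truncated estimate is strictly smaller here because every dropped term is positive; in general the missing terms are exactly the long-horizon dependencies.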
Antithetic sampling: A variance reduction technique in ES where perturbations are evaluated in pairs (x + noise, x - noise)
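A sketch of the pairing trick (hypothetical NumPy helper): each noise sample is evaluated at both theta + sigma*eps and theta - sigma*eps, so the fitness difference acts as a centered finite difference along eps and much of the noise cancels.

```python
import numpy as np

def es_grad_antithetic(f, theta, sigma=0.05, n_pairs=2000, seed=0):
    """ES gradient estimate with antithetic (mirrored) sampling
    (illustrative helper, not a library API)."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.standard_normal(theta.shape)
        # Evaluate each perturbation and its mirror image as a pair.
        grad += (f(theta + sigma * eps) - f(theta - sigma * eps)) * eps
    return grad / (2.0 * n_pairs * sigma)

theta = np.array([1.0, -2.0])
g = es_grad_antithetic(lambda x: -np.sum(x ** 2), theta)
```

For a quadratic fitness the mirrored difference cancels the even-order terms exactly, which is why the pairing sharply reduces estimator variance at the same evaluation budget.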
Rollback: Reversing a gradient update; observed in TA-LPO where the objective penalizes certain updates aggressively