PPO: Proximal Policy Optimization—an on-policy RL algorithm that alternates between data collection and policy updates using a clipped surrogate objective
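The clipped surrogate objective mentioned above can be sketched per sample as follows; the clip range `eps=0.2` is a common default, not a value taken from this document:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate objective for a single sample.

    ratio: pi_new(a|s) / pi_old(a|s) under the behavior policy's data.
    advantage: estimated advantage A(s, a).
    Taking the min with the clipped term removes any incentive to push
    the probability ratio outside [1 - eps, 1 + eps].
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)
```

In practice this is averaged over a minibatch and maximized (or its negation minimized) during the inner-loop SGD epochs.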
Outer Loop: The cycle in PPO where the policy collects data from environments; viewed here as a single step in a stochastic optimization process
Inner Loop: The phase in PPO where the policy is updated via multiple epochs of minibatch SGD on the collected dataset
PPO-EWMA: A PPO variant that decouples regularization from the behavior policy by regularizing towards an Exponentially Weighted Moving Average of past policies
Center of Mass (COM): A hyperparameter controlling the effective 'age' of the reference policy in PPO-EWMA; a higher COM means regularizing towards older policies (stronger regularization)
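A minimal sketch of how an EWMA reference policy and its COM relate, assuming a parameter-space moving average with decay rate `beta` (the exact averaging scheme is an assumption here, not taken from this document). The exponential weights (1 - beta) * beta^k over past policies have center of mass beta / (1 - beta), so a larger beta yields an older, more strongly regularizing reference:

```python
def ewma_update(ref_params, new_params, beta):
    """One EWMA step on the reference policy's parameters:
    ref <- beta * ref + (1 - beta) * new.

    The center of mass of the resulting exponential weights over past
    policies is beta / (1 - beta): larger beta => older reference.
    """
    return {k: beta * ref_params[k] + (1.0 - beta) * new_params[k]
            for k in ref_params}

def com_from_beta(beta):
    """COM of the exponential weighting implied by decay rate beta."""
    return beta / (1.0 - beta)
```

For example, beta = 0.9 gives a COM of 9 past outer-loop iterations.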
DDR: Data to Divergence Ratio—the number of data points collected per unit of KL divergence from the behavior policy
Jax2D: A hardware-accelerated 2D physics engine used for procedural locomotion tasks
Kinetix: A complex, open-ended 2D physics-based RL environment used for large-scale evaluation
GAE: Generalized Advantage Estimation—a method to estimate the advantage function (how good an action is relative to average) with a bias-variance trade-off
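GAE's bias-variance trade-off is controlled by the lambda parameter, as in this sketch over a single trajectory (variable names and defaults are illustrative, not from this document):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: per-step rewards, length T.
    values: value estimates, length T + 1 (last entry bootstraps the tail).
    lam=0 reduces to the one-step TD error (low variance, high bias);
    lam=1 reduces to the full Monte Carlo advantage (high variance, low bias).
    """
    advantages = np.zeros(len(rewards))
    last = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages
```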
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution
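For discrete distributions, KL divergence has a direct computation; note that it is asymmetric, so KL(p || q) generally differs from KL(q || p):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions.

    Assumes p and q are valid probability vectors with q_i > 0 wherever
    p_i > 0. Zero iff p and q are identical.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```

In PPO-style algorithms this is typically measured between the current policy's and the behavior (or reference) policy's action distributions, averaged over states.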