counterfactual explanation: An explanation describing the minimal change required to an input (or in this case, a policy) to produce a different specified outcome
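A minimal sketch of the idea, for a toy 1-D classifier: binary search for the smallest perturbation that flips the predicted label. The classifier `f` and the search range are illustrative assumptions, not part of the glossary definition.

```python
# Hypothetical sketch: find the minimal 1-D perturbation that flips a
# classifier's decision, via binary search on the perturbation size.

def minimal_flip(f, x, direction=1.0, hi=10.0, tol=1e-6):
    """Smallest delta >= 0 such that f(x + direction * delta) != f(x)."""
    base = f(x)
    if f(x + direction * hi) == base:
        return None  # no flip within the search range
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(x + direction * mid) == base:
            lo = mid
        else:
            hi = mid
    return hi

# Example: a toy classifier that predicts "positive" above 5.0
f = lambda x: x > 5.0
delta = minimal_flip(f, 3.0)  # smallest change moving 3.0 past the boundary
```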
trust region methods: Optimization techniques (e.g., TRPO, PPO) that restrict each policy update to a neighborhood of the current policy to ensure stability and approximately monotonic improvement
KL-pivoting: An iterative update strategy where the reference policy (pivot) for the distance constraint is updated periodically, allowing the search to move further from the original start point
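A minimal sketch of the pivoting idea, with a scalar parameter and a squared-distance penalty standing in for the KL term (both simplifications are assumptions). Re-anchoring the pivot every K steps lets the iterate travel far from the original start point, whereas a fixed pivot keeps it stuck partway.

```python
# Hypothetical sketch: minimize loss(theta) = (theta - target)^2 under a
# proximity penalty to a pivot; the pivot is re-anchored every K steps.

def pivoted_descent(theta0=0.0, target=8.0, beta=1.0, lr=0.1, K=20, steps=200):
    theta, pivot = theta0, theta0
    for t in range(steps):
        if t > 0 and t % K == 0:
            pivot = theta  # move the pivot to the current iterate
        grad = 2 * (theta - target) + 2 * beta * (theta - pivot)
        theta -= lr * grad
    return theta

theta_pivoted = pivoted_descent()          # approaches target = 8
theta_fixed = pivoted_descent(K=10**9)     # fixed pivot: stalls near 4
```

With the pivot fixed at the start point, the penalty and the loss balance halfway to the target; periodic pivot updates remove that drag phase by phase.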
proximal operator: An operator used in optimization that minimizes a function plus a squared-distance penalty to a reference point, keeping the solution close to that point
A2C: Advantage Actor-Critic—a synchronous, deterministic variant of the A3C reinforcement learning algorithm
PPO: Proximal Policy Optimization—a policy gradient method that uses a clipped objective function to keep updates within a trust region
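The clipped objective can be sketched as follows (per-sample form, with hypothetical ratio and advantage values): the probability ratio r = pi_new(a|s) / pi_old(a|s) is clipped to [1 - eps, 1 + eps], and the objective takes the minimum of the clipped and unclipped terms, removing the incentive to move the ratio outside that interval.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the batch."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))

ratio = np.array([0.5, 1.0, 1.5])       # new/old action probabilities (assumed)
advantage = np.array([1.0, -1.0, 2.0])  # advantage estimates (assumed)
obj = ppo_clip_objective(ratio, advantage)
```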
TRPO: Trust Region Policy Optimization—an RL algorithm derived from a monotonic-improvement bound; in practice it enforces a hard constraint on the KL divergence between successive policies