Counterfactual Explanation Policies in RL

📝 Paper Summary

Explainable Reinforcement Learning (XRL), Counterfactual Explanations

COUNTERPOL generates explanatory policies by finding the minimal change to an existing RL policy that achieves a specified target return, revealing how the agent's behavior must shift for its performance to improve or deteriorate.

Core Problem

Existing RL explainability methods identify important input features or trajectories but fail to explain what minimal changes to the policy itself would lead to a desired improvement or deterioration in performance.

Why it matters:

Trust in autonomous agents requires understanding not just what decision was made, but how the policy could be modified to achieve better results
Current methods attribute actions to observations but cannot systematically answer 'what if' questions about the policy's overall performance target
Unlearning specific skills or debugging agent failure modes requires knowing exactly which policy behaviors lead to lower returns

Concrete Example: In a BipedalWalker environment, a standard policy might have a peculiar walk. Current tools highlight state features, but cannot explain that to achieve a higher return (e.g., 150), the policy specifically needs to adopt a more upright posture, whereas to degrade to a return of 50, it must intensify kneeling.

Key Novelty

Counterfactual Explanation Policy (COUNTERPOL)

Formulates explanations as an optimization problem: find a new policy that achieves a specific target return (better or worse) while remaining as close as possible to the original policy
Uses an iterative 'KL-pivoting' strategy where the reference policy is updated periodically to allow the agent to reach distant target returns while maintaining stability
Establishes a theoretical equivalence showing that optimizing for a 'best possible' counterfactual return is mathematically identical to standard Trust Region Policy Optimization (TRPO)

Architecture

Pseudocode for the Counterfactual Explanation Policy Optimization loop

Evaluation Highlights

Faithfully generates counterfactual policies matching target returns (e.g., achieving -996.9 for target -1000 in Pendulum-v1) across 5 OpenAI Gym environments
Demonstrates 'unlearning' capabilities: generated policies for LunarLander show exactly how to fail (e.g., free fall, missing flags) to reach specific negative return targets
Qualitative analysis reveals distinct behavioral modes: BipedalWalker counterfactuals explicitly show 'upright walking' for high returns vs 'dragging/kneeling' for low returns

Breakthrough Assessment

7/10

Novel formulation of RL explanations as policy optimization problems with strong theoretical links to TRPO. While tested on standard control tasks, it opens a new direction for contrastive policy analysis.

⚙️ Technical Details

Problem Definition

Setting: Finite horizon Markov Decision Process (MDP)

Inputs: Original policy π0, desired target return R_target

Outputs: Counterfactual policy π_cf that achieves R_target with minimal deviation from π0

Pipeline Flow

Input Policy & Target -> Objective Definition -> Gradient Estimation -> Iterative Update -> Output Policy

System Modules

Objective Definition

Defines loss function combining return error (L_ret) and proximity penalty (L_KL)

Model or implementation: Loss function: ||J_π - R_target||_p + k * D_KL(π0 || π)
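
A minimal sketch of how the loss above could be evaluated for a discrete-action policy. The function names, the per-state averaging of the KL term, and the use of an absolute value for the scalar return error are assumptions made for illustration, not details taken from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete action distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def counterpol_loss(est_return, target_return, ref_probs, new_probs, k):
    """Return error plus k times the mean per-state KL(reference || new).

    ref_probs / new_probs: one action distribution per visited state
    (hypothetical layout chosen for this sketch).
    """
    l_ret = abs(est_return - target_return)
    l_kl = sum(kl_divergence(p, q)
               for p, q in zip(ref_probs, new_probs)) / len(ref_probs)
    return l_ret + k * l_kl

# An unchanged policy pays only the return error; any deviation adds a KL cost.
same = counterpol_loss(100.0, 150.0, [[0.5, 0.5]], [[0.5, 0.5]], k=10)
moved = counterpol_loss(100.0, 150.0, [[0.5, 0.5]], [[0.9, 0.1]], k=10)
```

The coefficient k sets the trade-off: a larger k keeps the counterfactual closer to the original policy at the cost of a slower approach to the target return.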

Gradient Estimator (Optimization Loop)

Estimates gradients for the return and KL terms using samples

Model or implementation: On-policy Monte Carlo Policy Gradients
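
As a rough illustration of a Monte Carlo estimate for the return term, the sketch below uses the standard score-function (REINFORCE) estimator on a one-state softmax policy. The one-state setup, the helper names, and the choice of a subgradient for the absolute return error are all assumptions for this example; the KL term is omitted here.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def return_term_gradient(logits, episodes, target_return):
    """Monte Carlo (sub)gradient of |J - R_target| for a one-state policy.

    episodes: list of (actions, episode_return) pairs sampled on-policy.
    """
    n = len(episodes)
    est_return = sum(g for _, g in episodes) / n
    grad_j = [0.0] * len(logits)
    for actions, g in episodes:
        for a in actions:
            for i, d in enumerate(grad_log_prob(logits, a)):
                grad_j[i] += d * g / n
    # Sign of (J - R_target): -1 below the target, +1 above, 0 exactly on it.
    sign = (est_return > target_return) - (est_return < target_return)
    return [sign * gj for gj in grad_j]

# Two sampled episodes: action 0 earned return 1.0, action 1 earned 0.0.
grad = return_term_gradient([0.0, 0.0], [([0], 1.0), ([1], 0.0)], target_return=1.0)
```

Because the estimated return is below the target in this example, a gradient descent step on the loss raises the logit of the rewarded action. If the target were below the current return, the sign would flip and the same estimator would push the policy toward lower returns.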

KL-Pivoting Updater (Optimization Loop)

Updates the reference 'pivot' policy every m steps to allow larger shifts toward the target

Model or implementation: Iterative constraint relaxation
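
The effect of pivoting can be seen on a toy problem. The sketch below is not the paper's implementation: it assumes a one-parameter Bernoulli policy whose expected return equals its probability p of choosing the rewarded action, uses exact gradients instead of Monte Carlo estimates, and all names and default values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def optimize(p0, target, k, m=None, delta=0.05, lr=0.5, max_steps=2000):
    """Minimize |p - target| + k * KL(Bern(pivot) || Bern(p)) over the logit of p.

    m=None keeps the reference fixed at p0; otherwise the reference (pivot)
    is reset to the current policy every m steps.
    """
    theta = math.log(p0 / (1.0 - p0))
    pivot = p0
    for step in range(max_steps):
        p = sigmoid(theta)
        if abs(p - target) < delta:          # stopping threshold reached
            break
        if m is not None and step > 0 and step % m == 0:
            pivot = p                        # KL pivot moves to the current policy
        sign = 1.0 if p > target else -1.0
        # d|p - target|/dtheta = sign * p(1 - p);  dKL/dtheta = p - pivot
        grad = sign * p * (1.0 - p) + k * (p - pivot)
        theta -= lr * grad
    return sigmoid(theta)

fixed = optimize(0.2, 0.8, k=10)           # reference stays at the original policy
pivoted = optimize(0.2, 0.8, k=10, m=10)   # reference updated every 10 steps
```

With a fixed reference and a large k, the KL penalty balances the return gradient after a small move and the policy stalls near its starting point. Resetting the pivot periodically lets each short phase make a small, stable step, so the sequence of phases can reach a distant target.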

Novel Architectural Elements

Interpretation of Trust Region methods as a specific case of Counterfactual Explanation Policies where R_target is maximized
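
One way to see the connection, sketched under the assumption that R_target is at least as large as any achievable return (so the scalar return error has a fixed sign); the paper's formal statement and proof may differ in detail:

```latex
% Assumption: R_target upper-bounds the return of every policy pi_theta.
\begin{align*}
R_{\text{target}} \ge J_{\pi_\theta} \ \ \forall \theta
  &\;\Longrightarrow\;
  \lVert J_{\pi_\theta} - R_{\text{target}} \rVert_p
  = R_{\text{target}} - J_{\pi_\theta}, \\
\arg\min_{\theta}\, \bigl( R_{\text{target}} - J_{\pi_\theta}
  + k\, D_{\mathrm{KL}}(\pi_0 \,\|\, \pi_\theta) \bigr)
  &= \arg\max_{\theta}\, \bigl( J_{\pi_\theta}
  - k\, D_{\mathrm{KL}}(\pi_0 \,\|\, \pi_\theta) \bigr).
\end{align*}
```

The right-hand side is the KL-penalized form of a trust-region policy update, with k playing the role of the penalty coefficient.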

Modeling

Base Model: Policy Neural Networks (architecture depends on environment, typically MLPs for control tasks)

Training Method: On-policy Monte Carlo Policy Gradients with KL regularization

Objective Functions:

Purpose: Minimize the difference between the current and target return.
Formally: L_ret = ||J_π - R_target||_p

Purpose: Keep the new policy close to the reference policy.
Formally: L_KL = D_KL(π0 || π)

Purpose: Combine the two objectives.
Formally: argmin_θ (L_ret + k * L_KL)

Adaptation: Fine-tuning weights of the pre-trained policy network

Key Hyperparameters:

KL_regularization_coefficient_k: Varies by env: {10, 1, 10^5, 10, 1}
stopping_threshold_delta: Varies by env: {10, 2.5, 37.5, 5, 10}
pivoting_iterations_m: 10 (5 for BipedalWalker)
rollout_episodes_N: 10 (2 for BipedalWalker)

Compute: Single NVIDIA A100 (40GB) GPU

Comparison to Prior Work

vs. Saliency/Attribution: COUNTERPOL generates a full *policy* as explanation rather than highlighting input features
vs. TRPO/PPO: COUNTERPOL can target *specific* returns (including lower ones for unlearning), whereas TRPO only maximizes return. The paper shows that TRPO is the special case of COUNTERPOL in which the target is the maximum return.
vs. Causal RL (Gershman, 2017) [not cited in paper]: COUNTERPOL modifies the policy for explainability rather than learning the causal structure of the environment

Limitations

KL penalty conservatively restricts policy exploration, requiring heuristic 'pivoting' to reach distant targets
Current implementation uses sample-inefficient on-policy Monte Carlo gradients
Analysis limited to standard OpenAI Gym control environments (no high-dimensional visual tasks like Atari)
Requires defining a meaningful target return R_target, which may be non-trivial in some domains

Reproducibility

Code provided in supplementary material (but no public URL listed in paper text). Hyperparameters for all 5 environments (delta, k, m, N) are explicitly listed. Uses standard Stable-Baselines3 implementations for base agents.

📊 Experiments & Results

Evaluation Setup

Generating counterfactual policies for pre-trained agents (A2C/PPO) to hit specific target returns (higher and lower)

Benchmarks:

CartPole-v1 (Classic Control)
Acrobot-v1 (Classic Control)
Pendulum-v1 (Classic Control)
LunarLander-v2 (Box2D Control)
BipedalWalker-v3 (Box2D Control)

Metrics:

Return (J_π): Achieved average return of the counterfactual policy
Number of outer policy updates (n_π): Convergence speed
Statistical methodology: Experiments run with 3 different seeds; means and standard deviations reported
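
A small sketch of the seed aggregation described above. The per-seed values are placeholders invented for the example, not results from the paper, and the paper does not state whether it reports the sample or population standard deviation; the sample version is assumed here.

```python
from statistics import mean, stdev

# Hypothetical per-seed returns for one (environment, target) pair;
# placeholder numbers, not results from the paper.
returns_by_seed = [449.0, 451.0, 453.0]

avg = mean(returns_by_seed)
spread = stdev(returns_by_seed)  # sample standard deviation (n - 1 denominator)
print(f"{avg:.1f} +/- {spread:.1f}")
```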

Key Results

Optimization accuracy results showing COUNTERPOL consistently hits diverse target returns across control environments.
Benchmark	Metric	Target return	This Paper	Δ
CartPole-v1	Return (J_π)	450	450.6	+0.6
CartPole-v1	Return (J_π)	50	48.6	-1.4
Pendulum-v1	Return (J_π)	-1000	-996.9	+3.1
Acrobot-v1	Return (J_π)	-80	-80.6	-0.6
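
The Δ column can be reproduced as achieved return minus the value each row is compared against, which is read here as the target return (consistent with the Pendulum-v1 example of -996.9 for a target of -1000). The relative-error figure is an extra quantity computed for illustration and is not reported in the paper.

```python
# (environment, target return, achieved return) copied from the table rows.
rows = [
    ("CartPole-v1", 450.0, 450.6),
    ("CartPole-v1", 50.0, 48.6),
    ("Pendulum-v1", -1000.0, -996.9),
    ("Acrobot-v1", -80.0, -80.6),
]

def delta(target, achieved):
    """Signed gap between achieved and target return, rounded to one decimal."""
    return round(achieved - target, 1)

deltas = [delta(t, a) for _, t, a in rows]
relative_errors = [abs(a - t) / abs(t) for _, t, a in rows]

for (env, t, a), d, rel in zip(rows, deltas, relative_errors):
    print(f"{env}: target={t:g} achieved={a:g} delta={d:+.1f} ({rel:.1%} of target)")
```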

Experiment Figures

Visual trajectories of LunarLander-v2 policies: Original vs. Counterfactuals targeting higher (100, 150) and lower (0, -50) returns

BipedalWalker-v3 gait analysis for Original vs. Improved (Target 150) vs. Worsened (Target 50)

Main Takeaways

Optimization Reliability: The framework reliably converges to specified target returns (both higher and lower than original) across all tested environments
Contrastive Behavior: Qualitative analysis shows distinct behavioral changes; e.g., LunarLander 'free falls' to achieve low returns vs 'decelerated landing' for high returns
Unlearning Utility: The ability to target low returns systematically allows the extraction of policies that have 'unlearned' skills (e.g., inducing a slow start or dragging legs)
Proximity: The generated policies visually resemble the original policies (e.g., similar gait style in BipedalWalker) while adjusting specific mechanics to meet the return target

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDPs, Policy Gradients)
Constrained Optimization (Lagrangian multipliers)
KL Divergence
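
For continuous-action tasks, policies are commonly parameterized as Gaussians, in which case the KL divergence has a closed form. The sketch below covers the scalar case only; treating the policy as a single univariate Gaussian per state is an assumption made for illustration, and the function name is hypothetical.

```python
import math

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """D_KL(N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2)) for scalar Gaussians."""
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
        - 0.5
    )

identical = gaussian_kl(0.0, 1.0, 0.0, 1.0)   # same distribution
shifted = gaussian_kl(0.0, 1.0, 1.0, 1.0)     # mean moved by one standard deviation
forward = gaussian_kl(0.0, 1.0, 0.0, 2.0)
reverse = gaussian_kl(0.0, 2.0, 0.0, 1.0)
```

The divergence is zero only when the two distributions match, and it is not symmetric: forward and reverse differ, which is why the direction D_KL(π0 || π) in the objective matters.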

Key Terms

counterfactual explanation: An explanation describing the minimal change required to an input (or in this case, a policy) to produce a different specified outcome

trust region methods: Optimization techniques (like TRPO/PPO) that restrict policy updates to a specific neighborhood to ensure stability and monotonic improvement

KL-pivoting: An iterative update strategy where the reference policy (pivot) for the distance constraint is updated periodically, allowing the search to move further from the original start point

proximal operator: A mathematical tool used in optimization to solve problems by keeping the solution close to a previous point, often using a distance penalty

A2C: Advantage Actor-Critic, a synchronous, deterministic variant of the A3C reinforcement learning algorithm

PPO: Proximal Policy Optimization, a policy gradient method that uses a clipped objective function to keep updates approximately within a trust region

TRPO: Trust Region Policy Optimization, an RL algorithm that constrains the KL divergence between successive policies, which yields a monotonic improvement guarantee in theory (practical implementations approximate it)