Interpretable Reward Redistribution in Reinforcement Learning: A Causal Approach

📝 Paper Summary

Reward Redistribution, Causal Reinforcement Learning, Return Decomposition

Generative Return Decomposition (GRD) recovers the underlying causal structure of Markovian rewards to decompose delayed returns into interpretable proxy rewards and identify a minimal state representation for efficient policy learning.

Core Problem

In reinforcement learning with delayed or episodic rewards, it is difficult to determine which specific state-action pairs contributed to the final outcome, leading to inefficient policy optimization.

Why it matters:

Standard RL struggles with sparse/delayed feedback because the credit assignment problem is ambiguous without immediate rewards
Existing return decomposition methods use uninterpretable models (like LSTMs) or hand-designed rules, failing to explain *why* a state-action pair is valuable
Including irrelevant state dimensions in the policy input adds noise and complexity, slowing down convergence in high-dimensional environments

Concrete Example: In the 'Ant' robot task, the robot has 111 state dimensions, but 84 are unused noise. A standard RL agent might struggle to learn which dimensions matter for the delayed episode reward, whereas GRD identifies that only specific dimensions causally affect the reward.

Key Novelty

Generative Return Decomposition (GRD)

Models the long-term return as the causal effect of a sequence of unobserved Markovian rewards within a generative process
Uses a Dynamic Bayesian Network (DBN) to explicitly learn binary masks representing causal edges between states, actions, and rewards
Derives a 'compact representation' (minimal sufficient state set) containing only state dimensions that causally influence the reward, filtering out noise for the policy

Architecture

The framework of GRD showing the interplay between the Generative Model (causal discovery) and the Policy Model.

Evaluation Highlights

Outperforms state-of-the-art RRD and IRCR on 8 MuJoCo tasks with episodic rewards, reaching roughly 1.2x the baseline return on HalfCheetah (about 15000 vs. 12500) and roughly 1.5x on HumanoidStandup (about 230000 vs. 150000)
Achieves superior sample efficiency in high-dimensional state spaces (e.g., HumanoidStandup with 376 dims) by filtering irrelevant features
Demonstrates robustness to Gaussian noise added to irrelevant state dimensions, maintaining performance while baselines degrade

Breakthrough Assessment

8/10

Strong theoretical grounding in causal identifiability combined with practical state-of-the-art performance on standard benchmarks. The ability to interpret *why* rewards are assigned via causal graphs is a significant advance over black-box return decomposition.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (MDP) with delayed rewards, where the agent observes trajectory-wise return R but not individual Markovian rewards r_t

Inputs: Trajectories of (state, action) pairs and a final episodic/delayed return R

Outputs: Decomposed per-step proxy rewards r_t and a learned policy π optimized on a compact state representation

Pipeline Flow

Causal Discovery: Learn binary masks for state-to-reward, action-to-reward, and dynamics dependencies
Reward & Dynamics Learning: Train functions to predict return and next states using masked inputs
Representation Extraction: Identify 'compact representation' (s_min) based on learned causal masks
Policy Optimization: Train SAC agent using s_min as input and the predicted proxy rewards as its per-step reward signal

System Modules

Causal Structure Learner (phi_cau) (Generative Model Estimation)

Learn binary masks representing causal edges

Model or implementation: Learnable parameters representing Bernoulli distributions + Gumbel-Softmax
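
To make the mask-learning idea concrete, here is a minimal sketch of sampling a relaxed binary mask with a two-class Gumbel-Softmax. It is an illustration under stated assumptions, not the paper's implementation: the per-edge log-odds parameterization, the function names, and the temperature value are all hypothetical, and a real implementation would use an autodiff framework so gradients can flow into the logits.

```python
import math
import random

def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via inverse transform sampling."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def relaxed_binary_mask(logits, temperature, rng):
    """Sample one soft mask value in [0, 1] per candidate edge.

    Assumption: logits[i] is the log-odds that edge i exists.
    """
    soft = []
    for logit in logits:
        # Perturb the "edge" score (logit) and the "no edge" score (0) with Gumbel noise.
        on = (logit + sample_gumbel(rng)) / temperature
        off = sample_gumbel(rng) / temperature
        # Two-class softmax, shifted by the max for numerical stability.
        m = max(on, off)
        e_on, e_off = math.exp(on - m), math.exp(off - m)
        soft.append(e_on / (e_on + e_off))
    return soft

def harden(soft_mask, threshold=0.5):
    """Turn a soft mask into a binary mask by thresholding."""
    return [1 if p > threshold else 0 for p in soft_mask]

rng = random.Random(0)
edge_logits = [4.0, -4.0, 0.0]  # hypothetical: likely edge, unlikely edge, undecided
soft = relaxed_binary_mask(edge_logits, temperature=0.5, rng=rng)
hard = harden(soft)
```

Lower temperatures push the soft values toward 0 or 1, while the probability that a hardened entry equals 1 stays at the sigmoid of its logit.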

Reward Predictor (phi_rew) (Generative Model Estimation)

Approximate the Markovian reward function g

Model or implementation: Fully-connected network
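
The sketch below illustrates how a binary mask can gate the inputs of a reward predictor, so that dimensions without an edge to the reward cannot influence the prediction. Everything here is assumed for illustration: the tiny bias-free network, its weights, and the 3-dimensional state are hypothetical stand-ins for the fully-connected network named above.

```python
import math

def masked(vector, mask):
    """Zero out entries whose causal mask is 0."""
    return [x * m for x, m in zip(vector, mask)]

def tiny_mlp(inputs, w_hidden, w_out):
    """One tanh hidden layer followed by a linear output (no biases, for brevity)."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in w_hidden]
    return sum(w * h for w, h in zip(w_out, hidden))

def predict_reward(state, action, state_mask, action_mask, w_hidden, w_out):
    """r_hat_t = g(mask_s * s_t, mask_a * a_t), with g a stand-in network."""
    inputs = masked(state, state_mask) + masked(action, action_mask)
    return tiny_mlp(inputs, w_hidden, w_out)

# Hypothetical setup: 3-dim state, 1-dim action; state dim 2 has no edge to the reward.
state_mask, action_mask = [1, 1, 0], [1]
w_hidden = [[0.5, -0.3, 0.8, 0.1], [0.2, 0.4, -0.6, 0.7]]
w_out = [1.0, -1.0]

r_clean = predict_reward([0.1, 0.2, 0.3], [0.5], state_mask, action_mask, w_hidden, w_out)
# Corrupting the masked-out dimension leaves the prediction unchanged.
r_noisy = predict_reward([0.1, 0.2, 99.0], [0.5], state_mask, action_mask, w_hidden, w_out)
# Changing a dimension that does have an edge to the reward changes the prediction.
r_other = predict_reward([0.9, 0.2, 0.3], [0.5], state_mask, action_mask, w_hidden, w_out)
```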

Dynamics Predictor (phi_dyn) (Generative Model Estimation)

Approximate transition dynamics f to aid causal discovery

Model or implementation: Mixture Density Network
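
As a rough illustration of what a mixture density output looks like, the following sketch computes the negative log-likelihood of one scalar next-state dimension under a Gaussian mixture. The mixture weights, means, and standard deviations are hypothetical fixed numbers here; in a mixture density network they would be produced by a network from the masked state and action.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_nll(target, weights, means, stds):
    """Negative log-likelihood of a scalar target under a 1-D Gaussian mixture."""
    likelihood = sum(
        w * gaussian_pdf(target, m, s) for w, m, s in zip(weights, means, stds)
    )
    # The small constant guards against log(0) when the target is far from every mode.
    return -math.log(likelihood + 1e-12)

# Hypothetical mixture parameters for one next-state dimension.
weights, means, stds = [0.7, 0.3], [0.0, 2.0], [0.5, 0.5]
nll_near_mode = mixture_nll(0.1, weights, means, stds)
nll_far_away = mixture_nll(5.0, weights, means, stds)
```

Summing this quantity over state dimensions and time steps gives a dynamics loss of the same general shape as a negative log-likelihood objective.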

Policy Agent (phi_pi)

Learn optimal actions using decomposed rewards

Model or implementation: Soft Actor-Critic (SAC)

Novel Architectural Elements

Integration of learnable binary causal masks directly into the reward and dynamics prediction networks
Use of 'Compact Representation' (mask-filtered state) as the explicit input for the policy network, structurally enforcing causal feature selection
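
One way to picture the compact representation is as an index set derived from the learned masks. The sketch below keeps every state dimension with an edge to the reward and, as an assumption made for this example, also keeps dimensions that influence an already-kept dimension through the dynamics. The mask layout and all names are hypothetical.

```python
def compact_representation(state_to_reward, state_to_state):
    """Return sorted indices of the state dimensions kept in s_min.

    state_to_reward[i] == 1: dimension i has an edge to the reward.
    state_to_state[i][j] == 1: dimension i at time t has an edge to
    dimension j at time t + 1.
    """
    keep = {i for i, edge in enumerate(state_to_reward) if edge}
    changed = True
    while changed:  # repeat until no new dimension is added
        changed = False
        for i, row in enumerate(state_to_state):
            if i not in keep and any(row[j] for j in keep):
                keep.add(i)
                changed = True
    return sorted(keep)

def select(state, indices):
    """Build the policy input from the kept dimensions only."""
    return [state[i] for i in indices]

# Hypothetical 4-dim example: dim 0 affects the reward, dim 1 drives dim 0,
# and dims 2 and 3 only affect themselves.
state_to_reward = [1, 0, 0, 0]
state_to_state = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
s_min_idx = compact_representation(state_to_reward, state_to_state)  # [0, 1]
policy_input = select([0.3, -1.2, 5.0, 7.0], s_min_idx)              # [0.3, -1.2]
```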

Modeling

Base Model: Soft Actor-Critic (SAC) for policy; MLP + Gumbel-Softmax for causal model

Training Method: Joint optimization of Generative Model (phi_m) and Policy (phi_pi)

Objective Functions:

Purpose: Ensure the discounted sum of decomposed rewards matches the observed episodic return.

Formally: L_rew = E[ || R - sum(gamma^{t-1} * r_hat_t) ||^2 ]

Purpose: Maximize the likelihood of the next-state prediction under the dynamics model.

Formally: L_dyn = - sum(log P(s_{t+1} | s_t, a_t, masks))

Purpose: Encourage sparsity in the causal graph structure.

Formally: L_reg = sum(log P(edge_exists))

Purpose: Maximize the policy's expected return and entropy (SAC objective), expressed as minimizing a KL divergence between the policy and the exponentiated soft Q-values.

Formally: J_pi = E[ D_KL( pi || exp(Q - V) ) ]
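
As a worked example of the return decomposition loss L_rew, the sketch below compares each observed return R with the discounted sum of predicted proxy rewards and averages the squared gap over a batch. The batch contents are placeholder numbers, and gamma is set to 1.0 only to keep the arithmetic easy to check by hand.

```python
def discounted_sum(rewards, gamma):
    """sum over t of gamma^(t-1) * r_t, with t starting at 1."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def return_decomposition_loss(batch, gamma):
    """Mean squared gap between each observed return and its proxy-reward sum."""
    errors = [(R - discounted_sum(r_hat, gamma)) ** 2 for R, r_hat in batch]
    return sum(errors) / len(errors)

# Placeholder batch of (observed return R, predicted proxy rewards r_hat).
batch = [
    (3.0, [1.0, 1.0, 1.0]),  # proxy rewards sum to 3.0, so the error is 0
    (2.0, [0.5, 0.5, 0.5]),  # proxy rewards sum to 1.5, so the squared error is 0.25
]
loss = return_decomposition_loss(batch, gamma=1.0)  # (0 + 0.25) / 2 = 0.125
```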

Key Hyperparameters:

regularization_lambdas: lambda_1 (state-to-reward) = 5e-6, lambda_2 (action-to-reward) = 5e-6, lambda_3 (state-to-state) = 5e-4, lambda_4 (self-loop) = 5e-4, lambda_5 (action-to-state) = 5e-4
discount_factor_gamma: 0.99
episode_length: 1000
batch_size: Not reported in the paper
learning_rate: Not explicitly reported in the paper (likely standard SAC defaults)
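
The regularization weights listed above could enter the sparsity term roughly as sketched below. This is an assumption-laden illustration: it takes the form sum(log P(edge_exists)) from the objective list, models each edge probability as the sigmoid of a learnable logit, and applies one lambda per edge group. The grouping, the logit parameterization, and the example logits are hypothetical.

```python
import math

def log_edge_prob(logit):
    """log P(edge exists) = log sigmoid(logit), computed in a numerically stable way."""
    if logit >= 0:
        return -math.log1p(math.exp(-logit))
    return logit - math.log1p(math.exp(logit))

# Weights as listed in the hyperparameters; the dictionary keys are hypothetical names.
LAMBDAS = {
    "state_to_reward": 5e-6,
    "action_to_reward": 5e-6,
    "state_to_state": 5e-4,
    "self_loop": 5e-4,
    "action_to_state": 5e-4,
}

def sparsity_penalty(edge_logits, lambdas=LAMBDAS):
    """Lambda-weighted sum of log edge probabilities over all edge groups."""
    return sum(
        lambdas[group] * sum(log_edge_prob(x) for x in logits)
        for group, logits in edge_logits.items()
    )

# Hypothetical logits: the sparse graph has lower edge probabilities everywhere.
dense_graph = {"state_to_reward": [3.0, 3.0], "state_to_state": [3.0, 3.0]}
sparse_graph = {"state_to_reward": [3.0, -3.0], "state_to_state": [-3.0, -3.0]}
penalty_dense = sparsity_penalty(dense_graph)
penalty_sparse = sparsity_penalty(sparse_graph)
```

Minimizing this term lowers edge probabilities, and the larger lambdas on the dynamics edges penalize those groups more strongly than the reward edges.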

Compute: Not reported in the paper

Comparison to Prior Work

vs. RUDDER/RRD: GRD explicitly models the *structure* of reward generation (which state dims matter), providing interpretability and a compact state for policy learning, whereas RUDDER/RRD use black-box models.
vs. IRCR: GRD learns non-uniform, state-dependent rewards based on causal influence, whereas IRCR often results in simpler redistribution profiles.
vs. Causal RL (e.g., Huang et al. 2021) [not cited in paper]: GRD specifically targets the *delayed reward* setting with return decomposition, rather than just transfer or dynamics learning.

Limitations

Assumes the underlying causal structure is a DAG and time-invariant (stationary reward function)
Assumes no unobserved confounders in the MDP
Requires learning a dynamics model, which can be difficult in very complex environments
Relies on the return decomposition assumption (sum of rewards = return)

Reproducibility

Code: https://reedzyd.github.io/GenerativeReturnDecomposition/

Code is publicly available at the project page. Hyperparameters for regularization (lambdas) are provided in the appendix. Specific architecture sizes (layer counts/widths) for the causal modules are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

MuJoCo locomotion tasks where the agent only receives a single scalar reward at the end of the episode (accumulated return).
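
A minimal sketch of the episodic-reward setting, assuming it is built by hiding the per-step rewards of a dense-reward task and revealing only their sum at the final step. The function name and the example rewards are hypothetical, and the source does not say how the benchmark environments implement this.

```python
def to_episodic_rewards(step_rewards):
    """Hide per-step rewards: zeros everywhere, the full return at the last step."""
    if not step_rewards:
        return []
    return [0.0] * (len(step_rewards) - 1) + [float(sum(step_rewards))]

dense = [1.0, 0.5, -0.25, 2.0]         # placeholder per-step rewards
episodic = to_episodic_rewards(dense)  # [0.0, 0.0, 0.0, 3.25]
```

The total return is unchanged, but the agent can no longer see which steps earned it, which is the credit assignment problem that return decomposition targets.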

Benchmarks:

MuJoCo (HalfCheetah, Ant, Walker2d, Humanoid, Swimmer, Hopper, HumanoidStandup, Reacher) (Continuous control with episodic rewards)

Metrics:

Average Accumulative Reward
Statistical methodology: Mean and standard deviation over 5 seeds
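
For reference, aggregating a metric over 5 seeds can be sketched as follows. The per-seed values are placeholders rather than results from the paper, and whether the paper reports the sample or the population standard deviation is not stated in this summary, so both are shown.

```python
import statistics

# Placeholder per-seed returns (not taken from the paper).
per_seed_returns = [1.0, 2.0, 3.0, 4.0, 5.0]

mean_return = statistics.mean(per_seed_returns)         # 3.0
sample_std = statistics.stdev(per_seed_returns)         # divides by n - 1
population_std = statistics.pstdev(per_seed_returns)    # divides by n
```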

Key Results

Main comparison on MuJoCo tasks with episodic rewards shows GRD consistently outperforming baselines.

Benchmark	Metric	Baseline	This Paper	Δ
HalfCheetah-v2	Average Accumulative Reward	12500	15000	+2500
Ant-v2	Average Accumulative Reward	6000	6500	+500
HumanoidStandup-v2	Average Accumulative Reward	150000	230000	+80000

Ablation study demonstrating the impact of the Compact Representation (CR).

Benchmark	Metric	Baseline	This Paper	Δ
HalfCheetah-v2	Average Accumulative Reward	13000	15000	+2000
Ant-v2	Average Accumulative Reward	5000	6500	+1500
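
The absolute differences in the main comparison can also be read as ratios. The short calculation below uses only the baseline and GRD values listed for the main comparison; the variable names are illustrative.

```python
# (benchmark, baseline return, GRD return) from the main comparison rows.
main_comparison = [
    ("HalfCheetah-v2", 12500, 15000),
    ("Ant-v2", 6000, 6500),
    ("HumanoidStandup-v2", 150000, 230000),
]

# Ratio of GRD return to baseline return, rounded to two decimals.
relative_gain = {
    name: round(ours / baseline, 2) for name, baseline, ours in main_comparison
}
# {'HalfCheetah-v2': 1.2, 'Ant-v2': 1.08, 'HumanoidStandup-v2': 1.53}
```

By this measure the largest relative gain in the table is on HumanoidStandup-v2, the task with the highest-dimensional state.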

Experiment Figures

Learning curves (Accumulative Reward vs Time Steps) for 8 MuJoCo tasks.

Heatmaps of the learned causal matrices at different training stages (1e4 to 1e6 steps) for the Ant task.

Comparison of decomposed proxy rewards vs ground truth rewards over time.

Main Takeaways

GRD consistently outperforms RRD and IRCR across all tested MuJoCo tasks in the episodic reward setting.
The use of 'Compact Representation' (filtering state inputs based on causal masks) provides a significant performance boost over using the full state, especially in high-dimensional tasks.
The method correctly recovers the ground truth reward structure (as seen in visualizations for Ant), filtering out irrelevant noise dimensions.
GRD is robust to Gaussian noise added to irrelevant state dimensions, whereas baselines degrade, which supports the efficacy of the causal filtering.

📚 Prerequisite Knowledge

Prerequisites

Markov Decision Processes (MDP)
Causal discovery / Dynamic Bayesian Networks (DBN)
Return decomposition hypothesis
Soft Actor-Critic (SAC)

Key Terms

Markovian reward: The immediate, unobserved reward r_t explicitly associated with a state-action pair at time t, as opposed to the delayed accumulated return

return decomposition: Techniques to break down a single long-term return value into a sequence of individual proxy rewards for each time step

compact representation: A subset of state dimensions identified as causally relevant to the reward function, used to reduce the input space for the policy

Gumbel-Softmax: A reparameterization trick allowing differentiable sampling from categorical distributions, used here to learn binary causal masks

RUDDER: A baseline method that uses LSTMs to redistribute rewards to key events in a sequence

MDP: Markov Decision Process—a mathematical framework for modeling decision making where outcomes are partly random and partly under the control of a decision maker

DBN: Dynamic Bayesian Network—a graphical model that relates variables to each other over adjacent time steps