
Beyond Simple Sum of Delayed Rewards: Non-Markovian Reward Modeling for Reinforcement Learning

Yuting Tang, Xin-Qiang Cai, Jing-Cheng Pang, Qiyu Wu, Yao-Xiang Ding, Masashi Sugiyama
arXiv (2024)
RL

📝 Paper Summary

Tags: Reinforcement Learning with Delayed Rewards · Reward Modeling
CoDeTr models delayed feedback as a weighted sum of non-Markovian components using an in-sequence attention mechanism to capture the disproportionate impact of critical moments.
Core Problem
Existing delayed reward methods assume rewards are Markovian (depend only on current state) and additive (equal-weighted sum), failing to capture complex dependencies and critical moments in real-world feedback.
Why it matters:
  • Human evaluators often focus on pivotal moments rather than weighing all steps equally, violating standard additive assumptions.
  • Real-world rewards frequently depend on trajectory history (non-Markovian) rather than just the immediate state-action pair.
  • Current methods misallocate credit in scenarios where specific actions disproportionately influence the final outcome, leading to suboptimal policy learning.
Concrete Example: In high-stakes environments like firefighting, experts focus intensely on a few critical cues that determine the outcome. Traditional methods treating every moment as equally contributing to the final delayed reward fail to identify these key turning points.
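A toy numerical sketch of this failure mode (all step values and weights are invented for illustration, not taken from the paper): when one step dominates the delayed outcome, an equal-weighted sum dilutes its contribution, while a concentrated weighting preserves it.

```python
# Toy illustration: why an equal-weighted sum of per-step contributions
# can hide a critical moment that dominates the episode's delayed feedback.
per_step_quality = [0.1, 0.1, 0.9, 0.1, 0.1]  # step 2 is the pivotal action

# Standard additive assumption: every step contributes equally.
equal_weights = [1.0 / len(per_step_quality)] * len(per_step_quality)
additive_estimate = sum(w * r for w, r in zip(equal_weights, per_step_quality))

# Non-additive view: a learned weighting concentrates credit on step 2.
learned_weights = [0.05, 0.05, 0.8, 0.05, 0.05]
weighted_estimate = sum(w * r for w, r in zip(learned_weights, per_step_quality))

print(round(additive_estimate, 3))  # 0.26 — the critical step is diluted
print(round(weighted_estimate, 3))  # 0.74 — credit lands where it matters
```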
Key Novelty
Composite Delayed Reward Transformer (CoDeTr)
  • Models sequence-level rewards as a weighted sum of non-Markovian instance rewards, where weights are learned rather than fixed.
  • Uses a causal transformer to capture historical context for each step, ensuring the reward model understands temporal dependencies.
  • Applies an in-sequence attention mechanism to assign varying importance to different time steps, allowing the model to focus on critical moments within a trajectory.
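The prediction structure described above can be sketched in a few lines of NumPy. This is a minimal stand-in, not the paper's implementation: a running-mean feature map plays the role of the causal transformer, and the linear "reward" and "attention" heads with random parameters are hypothetical.

```python
# Minimal sketch of the CoDeTr prediction structure: non-Markovian per-step
# instance rewards combined through learned in-sequence attention weights.
import numpy as np

rng = np.random.default_rng(0)

def causal_features(obs):
    # Stand-in for a causal transformer: each step's feature is the running
    # mean of observations so far, so step t only depends on steps <= t.
    cumsum = np.cumsum(obs, axis=0)
    counts = np.arange(1, len(obs) + 1)[:, None]
    return cumsum / counts

def sequence_reward(obs, w_r, w_a):
    h = causal_features(obs)        # (T, d) history-aware features
    instance_rewards = h @ w_r      # non-Markovian per-step rewards
    scores = h @ w_a                # unnormalized attention scores
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()              # in-sequence attention weights, sum to 1
    return float(attn @ instance_rewards), attn

T, d = 6, 4
obs = rng.normal(size=(T, d))       # toy trajectory observations
w_r = rng.normal(size=d)            # reward head (hypothetical parameters)
w_a = rng.normal(size=d)            # attention head (hypothetical parameters)

seq_reward, attn = sequence_reward(obs, w_r, w_a)
print(attn.round(3))                # per-step importance within the sequence
```

In training, the scalar `seq_reward` would be fit to the observed delayed feedback, and the learned `attn` weights reveal which steps the model treats as critical.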
Evaluation Highlights
  • Outperforms state-of-the-art delayed reward baselines (HC-Decomposition, IRCR, LIRPG) on MuJoCo locomotion tasks with composite delayed rewards.
  • Accurately recovers the underlying importance of specific time steps, assigning higher attention weights to critical intervals compared to uniform baselines.
  • Demonstrates robust performance even when the delayed reward function involves complex, non-linear aggregations like min/max operations over the sequence.
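To make the last point concrete, here are toy examples (values invented) of the kinds of non-linear aggregations the evaluation refers to: the delayed feedback for a whole sequence may be a min or max over instance rewards rather than a plain sum.

```python
# Examples of composite delayed rewards over per-step instance rewards.
instance_rewards = [2.0, -1.0, 3.0, 0.5]

composite_sum = sum(instance_rewards)  # standard additive delayed reward
composite_min = min(instance_rewards)  # "weakest step" determines feedback
composite_max = max(instance_rewards)  # "best moment" determines feedback

print(composite_sum, composite_min, composite_max)  # 4.5 -1.0 3.0
```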
Breakthrough Assessment
7/10
The paper addresses a significant gap in RL by relaxing the restrictive Markovian and additive assumptions for delayed rewards. The transformer-based solution is intuitive and effective, though evaluation is limited primarily to standard MuJoCo tasks modified for this setting.