
Beyond Simple Sum of Delayed Rewards: Non-Markovian Reward Modeling for Reinforcement Learning

Yuting Tang, Xin-Qiang Cai, Jing-Cheng Pang, Qiyu Wu, Yao-Xiang Ding, Masashi Sugiyama
arXiv (2024)
RL

📝 Paper Summary

Tags: Reinforcement Learning with Delayed Rewards · Reward Modeling
CoDeTr models delayed feedback as a weighted sum of non-Markovian components using an in-sequence attention mechanism to capture the disproportionate impact of critical moments.
Core Problem
Existing delayed reward methods assume rewards are Markovian (depend only on current state) and additive (equal-weighted sum), failing to capture complex dependencies and critical moments in real-world feedback.
Why it matters:
  • Human evaluators often focus on pivotal moments rather than weighing all steps equally, violating standard additive assumptions.
  • Real-world rewards frequently depend on trajectory history (non-Markovian) rather than just the immediate state-action pair.
  • Current methods misallocate credit in scenarios where specific actions disproportionately influence the final outcome, leading to suboptimal policy learning.
Concrete Example: In high-stakes environments like firefighting, experts focus intensely on a few critical cues that determine the outcome. Traditional methods treating every moment as equally contributing to the final delayed reward fail to identify these key turning points.
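A toy numerical sketch of this failure mode (all step values and weights are invented for illustration, not taken from the paper): when one step dominates the delayed outcome, an equal-weighted sum dilutes its contribution, while a concentrated weighting preserves it.

```python
# Toy illustration: why an equal-weighted sum of per-step contributions
# can hide a critical moment that dominates the episode's delayed feedback.
per_step_quality = [0.1, 0.1, 0.9, 0.1, 0.1]  # step 2 is the pivotal action

# Standard additive assumption: every step contributes equally.
equal_weights = [1.0 / len(per_step_quality)] * len(per_step_quality)
additive_estimate = sum(w * r for w, r in zip(equal_weights, per_step_quality))

# Non-additive view: a learned weighting concentrates credit on step 2.
learned_weights = [0.05, 0.05, 0.8, 0.05, 0.05]
weighted_estimate = sum(w * r for w, r in zip(learned_weights, per_step_quality))

print(round(additive_estimate, 3))  # 0.26 — the critical step is diluted
print(round(weighted_estimate, 3))  # 0.74 — credit lands where it matters
```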
Key Novelty
Composite Delayed Reward Transformer (CoDeTr)
  • Models sequence-level rewards as a weighted sum of non-Markovian instance rewards, where weights are learned rather than fixed.
  • Uses a causal transformer to capture historical context for each step, ensuring the reward model understands temporal dependencies.
  • Applies an in-sequence attention mechanism to assign varying importance to different time steps, allowing the model to focus on critical moments within a trajectory.
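The prediction structure described above can be sketched in a few lines of NumPy. This is a minimal stand-in, not the paper's implementation: a running-mean feature map plays the role of the causal transformer, and the linear "reward" and "attention" heads with random parameters are hypothetical.

```python
# Minimal sketch of the CoDeTr prediction structure: non-Markovian per-step
# instance rewards combined through learned in-sequence attention weights.
import numpy as np

rng = np.random.default_rng(0)

def causal_features(obs):
    # Stand-in for a causal transformer: each step's feature is the running
    # mean of observations so far, so step t only depends on steps <= t.
    cumsum = np.cumsum(obs, axis=0)
    counts = np.arange(1, len(obs) + 1)[:, None]
    return cumsum / counts

def sequence_reward(obs, w_r, w_a):
    h = causal_features(obs)        # (T, d) history-aware features
    instance_rewards = h @ w_r      # non-Markovian per-step rewards
    scores = h @ w_a                # unnormalized attention scores
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()              # in-sequence attention weights, sum to 1
    return float(attn @ instance_rewards), attn

T, d = 6, 4
obs = rng.normal(size=(T, d))       # toy trajectory observations
w_r = rng.normal(size=d)            # reward head (hypothetical parameters)
w_a = rng.normal(size=d)            # attention head (hypothetical parameters)

seq_reward, attn = sequence_reward(obs, w_r, w_a)
print(attn.round(3))                # per-step importance within the sequence
```

In training, the scalar `seq_reward` would be fit to the observed delayed feedback, and the learned `attn` weights reveal which steps the model treats as critical.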
Evaluation Highlights
  • Outperforms state-of-the-art delayed reward baselines (HC-Decomposition, IRCR, LIRPG) on MuJoCo locomotion tasks with composite delayed rewards.
  • Accurately recovers the underlying importance of specific time steps, assigning higher attention weights to critical intervals compared to uniform baselines.
  • Demonstrates robust performance even when the delayed reward function involves complex, non-linear aggregations like min/max operations over the sequence.
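To make the last point concrete, here are toy examples (values invented) of the kinds of non-linear aggregations the evaluation refers to: the delayed feedback for a whole sequence may be a min or max over instance rewards rather than a plain sum.

```python
# Examples of composite delayed rewards over per-step instance rewards.
instance_rewards = [2.0, -1.0, 3.0, 0.5]

composite_sum = sum(instance_rewards)  # standard additive delayed reward
composite_min = min(instance_rewards)  # "weakest step" determines feedback
composite_max = max(instance_rewards)  # "best moment" determines feedback

print(composite_sum, composite_min, composite_max)  # 4.5 -1.0 3.0
```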
Breakthrough Assessment
7/10
The paper addresses a significant gap in RL by relaxing the restrictive Markovian and additive assumptions for delayed rewards. The transformer-based solution is intuitive and effective, though evaluation is limited primarily to standard MuJoCo tasks modified for this setting.