CoDeMDP: Composite Delayed Reward MDP, a framework in which the reward depends on an entire sequence of state-action pairs and need not decompose into a simple sum of immediate per-step rewards.
Non-Markovian Reward: A reward that depends on the history of states and actions, not just the current state-action pair.
In-sequence Attention: An attention mechanism restricted to the specific sequence associated with a delayed reward, used to determine the contribution weight of each step.
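A minimal sketch of how such per-step contribution weights might be produced: a softmax restricted to the steps of a single sequence, so the weights form a distribution over that sequence's time steps. The scores and the function name here are illustrative, not from the source.

```python
import numpy as np

def in_sequence_attention_weights(scores: np.ndarray) -> np.ndarray:
    """Softmax over the steps of one sequence: the resulting weights
    sum to 1 and can be read as each step's share of the delayed reward."""
    shifted = scores - scores.max()  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical relevance scores for one 4-step sequence.
scores = np.array([0.5, 2.0, 1.0, 0.1])
w = in_sequence_attention_weights(scores)
# w sums to 1; the second step receives the largest weight.
```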
Causal Transformer: A transformer model that processes data sequentially, ensuring predictions at time t only depend on inputs from time 0 to t.
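The causality constraint is usually enforced with an attention mask that zeroes out weights on future positions. A minimal NumPy sketch of such a mask (illustrative, not the paper's implementation):

```python
import numpy as np

def causal_attention(scores: np.ndarray) -> np.ndarray:
    """Mask positions j > i with -inf before the softmax, so the
    output at step t attends only to steps 0..t."""
    T = scores.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above diagonal
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)  # stability
    exp = np.exp(masked)  # exp(-inf) = 0, so future steps get zero weight
    return exp / exp.sum(axis=-1, keepdims=True)

attn = causal_attention(np.zeros((3, 3)))
# Each row t places zero weight on any step after t.
```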
Instance-level Reward: The reward assigned to a single specific state-action pair (time step), as opposed to the delayed reward given for a whole sequence.
Composite Delayed Reward: A single reward signal provided for a sequence of actions, potentially derived from a complex aggregation (e.g., weighted sum, max, min) of underlying components.
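To make the aggregation options concrete, here is a toy example with hypothetical per-step reward components (the values and weights are illustrative only):

```python
# Hypothetical per-step reward components for one 4-step sequence.
components = [1.0, 4.0, 2.5, 0.5]
weights = [0.1, 0.4, 0.4, 0.1]

# Three possible composite delayed rewards for the same sequence:
weighted_sum = sum(w * c for w, c in zip(weights, components))  # 2.75
max_agg = max(components)  # 4.0: only the best step matters
min_agg = min(components)  # 0.5: only the worst step matters
```

Note that only the weighted sum decomposes linearly over steps; `max` and `min` are the cases that make the reward non-Markovian in an essential way.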