Return Augmented Decision Transformer for Off-Dynamics Reinforcement Learning

📝 Paper Summary

Offline Reinforcement Learning · Off-Dynamics Reinforcement Learning · Sim-to-Real Transfer

REAG enables Decision Transformers to learn optimal target policies from source domain data with mismatched dynamics by augmenting the source returns to match the target return distribution.

Core Problem

Policies trained in a source environment (e.g., simulation) often fail in a target environment (e.g., the real world) whose dynamics differ, and existing reward augmentation methods designed for dynamic programming do not carry over to return-conditioned supervised learning.

Why it matters:

Collecting data in target environments (e.g., medical treatment, autonomous driving) is often costly, unethical, or dangerous, necessitating training on safer source domains
Previous augmentation methods (DARA) rely on trajectory matching that is incompatible with the return-conditioned nature of Decision Transformers, which generate a family of policies rather than a single optimal one
Directly applying source-trained policies to target environments leads to catastrophic failures due to the simulation-to-reality gap

Concrete Example: In a Hopper task where the source agent has a crippled leg (modified dynamics) but the target agent is healthy, a standard Decision Transformer trained on source data learns to compensate for the limp. When deployed on the healthy target agent, this compensation results in suboptimal, erratic movement because the expected return-to-go no longer aligns with the actual dynamics.

Key Novelty

Return Augmented Decision Transformer (REAG)

Augments the return-to-go labels in the abundant source dataset to align with the return distribution of the scarce target dataset, bridging the dynamics gap
Proposes two variants: REAG-DARA (derived from probabilistic trajectory matching) and REAG-MV (directly matching the mean and variance of return distributions via Laplace approximation)
Unlike reward augmentation which modifies immediate rewards, this modifies the *conditioning* variable (return-to-go), allowing the DT to learn a policy that generalizes across desired returns in the target environment

Architecture

Conceptual flow of the REAG framework: Source data returns are transformed via ψ(g) before training the DT.

Evaluation Highlights

REAG-MV improves Decision Transformer performance by ~15-30% over standard DT on D4RL MuJoCo tasks with mismatched dynamics (e.g., gravity changes, crippled joints)
Achieves comparable suboptimality to training directly on target data, despite using primarily source data with modified dynamics
REAG-MV consistently outperforms the REAG-DARA variant, showing that direct return distribution matching is more effective for RCSL than trajectory-likelihood matching

Breakthrough Assessment

7/10

First method to adapt Decision Transformers to off-dynamics settings via return augmentation. Theoretically grounded and empirically effective, though primarily evaluated on standard MuJoCo modifications.

⚙️ Technical Details

Problem Definition

Setting: Offline Off-Dynamics RL: Access to a large offline source dataset D_S (source dynamics P_S) and a small offline target dataset D_T (target dynamics P_T), sharing the same reward function R.

Inputs: Offline datasets containing trajectories (states, actions, rewards) from both source and target domains.

Outputs: A policy π that maximizes expected return in the target environment P_T.

Pipeline Flow

Data Collection: Gather large source dataset D_S and small target dataset D_T
Return Estimation: Estimate return distributions G_S and G_T for source and target datasets (Mean and Variance estimation)
Augmentation: Transform returns in D_S using transformation function ψ to create D_S_tilde
Training: Train Decision Transformer on combined dataset D_T ∪ D_S_tilde
Inference: Condition trained DT on high target return to generate actions
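
The label-preparation part of the pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes undiscounted returns-to-go, and it treats ψ as a plain function of the return alone, whereas the summary describes statistics estimated per state-action pair. The names `returns_to_go` and `build_training_returns` are hypothetical.

```python
from typing import Callable, List


def returns_to_go(rewards: List[float]) -> List[float]:
    """Undiscounted return-to-go: g_t = r_t + r_{t+1} + ... + r_T."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step adds its own reward to the suffix sum.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def build_training_returns(
    source_rewards: List[List[float]],
    target_rewards: List[List[float]],
    psi: Callable[[float], float],
) -> List[List[float]]:
    """Return-to-go labels for D_T united with D_S_tilde.

    Target labels are kept as-is; source labels are passed through psi.
    psi is a stand-in for the paper's transformation (assumption).
    """
    target_labels = [returns_to_go(traj) for traj in target_rewards]
    source_labels = [[psi(g) for g in returns_to_go(traj)] for traj in source_rewards]
    return target_labels + source_labels


if __name__ == "__main__":
    print(returns_to_go([1.0, 0.5, 2.0]))  # [3.5, 2.5, 2.0]
    # Placeholder affine psi, chosen only to make the effect visible.
    labels = build_training_returns([[1.0, 1.0]], [[3.0]], lambda g: 2.0 * g + 1.0)
    print(labels)  # [[3.0], [5.0, 3.0]]
```

Only the return-to-go labels of the source trajectories change; states, actions, and the target trajectories are left untouched, which is what distinguishes this from reward augmentation.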

System Modules

Return Estimator (Preprocessing)

Estimate the conditional return distribution parameters (mean μ and variance σ²) for state-action pairs in both domains

Model or implementation: MLP Regressors

Return Transformer (Preprocessing)

Apply transformation ψ to source returns g_S to align them with target returns g_T

Model or implementation: Analytical Function (Linear Transformation)
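
One common way to realise a linear, moment-matching transformation is to standardise the source return and rescale it with the target statistics. The sketch below assumes that form, ψ(g) = μ_T + (σ_T / σ_S)(g − μ_S); the exact expression used by REAG-MV should be checked against the paper. The function name, the `eps` guard, and the example numbers are all illustrative.

```python
import math


def match_mean_variance(
    g_source: float,
    mu_s: float,
    var_s: float,
    mu_t: float,
    var_t: float,
    eps: float = 1e-8,
) -> float:
    """Affine map sending a N(mu_s, var_s) return to N(mu_t, var_t).

    In the setting described above the four statistics would come from the
    learned return estimators; here they are plain arguments (assumption).
    """
    # eps avoids division by zero when the source variance estimate collapses.
    scale = math.sqrt(var_t / (var_s + eps))
    return mu_t + scale * (g_source - mu_s)


if __name__ == "__main__":
    # A source return one standard deviation above the source mean ...
    g_tilde = match_mean_variance(10.0, mu_s=8.0, var_s=4.0, mu_t=20.0, var_t=16.0)
    # ... lands about one standard deviation above the target mean (close to 24).
    print(round(g_tilde, 4))
```

Because the map is affine, it preserves the ordering of source returns, so a trajectory that was relatively good in the source domain keeps a relatively high label after augmentation.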

Decision Transformer

Learn policy π(a|s, g) via sequence modeling on the augmented dataset

Model or implementation: GPT-2 based Transformer

Novel Architectural Elements

Return Augmentation Module: A preprocessing step that explicitly maps source returns to target returns using moment matching (REAG-MV) or likelihood ratio (REAG-DARA) before feeding into the DT

Modeling

Base Model: Decision Transformer (based on GPT-2)

Training Method: Supervised Learning (Sequence Modeling) with Augmented Labels

Objective Functions:

Purpose: Maximize likelihood of actions given states and (augmented) returns.

Formally: L = −Σ_t log π(a_t | s_t, g_t), where g_t is the (augmented) return-to-go at step t
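
To make the objective concrete, the sketch below evaluates the negative log-likelihood for a Gaussian policy with a fixed standard deviation. This is an assumption for illustration: the summary does not state the policy's output distribution, and the predicted means here are placeholders for what a Decision Transformer would output when conditioned on states and augmented returns-to-go.

```python
import math
from typing import Sequence


def gaussian_nll(action: Sequence[float], mean: Sequence[float], std: float = 1.0) -> float:
    """-log N(action | mean, std^2 * I) for a single action vector."""
    d = len(action)
    sq_err = sum((a - m) ** 2 for a, m in zip(action, mean))
    return 0.5 * sq_err / std**2 + d * math.log(std) + 0.5 * d * math.log(2.0 * math.pi)


def sequence_loss(
    actions: Sequence[Sequence[float]],
    predicted_means: Sequence[Sequence[float]],
    std: float = 1.0,
) -> float:
    """Sum of per-step negative log-likelihoods over one trajectory."""
    return sum(gaussian_nll(a, m, std) for a, m in zip(actions, predicted_means))


if __name__ == "__main__":
    actions = [[0.1, -0.2], [0.0, 0.3]]
    good = [[0.1, -0.2], [0.0, 0.3]]   # predictions equal to the logged actions
    bad = [[1.0, 1.0], [-1.0, -1.0]]   # predictions far from the logged actions
    print(sequence_loss(actions, good) < sequence_loss(actions, bad))  # True
```

With a fixed standard deviation, this loss differs from a squared-error loss only by a scale factor and an additive constant, so minimising either yields the same predicted means.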

Training Data:

Source: Modified MuJoCo environments (gravity, friction, joint damage)
Target: Original D4RL MuJoCo environments
Ratio: Source dataset much larger than Target dataset

Key Hyperparameters:

learning_rate: 1e-4
batch_size: 64
context_length: 20
activation: ReLU
dropout: 0.1
max_epochs: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. DARA: REAG targets RCSL/DT architectures by augmenting *returns* rather than *rewards*, avoiding the need for optimal trajectory assumptions incompatible with RCSL
vs. Standard DT: REAG explicitly corrects for dynamics mismatch via data augmentation, whereas DT assumes training and test dynamics are identical
vs. H2O: REAG adapts the data labels rather than filtering or weighting the loss function
vs. BOSA [not cited in paper]: BOSA handles dynamics shifts via constrained policy optimization; REAG handles it via supervised label mapping

Limitations

Relies on the assumption that return distributions can be approximated by Gaussians (Laplace approximation) for REAG-MV
Requires a small target dataset to estimate target statistics; cannot work in zero-shot target settings
Performance depends on the quality of the estimated mean and variance models for the return distributions
Experiments limited to MuJoCo locomotion tasks; no high-dimensional visual or real-world robot experiments

Reproducibility

Code availability is not explicitly provided in the paper text or abstract. Mathematical derivations for the augmentation strategies are detailed in the main text and appendix.

📊 Experiments & Results

Evaluation Setup

Offline RL on D4RL MuJoCo benchmarks with modified source dynamics (gravity, friction, joint health)

Benchmarks:

Walker2d (Locomotion)
Hopper (Locomotion)
HalfCheetah (Locomotion)

Metrics:

Normalized Score (0-100 scale based on D4RL reference)
Statistical methodology: Means and standard deviations reported over 4 random seeds
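
For reference, D4RL normalizes a raw episode return against per-environment random and expert reference returns. The sketch below applies that formula and then aggregates over seeds. The reference values and per-seed returns are made-up placeholders, and the use of the sample standard deviation is an assumption, since the summary does not say which estimator the paper reports.

```python
from statistics import mean, stdev
from typing import List, Tuple


def normalized_score(raw_return: float, random_return: float, expert_return: float) -> float:
    """D4RL-style normalization: 0 matches the random policy, 100 the expert."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


def summarize_over_seeds(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds (estimator is an assumption)."""
    return mean(scores), stdev(scores)


if __name__ == "__main__":
    # Hypothetical reference returns and per-seed raw returns, for illustration only.
    random_ref, expert_ref = 0.0, 4000.0
    raw_returns = [3000.0, 3200.0, 3100.0, 3300.0]  # one value per seed (4 seeds)
    scores = [normalized_score(r, random_ref, expert_ref) for r in raw_returns]
    print(scores)  # [75.0, 80.0, 77.5, 82.5]
    avg, spread = summarize_over_seeds(scores)
    print(avg)     # 78.75
```

Scores are not clipped, so a policy that beats the expert reference scores above 100 and one that does worse than the random reference scores below 0.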

Key Results

REAG-MV consistently improves performance across different dynamics shifts (gravity, friction, thigh-broken) compared to standard DT and REAG-DARA.
Benchmark	Metric	Baseline	This Paper	Δ
Walker2d-gravity	Normalized Score	46.2	78.4	+32.2
Hopper-friction	Normalized Score	56.4	73.2	+16.8
HalfCheetah-thigh	Normalized Score	38.5	43.1	+4.6
Walker2d-gravity	Normalized Score	65.3	78.4	+13.1

Experiment Figures

Comparison of Normalized Scores for DT, REAG-DARA, and REAG-MV across multiple environments and shift types.

Main Takeaways

REAG-MV is the most robust method, consistently outperforming standard DT and the DARA-based variant across various dynamics mismatches.
Standard Decision Transformers struggle significantly with dynamics shifts, often performing worse than random if not adapted.
The simple mean-variance matching (REAG-MV) is more effective for RCSL than the complex likelihood-ratio based augmentation (REAG-DARA), likely because RCSL relies on the full return distribution rather than just the optimal trajectory likelihood.
Combining REAG with advanced DT architectures like Reinformer and QT further boosts performance, showing the method is architecture-agnostic.

📚 Prerequisite Knowledge

Prerequisites

Offline Reinforcement Learning
Decision Transformer (DT)
Markov Decision Processes (MDP)
Importance Sampling / Domain Adaptation

Key Terms

RCSL: Return-Conditioned Supervised Learning—a paradigm where policies are trained to generate actions conditional on a specified future return (e.g., Decision Transformer)

Off-Dynamics RL: Reinforcement learning where the training environment (source) has different transition dynamics than the deployment environment (target), but the same reward function

DARA: Dynamics-Aware Reward Augmentation—a prior method for dynamic programming RL that modifies rewards to account for dynamics shifts by matching trajectory distributions

DT: Decision Transformer—an offline RL algorithm that models RL as a sequence modeling problem, predicting actions given states and desired returns

REAG: Return Augmented DT—the proposed method that transforms source domain returns to match target domain statistics

Laplace approximation: A technique to approximate a probability distribution with a Gaussian centered at its mode; used here to model return distributions

Sim-to-Real gap: The difference in performance or behavior when transferring a policy from a simulation (source) to the real world (target) due to imperfect modeling