Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement

📝 Paper Summary

Offline Meta-Reinforcement Learning, Transformer-based RL

Meta-DT achieves efficient offline meta-RL generalization by conditioning a decision transformer on robust task representations from a context-aware world model and self-guided prompts that maximize prediction error.

Core Problem

Offline RL agents struggle to generalize to unseen tasks because training data is biased by behavior policies, and existing methods require expensive domain knowledge (like expert demos) at test time.

Why it matters:

Current RL agents do not generalize the way Large Language Models (LLMs) do, owing to distribution shift and the lack of effective self-supervised pretraining.
Relying on expert demonstrations or hindsight statistics at test time is impractical for real-world unseen tasks where such data is unavailable.
Behavior policies in offline datasets are often entangled with task information, causing agents to learn policy-specific biases rather than true task dynamics.

Concrete Example: In 2D navigation, if training tasks always have agents moving directly to goals, an agent might learn 'move straight' (behavior policy feature) rather than 'move to the star' (task goal). At test time, if the behavior shifts or the goal location is new, the agent fails to extrapolate because it memorized the behavior policy instead of the task dynamics.

Key Novelty

Meta-DT (Meta Decision Transformer)

Uses a 'Context-Aware World Model' to disentangle task dynamics from behavior policies, learning a compact task representation invariant to the data collection policy.
Injects this task representation into a Causal Transformer to guide generation, enabling the model to distinguish between different environments.
Constructs a 'Self-Guided Prompt' by selecting past trajectory segments with the highest prediction error, intentionally feeding the model the most informative/surprising context to refine its task belief.

Architecture

Overview of the Meta-DT framework, illustrating the two-stage process: World Model Pretraining and Meta-DT Training/Inference.

Evaluation Highlights

Achieves superior zero-shot and few-shot generalization on MuJoCo and Meta-World benchmarks compared to strong baselines like Prompt-DT and MACAW.
Outperforms Prompt-DT by significant margins in sparse-reward settings where context is scarce, without requiring expert demonstrations.
Demonstrates robustness to data quality, maintaining high performance even when trained on medium or mixed-quality datasets where other methods degrade.

Breakthrough Assessment

8/10

Significantly advances offline meta-RL by removing the need for expert demos at test time while improving generalization. The 'error-maximizing' prompt selection is a clever, counter-intuitive mechanism for active task identification.

⚙️ Technical Details

Problem Definition

Setting: Offline Meta-RL where tasks follow a distribution P(M), sharing state/action spaces but differing in reward/transition functions.

Inputs: Offline datasets D_i from N training tasks; at test time, a small context dataset D (few-shot) or direct interaction (zero-shot).

Outputs: A meta-policy π_meta that maximizes expected return on unseen test tasks.
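
The meta-learning objective stated above can be written compactly as follows. This is a sketch only: the discount factor \gamma, the horizon T, and the trajectory notation \tau are assumed here, since the summary does not specify them.

```latex
\pi_{\text{meta}}^{*}
  = \arg\max_{\pi} \;
    \mathbb{E}_{M \sim P(M)}
    \left[
      \mathbb{E}_{\tau \sim (\pi, M)}
      \left[ \sum_{t=0}^{T} \gamma^{t} r_t \right]
    \right]
```

The outer expectation is over tasks drawn from P(M), and the inner one is over trajectories produced by running the policy in a sampled task M.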

Pipeline Flow

Context-Aware World Model Pretraining
Task Representation Injection
Self-Guided Prompt Construction
Causal Transformer Inference

System Modules

Context Encoder (World Model)

Abstracts recent experience into a compact latent task representation z

Model or implementation: MLP-based encoder

World Model Decoder (World Model)

Predicts rewards and next states to ensure z captures task dynamics

Model or implementation: MLP-based reward and transition predictors

Prompt Selector

Selects the trajectory segment with the highest world model prediction error to serve as context

Model or implementation: Heuristic selection based on pre-trained World Model error
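
The selection rule used by the Prompt Selector can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the transition tuple layout, the `predict` callable standing in for the pretrained world model (assumed to already be conditioned on the task representation z), and the names `transition_error` and `select_prompt` are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# One transition: (state, action, reward, next_state); states/actions are float lists.
Transition = Tuple[List[float], List[float], float, List[float]]
# Hypothetical stand-in for the pretrained world model:
# (state, action) -> (predicted_reward, predicted_next_state).
Predictor = Callable[[List[float], List[float]], Tuple[float, List[float]]]


def transition_error(t: Transition, predict: Predictor) -> float:
    """Squared reward error plus squared next-state error for one transition."""
    state, action, reward, next_state = t
    pred_reward, pred_next = predict(state, action)
    state_err = sum((p - q) ** 2 for p, q in zip(pred_next, next_state))
    return (pred_reward - reward) ** 2 + state_err


def select_prompt(
    trajectory: Sequence[Transition], predict: Predictor, k: int
) -> List[Transition]:
    """Return the contiguous length-k segment with the largest summed error."""
    if not 0 < k <= len(trajectory):
        raise ValueError("k must be between 1 and len(trajectory)")
    errors = [transition_error(t, predict) for t in trajectory]
    # Score every window of k consecutive transitions and keep the worst-predicted one.
    best_start = max(
        range(len(errors) - k + 1), key=lambda i: sum(errors[i:i + k])
    )
    return list(trajectory[best_start:best_start + k])
```

Scoring whole windows rather than single transitions keeps the prompt a contiguous trajectory segment, which is what a sequence model expects as context. Ties are resolved in favor of the earliest window.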

Causal Transformer

Generates actions autoregressively conditioned on task representation and history

Model or implementation: GPT-2 style transformer

Novel Architectural Elements

Integration of a frozen, pre-trained context-aware world model directly into the Decision Transformer's context stream.
Self-guided prompt mechanism that selects context based on 'maximum prediction error' rather than random sampling or expert selection.

Modeling

Base Model: GPT-2 (3 layers, 4 attention heads, 128 embedding dim)

Training Method: Two-stage training: (1) World Model pretraining via supervised regression, (2) Meta-DT training via sequence modeling objective.

Objective Functions:

Purpose: Train world model to capture dynamics.

Formally: Minimize MSE of reward and next-state predictions conditioned on latent z.

Purpose: Train transformer policy.

Formally: Maximize the log-likelihood of actions in the dataset (implemented as an MSE loss between predicted and dataset actions for continuous action spaces).
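
In symbols, the two objectives might be written as follows. The notation is assumed for illustration and does not come from the summary: E_\phi is the context encoder, c_t the recent-experience context, R_\theta and T_\theta the reward and transition predictors, and \pi_\psi the transformer policy acting on history h_t.

```latex
\begin{aligned}
z_t &= E_{\phi}(c_t), \\
\mathcal{L}_{\text{WM}}(\phi, \theta)
  &= \mathbb{E}\left[
       \left( R_{\theta}(s_t, a_t, z_t) - r_t \right)^{2}
       + \left\| T_{\theta}(s_t, a_t, z_t) - s_{t+1} \right\|_{2}^{2}
     \right], \\
\mathcal{L}_{\text{DT}}(\psi)
  &= \mathbb{E}\left[
       \left\| \pi_{\psi}(h_t, z_t) - a_t \right\|_{2}^{2}
     \right].
\end{aligned}
```

The action loss is the MSE form of the likelihood objective: with a Gaussian policy of fixed variance, maximizing the log-likelihood of the dataset actions is equivalent to minimizing this squared error.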

Training Data:

MuJoCo (HalfCheetah-Vel, Ant-Dir, etc.)
Meta-World (ML1, ML10, ML45)

Key Hyperparameters:

learning_rate: 1e-4
batch_size: 256
context_length_K: 20
embedding_dimension: 128
activation: ReLU
dropout: 0.1

Compute: Not reported in the paper

Comparison to Prior Work

vs. Prompt-DT: Meta-DT does not require expert demonstrations at test time; uses self-generated history selected by error maximization.
vs. MACAW/CORRO: Meta-DT uses a transformer-based sequence modeling approach rather than traditional value-based RL or contrastive learning.
vs. Generalized DT [cited]: Meta-DT uses a learned world model for context rather than statistical hindsight information.

Limitations

Computational cost of pretraining the world model adds overhead compared to model-free approaches.
Reliance on the assumption that world model prediction error correlates with informative task features.
Evaluated primarily on continuous control tasks (MuJoCo, Meta-World); applicability to discrete or image-based domains not fully explored in main results.

Reproducibility

Code: https://github.com/NJU-RL/Meta-DT

Hyperparameters and dataset details are provided in the paper and appendix.

📊 Experiments & Results

Evaluation Setup

Offline meta-training on multitask datasets, followed by few-shot or zero-shot evaluation on unseen tasks.

Benchmarks:

MuJoCo (continuous control, locomotion)
Meta-World (robotic manipulation, multi-task)

Metrics:

Average Return (normalized)
Success Rate (for Meta-World)
Statistical methodology: Means and standard deviations reported over 4 random seeds.

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MuJoCo (Ant-Dir)	Average Return	443.3	596.2	+152.9
MuJoCo (Cheetah-Vel)	Average Return	-65.3	-60.2	+5.1
Meta-World (ML45)	Success Rate	0.45	0.78	+0.33
MuJoCo (Ant-Dir)	Average Return	136.9	304.5	+167.6

Experiment Figures

Comparison of prompt selection strategies (Random vs. Recent vs. High-Reward vs. Complementary).

Main Takeaways

Meta-DT consistently outperforms Prompt-DT and other baselines across various data qualities (Expert, Medium, Replay), indicating superior robustness.
The self-guided prompt mechanism allows effective adaptation without external expert demonstrations, making it more practical for real-world scenarios where test-time experts are unavailable.
Ablation studies confirm that both the context-aware world model and the complementary (maximum-prediction-error) prompt selection strategy are critical to performance; removing either leads to significant drops.
The method scales well to hard multi-task settings like Meta-World ML45, showing strong potential for generalist agent learning.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDPs, returns, policies)
Transformer architecture (attention, sequence modeling)
Meta-Learning concepts (task distributions, few-shot adaptation)
World Models (dynamics learning)

Key Terms

World Model: A model that learns the environment's dynamics (transition and reward functions), allowing the agent to simulate or understand the environment without interacting with it.

Offline RL: Learning optimal policies from static datasets of previously collected experience without interacting with the environment during training.

Meta-RL: Meta-Reinforcement Learning—learning a learning algorithm or policy that can quickly adapt to new, unseen tasks.

Decision Transformer (DT): An approach that casts RL as a sequence modeling problem, using transformers to predict actions given states and desired returns.

Inductive Bias: Assumptions built into a learning algorithm that help it generalize to new data (e.g., using prediction error to identify informative trajectory segments).

Disentanglement: Separating different factors of variation in the data—here, separating task-specific dynamics (environment) from behavior-specific features (policy).

Causal Transformer: A transformer model that attends only to past and current tokens (masking future tokens) to respect temporal causality.

Zero-shot Generalization: The ability to perform a new task without any task-specific examples or fine-tuning.

Few-shot Generalization: The ability to adapt to a new task given only a small number of examples (context).
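
Two of the terms above, the desired return that a Decision Transformer conditions on and the masking used by a Causal Transformer, are easy to make concrete. The sketch below is illustrative only: the function names are hypothetical, the return-to-go is computed without discounting (the usual Decision Transformer convention), and the mask is shown as a plain boolean matrix rather than in any particular library's format.

```python
from typing import List


def returns_to_go(rewards: List[float]) -> List[float]:
    """Undiscounted return-to-go: rtg[t] is the sum of rewards from step t onward."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards so each entry is a suffix sum.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]
```

For rewards [1.0, 0.0, 2.0] the returns-to-go are [3.0, 2.0, 2.0], and `causal_mask(3)` is lower-triangular, so the token at position 0 sees only itself while the token at position 2 sees all three positions.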