Offline Reinforcement Learning from Datasets with Structured Non-Stationarity

📝 Paper Summary

Offline Reinforcement Learning, Non-stationary Environments, Representation Learning for RL

COSPA handles slowly evolving non-stationarity in offline RL by using contrastive predictive coding to infer hidden environment parameters from trajectory histories and conditioning the policy on predictions.

Core Problem

Standard Offline RL fails when deployed in environments with non-stationary transition or reward functions caused by factors like wear-and-tear over long data collection periods.

Why it matters:

Real-world robots suffer from physical degradation (wear and tear) over time, invalidating the stationarity assumption of standard RL
Existing Bayes-Adaptive methods (like BOReL) rely on techniques like reward relabeling or simulators that are often unavailable in offline settings
Generative approaches (VAEs) struggle to model complex high-dimensional transition shifts compared to discriminative approaches

Concrete Example: Consider a robot collecting data over months where joint friction gradually increases. A standard offline policy treats all data as coming from one physics model, leading to failure when deployed. COSPA infers the specific friction level from recent trajectories to adjust the policy.

Key Novelty

Contrastive Predictive Non-Stationarity Adaptation (COSPA)

Treats the problem as a Dynamic-Parameter MDP where a hidden parameter (HiP) evolves between episodes but stays fixed within them
Uses Contrastive Predictive Coding (CPC) to learn a discriminative representation of the HiP by distinguishing future trajectories from random ones
Decouples inference and control: learns the HiP representation first, then trains a predictor for evaluation, and finally conditions a TD3+BC policy on the inferred HiP

Architecture

Conceptual diagram of the inference pipeline. It illustrates how trajectories from a deployment are encoded and used to predict future trajectories.

Evaluation Highlights

Outperforms the Oracle (which has access to ground truth parameters) on the high-dimensional Ant-Weight task (3104 vs 2750 return)
Achieves superior performance over Bayes-Adaptive baselines (BOReL, ContraBAR) in complex locomotion tasks like Barkour-Weight (+3.23 reward vs BOReL)
Demonstrates robust generalization in 1D-Goal task, exceeding Oracle performance (-18.59 vs -20.79 return)

Breakthrough Assessment

7/10

Identifies a realistic, under-explored problem setting (structured non-stationarity in offline RL) and provides a solid, pragmatic solution that outperforms relevant baselines, though the method is a combination of existing components (CPC + TD3+BC).

⚙️ Technical Details

Problem Definition

Setting: Offline RL in a Dynamic-Parameter MDP (DP-MDP) where hidden parameter z evolves according to P(z_i | z_{0:i-1}) between episodes.
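As a rough illustration of this setting, the sketch below simulates a hidden parameter that drifts between episodes and stays fixed within each one. The drift model (a noisy upward trend that depends only on the previous value), the toy 1D dynamics, and all function names and constants are assumptions made for illustration; they are not the paper's environments.

```python
import numpy as np

def simulate_hidden_parameter(num_episodes, drift=0.05, noise=0.01, z0=0.0, seed=0):
    """Toy parameter process: z_i is drawn given z_{i-1} and changes only between episodes."""
    rng = np.random.default_rng(seed)
    z = np.empty(num_episodes)
    z[0] = z0
    for i in range(1, num_episodes):
        # Structured evolution: steady drift (e.g. growing friction) plus small noise.
        z[i] = z[i - 1] + drift + noise * rng.standard_normal()
    return z

def rollout_episode(z, horizon=5):
    """Toy 1D dynamics s' = s + a - z with a constant action; z is fixed for the whole episode."""
    s, states = 0.0, []
    for _ in range(horizon):
        a = 1.0
        s = s + a - z
        states.append(s)
    return np.array(states)

z_per_episode = simulate_hidden_parameter(num_episodes=10)
deployment = [rollout_episode(z) for z in z_per_episode]  # one deployment = a sequence of episodes
```

Later episodes in the deployment cover less distance for the same actions, which is the kind of slow shift a policy trained on pooled data would not account for.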

Inputs: A dataset of deployments D = {d_j}, where each deployment contains a sequence of trajectories generated by a behavior policy beta.

Outputs: A policy pi(a|s, tau_{i-Nc:i-1}) conditioned on a context of recent trajectories.
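The context tau_{i-Nc:i-1} can be read as a sliding window over the trajectories that precede episode i within one deployment. A minimal sketch, assuming a deployment is stored as a Python list of per-episode arrays (the helper name `context_window` and the array layout are hypothetical):

```python
import numpy as np

def context_window(deployment, i, n_context):
    """Return trajectories tau_{i-Nc}, ..., tau_{i-1} preceding episode i (fewer near the start)."""
    start = max(0, i - n_context)
    return deployment[start:i]

# Hypothetical deployment: 6 episodes, each a (horizon, feature_dim) array of flattened (s, a, r).
rng = np.random.default_rng(0)
deployment = [rng.standard_normal((20, 4)) for _ in range(6)]

ctx = context_window(deployment, i=5, n_context=3)  # episodes 2, 3 and 4
```

Early episodes have a shorter (or empty) context, so any encoder consuming the window has to tolerate variable length.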

Pipeline Flow

Inference: Context Trajectories -> Encoder -> Latent Representations
Prediction: Latent Sequence -> Predictor RNN -> Predicted Next Latent
Control: State + Predicted Latent -> Policy -> Action
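The three stages above can be sketched end to end with untrained NumPy networks. This only shows how the pieces connect and which shapes flow between them; the layer sizes, the mean-pooling of a trajectory before the encoder MLP, the bias-free GRU, and the tanh action squashing are all assumptions rather than details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, HID, LATENT, STATE, ACTION = 4, 16, 2, 3, 1  # illustrative sizes (assumed)

def mlp_params(sizes):
    """Random (untrained) weights and zero biases for a small MLP."""
    return [(0.1 * rng.standard_normal((m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    for k, (w, b) in enumerate(params):
        x = x @ w + b
        if k < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(p, h, x):
    """One GRU update (biases omitted for brevity)."""
    xh = np.concatenate([x, h])
    u = sigmoid(xh @ p["Wu"])                             # update gate
    r = sigmoid(xh @ p["Wr"])                             # reset gate
    cand = np.tanh(np.concatenate([x, r * h]) @ p["Wc"])  # candidate state
    return (1.0 - u) * h + u * cand

encoder = mlp_params([FEAT, HID, LATENT])           # trajectory encoder
pre_mlp = mlp_params([LATENT, HID, HID])            # predictor: MLP ...
gru = {k: 0.1 * rng.standard_normal((HID + LATENT, LATENT))
       for k in ("Wu", "Wr", "Wc")}                 # ... followed by a GRU
policy = mlp_params([STATE + LATENT, HID, ACTION])

def encode(traj):
    # Inference: pool the (T, FEAT) trajectory over time, then apply the encoder MLP.
    return mlp(encoder, traj.mean(axis=0))

def predict_next_latent(context):
    # Prediction: run the latent sequence through the recurrent predictor.
    h = np.zeros(LATENT)
    for traj in context:
        h = gru_step(gru, h, mlp(pre_mlp, encode(traj)))
    return h

def act(state, z_pred):
    # Control: the policy sees the state concatenated with the predicted latent.
    return np.tanh(mlp(policy, np.concatenate([state, z_pred])))

context = [rng.standard_normal((20, FEAT)) for _ in range(3)]
z_pred = predict_next_latent(context)
action = act(np.zeros(STATE), z_pred)
```

Concatenating the predicted latent to the state is the simplest way to condition a policy, and it leaves the offline RL algorithm itself unchanged.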

System Modules

Trajectory Encoder (Representation Learning)

Encodes a full trajectory into a compact latent representation

Model or implementation: 2-layer MLP with ReLU

Context Aggregator (Training only) (Representation Learning)

Summarizes past trajectory embeddings to predict future ones for CPC training

Model or implementation: Gated Recurrent Unit (GRU)

Next-Step Predictor

Predicts the latent parameter for the upcoming episode based on context

Model or implementation: 2-layer MLP followed by a GRU

Policy Network

Selects actions conditioned on state and predicted latent

Model or implementation: TD3+BC (MLP)

Novel Architectural Elements

Decoupled inference and prediction: learning a discriminative representation via CPC on trajectories first, then training a separate predictor for the next episode's parameter
Application of CPC at the trajectory level (discriminating future trajectories) rather than transition level to capture episode-level non-stationarity

Modeling

Base Model: Custom architecture using MLPs and GRUs

Training Method: Three-stage training: (1) Representation Learning via CPC, (2) Predictor Training via MSE, (3) Policy Training via TD3+BC

Objective Functions:

Purpose: Learn discriminative latent representations of trajectories.

Formally: Minimize L_InfoNCE = -E[log(exp(f(tau+, c)) / sum_{tau'} exp(f(tau', c)))], where tau' ranges over the positive trajectory tau+ and all negative trajectories tau-.

Purpose: Predict the next latent parameter from history.

Formally: Minimize L_pred = E[(f_pred(z_history) - z_target)^2]

Purpose: Train offline policy with behavior regularization.

Formally: TD3+BC loss including E[(pi(s, z) - a)^2]
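For concreteness, here is a minimal NumPy sketch of an InfoNCE loss over precomputed scores. It follows the usual CPC convention in which the softmax denominator covers the positive and all negatives. The bilinear critic f(tau, c) = c^T W z_tau is a common choice in CPC and is assumed here; the function names are hypothetical.

```python
import numpy as np

def info_nce(scores):
    """scores[b, 0] is the positive score f(tau+, c); scores[b, 1:] are negative scores f(tau-, c)."""
    scores = scores - scores.max(axis=1, keepdims=True)  # stabilise the softmax
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_softmax[:, 0].mean()

def bilinear_scores(context, candidates, W):
    """Assumed critic f(tau, c) = c^T W z_tau for every candidate embedding."""
    # context: (B, C), W: (C, Z), candidates: (B, K, Z) -> scores: (B, K)
    return np.einsum("bc,cz,bkz->bk", context, W, candidates)

rng = np.random.default_rng(0)
context = rng.standard_normal((8, 16))        # aggregated context c per sample
candidates = rng.standard_normal((8, 5, 4))   # 1 positive + 4 negatives per sample
W = rng.standard_normal((16, 4))
loss = info_nce(bilinear_scores(context, candidates, W))
```

With uninformative (equal) scores the loss equals log K for K candidates, and it approaches zero as the positive score dominates, which is a handy sanity check during training.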

Training Data:

Datasets generated by a generic behavior policy (TD3 or PPO) trained with domain randomization (hidden parameter changes every episode)

Key Hyperparameters:

latent_dimension: 2, 4, or 8 (normalized)
BC_weight_lambda: Standard TD3+BC parameter, but decreased to allow more deviation from behavior policy
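To make the role of the BC weight concrete, the sketch below computes a TD3+BC-style actor loss in NumPy. The original TD3+BC formulation places its weight on the Q term; here the weight is moved onto the behavior-cloning term (an equivalent rescaling) so that decreasing it permits more deviation from the dataset actions, matching the description above. This parametrization, the name `bc_weight`, and the example numbers are assumptions, not the paper's code.

```python
import numpy as np

def td3_bc_actor_loss(q_values, policy_actions, dataset_actions, bc_weight):
    """Normalised -Q term plus a weighted behaviour-cloning penalty."""
    # TD3+BC normalises by the mean absolute Q-value; in an autodiff
    # framework this scale would be treated as a constant (no gradient).
    q_scale = np.abs(q_values).mean() + 1e-8
    rl_term = -(q_values / q_scale).mean()
    bc_term = ((policy_actions - dataset_actions) ** 2).mean()
    return rl_term + bc_weight * bc_term

q = np.array([2.0, 4.0])             # Q(s, pi(s, z)) for a batch of two states
pi_a = np.array([[0.5], [0.0]])      # actions proposed by the latent-conditioned policy
data_a = np.array([[0.0], [0.0]])    # actions stored in the offline dataset

loss_strict = td3_bc_actor_loss(q, pi_a, data_a, bc_weight=1.0)
loss_loose = td3_bc_actor_loss(q, pi_a, data_a, bc_weight=0.1)
```

The same deviation from the dataset actions costs less under the smaller weight, so the actor is freer to follow the critic.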

Compute: Not reported in the paper

Comparison to Prior Work

vs. BOReL: COSPA does not require reward relabeling or policy replaying, making it applicable when reward functions are unknown or sparse
vs. ContraBAR: COSPA discriminates at the trajectory level, avoiding the collapse to a transition model, and does not need hard negative mining via simulators
vs. VRNN/LILAC: COSPA uses a discriminative loss (InfoNCE) which handles high-dimensional dynamics shifts better than the generative reconstruction loss of VAEs

Limitations

Assumes access to context trajectories from the same deployment during evaluation
Performance degrades if the hidden parameter evolution is purely random (unpredictable) rather than structured
Relies on the assumption that non-stationarity is slow (constant within episodes)

Reproducibility

Code: https://github.com/JohannesAck/OfflineRLStructuredNonstationarity

Hyperparameters for representation learning were optimized via grid search. Datasets were generated using Brax and Barkour simulations.

📊 Experiments & Results

Evaluation Setup

Offline RL evaluation on continuous control tasks with evolving hidden parameters affecting rewards or transitions.

Benchmarks:

1D-Goal / 2D-Goal: Navigation (Reward Shift)
2D-Wind: Navigation (Dynamics Shift)
Ant-Weight / Ant-Leg: Locomotion (Dynamics Shift)
Barkour-Weight: Quadruped Locomotion (Dynamics Shift)

Metrics:

Evaluation Reward
Statistical methodology: Means and 95% confidence intervals across 20 trials.
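The summary does not say how the intervals were computed. One common choice for 20 trials is a Student-t interval around the mean, sketched below; the helper name and the use of a t-interval (rather than, say, a bootstrap) are assumptions.

```python
import numpy as np

def mean_and_ci95(returns, critical_value=2.093):
    """Mean and 95% half-width; 2.093 is the two-sided Student-t value for 19 degrees of freedom."""
    returns = np.asarray(returns, dtype=float)
    sem = returns.std(ddof=1) / np.sqrt(returns.size)  # standard error of the mean
    return returns.mean(), critical_value * sem

# Placeholder returns for 20 trials (not results from the paper).
mean, half_width = mean_and_ci95(np.arange(20))
```

The default critical value is only valid for 20 trials; a different trial count needs the t value for its own degrees of freedom.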

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on high-dimensional locomotion tasks with dynamics shifts.
Ant-Weight	Reward	2750	3104	+354
Barkour-Weight	Reward	14.90	18.13	+3.23
Performance on simple navigation tasks with reward shifts.
1D-Goal	Reward	-20.79	-18.59	+2.20
2D-Goal	Reward	-56.20	-36.39	+19.81

Experiment Figures

T-SNE visualizations of learned representations and linear probe accuracy for HiP recovery.

Ablation study on predictability. Shows reward vs. randomness of the hidden parameter transition.

Main Takeaways

COSPA consistently matches or exceeds Oracle performance, suggesting learned representations may capture task structure better than raw ground-truth parameters in some contexts
Baselines like BOReL and ContraBAR struggle in this specific setting because they rely on techniques (reward relabeling, hard negative mining) not applicable to the strict offline/no-sim constraints
Discriminative representation learning (CPC) proves more effective than generative approaches (VRNN) for high-dimensional dynamics shifts (e.g., Ant-Weight)
The method is robust to noisy ground-truth parameters, sometimes performing better than an Oracle provided with noisy labels

📚 Prerequisite Knowledge

Prerequisites

Offline Reinforcement Learning
Contrastive Learning (InfoNCE loss)
Markov Decision Processes (MDP)
Recurrent Neural Networks (GRU)

Key Terms

DP-MDP: Dynamic-Parameter Markov Decision Process—an MDP where transition/reward functions depend on a hidden parameter that evolves over time

HiP: Hidden Parameter—a latent variable (z) that characterizes the current environment dynamics (e.g., friction, mass)

CPC: Contrastive Predictive Coding—unsupervised learning method that learns representations by predicting future observations in a latent space

InfoNCE: Information Noise Contrastive Estimation—a loss function used in contrastive learning to maximize mutual information between inputs

TD3+BC: Twin Delayed Deep Deterministic Policy Gradient with Behavior Cloning—an offline RL algorithm that constrains the learned policy to stay close to the data-generating policy

HiP-MDP: Hidden-Parameter MDP—an MDP where the hidden parameter is sampled once and stays constant (unlike DP-MDP where it evolves)

BOReL: Bayes Adaptive Offline RL—a baseline method that infers hidden parameters using a Variational Autoencoder

ContraBAR: Contrastive Bayes-Adaptive Deep RL—a baseline method using CPC for belief state inference