Reward-Consistent Dynamics Models are Strongly Generalizable for Offline Reinforcement Learning

📝 Paper Summary

Offline Reinforcement Learning, Model-Based Reinforcement Learning (MBRL)

MOREC improves offline model-based RL by learning a reward function that identifies high-fidelity transitions, using it to filter out erroneous model rollouts during policy learning.

Core Problem

Learned dynamics models in offline RL struggle to generalize to unseen transitions (out-of-distribution errors) because they are trained via supervised learning on limited historical data.

Why it matters:

Model errors in out-of-distribution regions are often exploited by policy optimization, leading to poor real-world performance
Existing conservative methods limit policy exploration or pessimistically penalize the model, which restricts the potential performance gains from model-based approaches
Current methods fail to distinguish between accurate and inaccurate model predictions when generating long-horizon rollouts

Concrete Example: In a refrigerator temperature control task, a dynamics model trained on a specific behavior policy accurately predicts temperatures for that policy but fails when a new test policy visits unseen states. Without MOREC, the model generates erroneous temperature predictions that diverge from reality, causing the policy to fail.

Key Novelty

Model-based Offline Reinforcement learning with Reward Consistency (MOREC)

Conceptualizes the environment dynamics as an agent maximizing a hidden 'dynamics reward' (consistency signal), which can be recovered via Inverse Reinforcement Learning (IRL)
Uses this learned dynamics reward to filter transitions during model rollouts: instead of accepting any predicted next state, the system samples multiple candidates and picks one with probability proportional to exp(dynamics reward / temperature), so candidates with higher dynamics reward are favored
Integrates seamlessly into existing model-based algorithms (like MOPO or MOBILE) by simply replacing the standard rollout mechanism with this reward-consistent filtering

Architecture

The MOREC framework pipeline compared to standard offline MBRL.

Evaluation Highlights

Outperforms prior SOTA methods on 18 out of 21 tasks across D4RL and NeoRL benchmarks
+25.9% relative improvement in average normalized return over the previous SOTA (MOBILE) on the challenging NeoRL benchmark (60.7 to 76.4)
First method to solve 3 tasks in the NeoRL benchmark (normalized score > 95), whereas previous methods solved 0

Breakthrough Assessment

8/10

Significant performance jumps on difficult benchmarks (NeoRL) and a novel conceptual framing of dynamics consistency as an IRL problem make this a strong contribution.

⚙️ Technical Details

Problem Definition

Setting: Offline Model-Based Reinforcement Learning (MBRL) using a static dataset D of transitions

Inputs: Offline dataset D = {(s, a, r_task, s')}

Outputs: Optimized policy π that maximizes expected cumulative task reward

Pipeline Flow

Dynamics Reward Learning (IRL)
Dynamics Model Learning (Supervised)
Policy Optimization with Transition Filtering

System Modules

Dynamics Reward Learner

Learn a discriminator to distinguish true transitions from model-generated ones

Model or implementation: Ensemble of Discriminators

Dynamics Model Ensemble

Predict next states given state-action pairs

Model or implementation: Probabilistic Neural Network Ensemble

Transition Filter (Rollout)

Select high-fidelity transitions during policy rollout

Model or implementation: Softmax selection based on Dynamics Reward

Novel Architectural Elements

Reward-Consistent Transition Filtering: A sampling mechanism that uses an IRL-learned reward to select the most realistic transition from a batch of model predictions
Dynamics Reward function learned via Discriminator Ensemble: Treating the environment dynamics as an expert policy to be imitated
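
The softmax selection described above can be sketched for a single rollout step as follows. This is a minimal illustration, not the authors' implementation: `filter_transition` and `toy_dynamics_reward` are hypothetical names, and the toy reward is a hand-written stand-in for the learned discriminator ensemble.

```python
import numpy as np

def filter_transition(candidates, rewards, kappa=1.0, rng=None):
    """Pick one candidate next state with probability softmax(rewards / kappa)."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(rewards, dtype=float) / kappa
    logits = logits - logits.max()      # subtract the max for numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum()
    idx = rng.choice(len(probs), p=probs)
    return candidates[idx], idx, probs

def toy_dynamics_reward(s, a, s_next):
    # ASSUMPTION: stand-in reward for a toy system whose true dynamics are s' = s + a.
    return np.exp(-np.abs(s_next - (s + a)).sum(axis=-1))

rng = np.random.default_rng(0)
s, a = np.array([1.0]), np.array([0.5])
# Ten noisy candidate next states, as if drawn from a model ensemble.
candidates = (s + a) + rng.normal(0.0, 1.0, size=(10, 1))
rewards = toy_dynamics_reward(s, a, candidates)
chosen, idx, probs = filter_transition(candidates, rewards, kappa=0.1, rng=rng)
```

A small `kappa` concentrates probability on the highest-reward candidate; a large `kappa` approaches uniform sampling over the candidates.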

Modeling

Base Model: Ensemble of probabilistic neural networks (Gaussian MLP)

Training Method: Generative Adversarial Imitation Learning (GAIL) variant for reward learning; SAC for policy optimization

Objective Functions:

Purpose: Learn dynamics reward by distinguishing real data from model data.

Formally: max_D min_P E_{data}[log(D(s,a,s'))] + E_{P}[log(1-D(s,a,s'))]

Purpose: Train dynamics model via supervised learning.

Formally: max_theta E_{data}[log(P_theta(s'|s,a))]

Purpose: Select next state during rollout.

Formally: Sample proportional to exp(r_D(s,a,s')/kappa)
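
The first two objectives can be written as losses to minimize. The sketch below assumes a discriminator that outputs probabilities in (0, 1) and a dynamics model that outputs a diagonal Gaussian (mean and log standard deviation); the function names are illustrative and the summary does not specify the paper's exact parameterization.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-8):
    """Negative of E_data[log D] + E_P[log(1 - D)]; the discriminator minimizes this."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return -(np.log(d_real).mean() + np.log(1.0 - d_fake).mean())

def gaussian_nll(mean, log_std, target):
    """Negative log-likelihood of target under a diagonal Gaussian, averaged over the batch."""
    var = np.exp(2.0 * log_std)
    per_dim = 0.5 * ((target - mean) ** 2 / var + 2.0 * log_std + np.log(2.0 * np.pi))
    return per_dim.sum(axis=-1).mean()

# An uninformative discriminator (D = 0.5 everywhere) versus a sharper one.
chance = discriminator_loss(np.full(8, 0.5), np.full(8, 0.5))
sharp = discriminator_loss(np.full(8, 0.9), np.full(8, 0.1))

# NLL of a perfect unit-variance prediction in one dimension.
zeros = np.zeros((4, 1))
nll = gaussian_nll(zeros, zeros, zeros)
```

Minimizing `gaussian_nll` is the same as maximizing the log-likelihood in the second objective; the sign flip is only a convention for gradient-descent optimizers.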

Training Data:

D4RL datasets (Gym locomotion)
NeoRL datasets

Key Hyperparameters:

ensemble_size_M: 7
rollout_horizon: 100
discriminator_ensemble_size_T: 10
filtering_candidates_N: 10
temperature_kappa: 1.0
min_dynamics_reward_threshold: 0.3
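
To show how these hyperparameters might interact, here is a hedged sketch of a filtered multi-step rollout. `MorecConfig`, `rollout`, and the toy callables are hypothetical. In particular, terminating the rollout when the selected transition's dynamics reward falls below `min_dynamics_reward_threshold` is an assumption about how that threshold is used, since this summary does not define it.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MorecConfig:
    ensemble_size_M: int = 7
    rollout_horizon: int = 100
    discriminator_ensemble_size_T: int = 10
    filtering_candidates_N: int = 10
    temperature_kappa: float = 1.0
    min_dynamics_reward_threshold: float = 0.3

def rollout(s0, policy, sample_candidates, dynamics_reward, cfg, rng):
    """Roll out up to cfg.rollout_horizon steps, filtering every transition."""
    s, trajectory = s0, []
    for _ in range(cfg.rollout_horizon):
        a = policy(s)
        cands = sample_candidates(s, a, cfg.filtering_candidates_N, rng)
        r = dynamics_reward(s, a, cands)
        z = (r - r.max()) / cfg.temperature_kappa
        p = np.exp(z)
        p = p / p.sum()
        i = rng.choice(len(p), p=p)
        # ASSUMPTION: stop early once the selected transition looks unrealistic.
        if r[i] < cfg.min_dynamics_reward_threshold:
            break
        trajectory.append((s, a, cands[i]))
        s = cands[i]
    return trajectory

# Toy stand-ins (not from the paper): true dynamics are s' = s + a.
policy = lambda s: -0.1 * s
sample_candidates = lambda s, a, n, rng: s + a + rng.normal(0.0, 0.05, size=(n,) + s.shape)
dynamics_reward = lambda s, a, c: np.exp(-np.abs(c - (s + a)).sum(axis=-1))

cfg = MorecConfig()
traj = rollout(np.ones(2), policy, sample_candidates, dynamics_reward, cfg, np.random.default_rng(0))
strict = MorecConfig(min_dynamics_reward_threshold=2.0)  # unreachable for this toy reward
empty = rollout(np.ones(2), policy, sample_candidates, dynamics_reward, strict, np.random.default_rng(0))
```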

Compute: Not reported in the paper

Comparison to Prior Work

vs. MOPO/MOBILE: MOREC actively filters transitions using a learned consistency signal rather than just penalizing the reward function
vs. RAMBO: MOREC focuses on selecting high-fidelity transitions (consistency) rather than modifying the model dynamics to be pessimistic
vs. Adversarial Model Learning (e.g. VirtualTaobao): MOREC uses an ensemble of discriminators and a transition filtering step, combining supervised and adversarial signals, whereas previous adversarial methods used single discriminators and purely generative approaches

Limitations

Performance depends on the quality of the learned dynamics reward (IRL quality)
Computationally more expensive during rollout due to sampling multiple candidates and evaluating dynamics reward
Requires tuning of filtering hyperparameters (temperature, candidate count)

Reproducibility

Code is not provided in the paper. Detailed pseudocode is provided in Algorithm 1 and Algorithm 2. Hyperparameters are listed in Appendix D.

📊 Experiments & Results

Evaluation Setup

Offline RL benchmarks measuring normalized average return

Benchmarks:

D4RL (MuJoCo locomotion control)
NeoRL (Locomotion control with narrower data distributions)

Metrics:

Normalized Average Return
Success Rate (Solved Tasks)
Statistical methodology: Mean and standard deviation reported over 5 seeds
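
As a rough illustration of the reporting protocol, the sketch below applies the D4RL-style normalization (0 for a random policy, 100 for an expert policy) and aggregates over five seeds. The returns and reference scores are made-up placeholders, not numbers from the paper, and it is an assumption that NeoRL scores are normalized the same way.

```python
import numpy as np

def normalized_score(raw_return, random_return, expert_return):
    """D4RL-style normalization: 0 = random policy, 100 = expert policy."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)

# Hypothetical raw returns from 5 seeds and hypothetical reference returns.
seed_returns = np.array([3100.0, 3250.0, 2980.0, 3300.0, 3120.0])
scores = normalized_score(seed_returns, random_return=0.0, expert_return=3500.0)
mean_score, std_score = scores.mean(), scores.std()
```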

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparison on D4RL benchmark showing MOREC integrated with MOBILE (MOREC-MOBILE) vs. standard MOBILE.
D4RL	Normalized Average Return	80.0	82.8	+2.8
D4RL	Solved tasks (score >= 95)	5	6	+1
Comparison on the harder NeoRL benchmark, where MOREC shows larger gains.
NeoRL	Normalized Average Return	60.7	76.4	+15.7
NeoRL	Solved tasks (score >= 95)	0	3	+3
Ablation on transition filtering.
Synthetic Refrigerator Task	Return	-640	-78	+562
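
The Δ column reports absolute differences in score points. A quick check of the corresponding relative gains, using only the averages in the table above:

```python
# Average normalized returns taken from the table (MOBILE vs. MOREC-MOBILE).
results = {"D4RL": (80.0, 82.8), "NeoRL": (60.7, 76.4)}

relative_gain = {}
for name, (baseline, morec) in results.items():
    absolute = morec - baseline
    relative_gain[name] = 100.0 * absolute / baseline
    print(f"{name}: +{absolute:.1f} points, +{relative_gain[name]:.1f}% relative")
```

The NeoRL gain of 15.7 points corresponds to roughly 25.9% relative to the MOBILE baseline, and the D4RL gain of 2.8 points to 3.5%.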

Experiment Figures

Visualization of dynamics, model error, and learned dynamics reward on a synthetic refrigerator task.

MAE of model rollouts with vs. without transition filtering over rollout steps.

Main Takeaways

The learned dynamics reward is strongly negatively correlated with model prediction error (MAE), even in OOD regions: transitions with higher dynamics reward tend to have lower error.
Transition filtering significantly reduces compounding errors in long-horizon rollouts (up to 100 steps).
MOREC is particularly effective on challenging datasets (NeoRL) where data distribution is narrow, enabling the system to recover distant unseen transitions.
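
The first takeaway rests on two measurements: per-transition MAE and its correlation with the dynamics reward. The sketch below shows only how such a measurement could be computed. All data is synthetic, and `toy_reward` is built directly from the error, so the negative correlation here holds by construction and is not evidence for the paper's finding.

```python
import numpy as np

rng = np.random.default_rng(0)
true_next = rng.normal(size=(200, 3))
# Each synthetic transition gets its own error level.
noise_scale = rng.uniform(0.0, 2.0, size=(200, 1))
pred_next = true_next + noise_scale * rng.normal(size=(200, 3))

# Mean absolute error per transition, averaged over state dimensions.
mae = np.abs(pred_next - true_next).mean(axis=-1)

# ASSUMPTION: toy reward that decreases with error, standing in for the learned one.
toy_reward = np.exp(-mae)

# Pearson correlation between reward and error.
corr = np.corrcoef(toy_reward, mae)[0, 1]
```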

📚 Prerequisite Knowledge

Prerequisites

Markov Decision Processes (MDP)
Model-Based Reinforcement Learning
Inverse Reinforcement Learning (IRL)
Generative Adversarial Networks (GANs)

Key Terms

Dynamics Reward: A learned signal indicating how consistent a transition (state, action, next_state) is with the true environment dynamics; high reward = high fidelity

Transition Filtering: The process of sampling multiple potential next states from a model ensemble and selecting one according to the dynamics reward (softmax sampling that favors high-reward candidates)

MOPO: Model-based Offline Policy Optimization—a baseline algorithm that penalizes rewards based on model uncertainty

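MOBILE: Model-Bellman Inconsistency Penalized Offline Policy Optimization, a SOTA baseline algorithm that improves upon MOPO by penalizing value estimates according to the inconsistency of Bellman estimates across the model ensemble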

IRL: Inverse Reinforcement Learning—inferring a reward function from observed behavior (here, inferring the 'dynamics reward' from observed transitions)

OOD: Out-of-Distribution—states or actions not well-represented in the training dataset

MAE: Mean Absolute Error—a metric used to measure the difference between predicted and actual values