Reward-Free Curricula for Training Robust World Models

📝 Paper Summary

Unsupervised Environment Design (UED), Model-Based Reinforcement Learning, Robustness

WAKER trains robust world models in reward-free settings by actively sampling environments where the model's latent dynamics ensemble exhibits the highest disagreement.

Core Problem

Training generalist agents requires robustness across diverse environments, but existing curriculum learning methods (UED) depend on task-specific rewards, which are unavailable during reward-free pre-training.

Why it matters:

Agents must adapt to new tasks and physical dynamics without retraining from scratch, but standard Domain Randomization samples every variation uniformly and is therefore inefficient when some variations are much harder to learn than others.
Current Unsupervised Environment Design (UED) methods cannot operate in the reward-free exploration phase, limiting the development of truly generalist autonomous agents.

Concrete Example: A robot trained via standard randomization might spend equal time on high-friction and low-friction surfaces. However, low-friction 'ice' physics are harder to model. WAKER detects the higher prediction error on 'ice' and automatically prioritizes sampling that environment, whereas standard methods would ignore the model's uncertainty.

Key Novelty

WAKER (Weighted Acquisition of Knowledge across Environments for Robustness)

Connects the robust optimization objective of 'minimax regret' to minimizing the maximum world model error across environment instances.
Biases environment sampling towards parameter settings where an ensemble of latent dynamics models has the highest disagreement (proxy for error), without needing external rewards.

Architecture

The WAKER training loop and environment selection process.

Evaluation Highlights

Outperforms Domain Randomization and intrinsically-motivated baselines on distorted continuous control tasks.
Demonstrates improved generalization to out-of-distribution (OOD) environments compared to random sampling.
Successfully trains a single generalist policy that is robust across variations in environmental parameters (e.g., physics constants).

Breakthrough Assessment

7/10

Novel extension of Unsupervised Environment Design to the reward-free setting by theoretically linking minimax regret to model error. Addresses a significant gap in training generalist agents.

⚙️ Technical Details

Problem Definition

Setting: Reward-free Underspecified Partially Observable Markov Decision Process (UPOMDP)

Inputs: Observations o and actions a from an environment with specific parameters theta

Outputs: Latent state z and predicted future latent states

Pipeline Flow

Environment Selection: Error History -> Parameter Selection
Data Collection: Exploration Policy -> Trajectory
Model Learning: Trajectory -> World Model Update
Error Estimation: Imagined Rollouts -> Error Update
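
The four stages above can be sketched as a single loop. This is a toy illustration under stated assumptions, not the paper's implementation: the world model, exploration policy, and imagined rollouts are replaced by stubs, and every name here (`select_env`, `learn_rate`, `run_curriculum`, the two-environment setup) is hypothetical.

```python
import math
import random

def select_env(error_history, temperature, rng):
    """Stage 1: sample an environment index, favouring high estimated error."""
    weights = [math.exp(e / temperature) for e in error_history]
    return rng.choices(range(len(error_history)), weights=weights, k=1)[0]

def collect_trajectory(env_id, rng, length=10):
    """Stage 2 (stub): stand-in for rolling out the exploration policy."""
    return [(env_id, rng.random()) for _ in range(length)]

def update_world_model(model_error, trajectory, learn_rate):
    """Stage 3 (stub): each visit shrinks the model error for that environment."""
    env_id = trajectory[0][0]
    model_error[env_id] *= 1.0 - learn_rate[env_id]

def estimate_error(model_error, env_id):
    """Stage 4 (stub): stand-in for ensemble disagreement on imagined rollouts."""
    return model_error[env_id]

def run_curriculum(num_episodes=300, temperature=0.1, seed=0):
    rng = random.Random(seed)
    # Assumption: env 0 is easy to model, env 1 ("ice") is hard to model.
    learn_rate = [0.10, 0.02]
    model_error = [1.0, 1.0]
    error_history = [1.0, 1.0]
    visits = [0, 0]
    for _ in range(num_episodes):
        env_id = select_env(error_history, temperature, rng)
        trajectory = collect_trajectory(env_id, rng)
        update_world_model(model_error, trajectory, learn_rate)
        error_history[env_id] = estimate_error(model_error, env_id)
        visits[env_id] += 1
    return visits, model_error

visits, model_error = run_curriculum()
print("visits per environment:", visits)
```

In this toy setup the hard-to-model environment keeps a higher error estimate for longer, so the selector returns to it more often than uniform sampling would.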

System Modules

Environment Selector

Selects environment parameters theta for the next episode

Model or implementation: Boltzmann distribution over Error History
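
A minimal sketch of how a Boltzmann distribution over per-environment error estimates might be computed. The temperature parameter and the way `p_dr` mixes in uniform sampling are assumptions made for illustration; the source names p_DR as a variable but does not give its value or exact usage, and `boltzmann_probs` is a hypothetical helper.

```python
import math

def boltzmann_probs(error_estimates, temperature=1.0, p_dr=0.0):
    """Return sampling probabilities over environments.

    error_estimates: latest error estimate per environment (assumed layout).
    temperature: assumed softmax temperature; lower values are greedier.
    p_dr: assumed probability of falling back to uniform domain randomization.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    n = len(error_estimates)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(error_estimates)
    weights = [math.exp((e - top) / temperature) for e in error_estimates]
    total = sum(weights)
    boltz = [w / total for w in weights]
    uniform = 1.0 / n
    # Mixture of uniform sampling and the error-weighted distribution.
    return [p_dr * uniform + (1.0 - p_dr) * b for b in boltz]

probs = boltzmann_probs([0.1, 0.1, 0.8], temperature=0.5, p_dr=0.2)
print(probs)
```

Setting `p_dr=1.0` recovers plain Domain Randomization, while `p_dr=0.0` with a small temperature concentrates almost all probability on the environment with the highest estimated error.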

World Model

Learns to represent and predict environment dynamics

Model or implementation: DreamerV2 (Representation Model q + Latent Dynamics T)

Dynamics Ensemble

Estimates model uncertainty/error via disagreement

Model or implementation: Ensemble of N latent transition models
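
To make "disagreement" concrete, the sketch below scores an imagined trajectory by the variance of the ensemble members' mean predictions, averaged over timesteps and latent dimensions. The nested-list layout and the choice of variance as the disagreement measure are assumptions; the paper's exact estimator may differ.

```python
from statistics import pvariance

def ensemble_disagreement(predictions):
    """Average variance across ensemble members.

    predictions[m][t][d] is the mean predicted by ensemble member m
    at imagined step t for latent dimension d (assumed layout).
    """
    num_members = len(predictions)
    num_steps = len(predictions[0])
    num_dims = len(predictions[0][0])
    total = 0.0
    for t in range(num_steps):
        for d in range(num_dims):
            # Population variance of the N members' predictions for this entry.
            total += pvariance([predictions[m][t][d] for m in range(num_members)])
    return total / (num_steps * num_dims)

# Two members, one step, two latent dimensions: they disagree only on dimension 0.
example = [
    [[0.0, 1.0]],
    [[2.0, 1.0]],
]
print(ensemble_disagreement(example))  # 0.5
```

When all members predict the same latent means the score is zero, which is why it can only act as a proxy: an ensemble can agree and still be wrong.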

Novel Architectural Elements

Integration of an error-based curriculum loop directly into the reward-free world model training process
Use of ensemble disagreement in latent space specifically to drive environment parameter selection (not just exploration within an episode)

Modeling

Base Model: DreamerV2

Training Method: Supervised learning for World Model; RL for Exploration Policy

Objective Functions:

Purpose: Maximize exploration to find model errors.

Formally: pi_expl maximizes expected error (Total Variation distance) of latent dynamics.

Purpose: Minimize maximum regret across environments.

Formally: Minimize maximum World Model Error (prediction error under exploration policy).

Purpose: Estimate model error for curriculum.

Formally: Disagreement of ensemble means in imagined trajectories.
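
One way to write these three objectives in symbols is given below. The notation is an assumption made for this summary and is not copied from the paper: $T_\theta$ is the true latent dynamics under parameters $\theta$, $\widehat{T}$ the learned dynamics, $d^{\pi}_{\theta}$ the state-action distribution induced by policy $\pi$, and $\mu_i$ the mean prediction of ensemble member $i$.

```latex
\begin{align*}
% World model error in environment theta under policy pi
\mathcal{E}_\theta(\widehat{T}, \pi)
  &= \mathbb{E}_{(z,a) \sim d^{\pi}_{\theta}}
     \Big[ \mathrm{TV}\big( \widehat{T}(\cdot \mid z, a),\; T_\theta(\cdot \mid z, a) \big) \Big] \\
% Exploration policy seeks out model error
\pi_{\text{expl}}
  &\in \arg\max_{\pi} \; \mathcal{E}_\theta(\widehat{T}, \pi) \\
% Minimax regret and its reward-free surrogate
\text{Regret}(\pi, \theta)
  &= V_\theta(\pi^{*}_{\theta}) - V_\theta(\pi),
  \qquad
  \min_{\pi} \max_{\theta \in \Theta} \text{Regret}(\pi, \theta)
  \;\;\leadsto\;\;
  \min_{\widehat{T}} \max_{\theta \in \Theta} \mathcal{E}_\theta(\widehat{T}, \pi_{\text{expl}}) \\
% Practical error estimate from ensemble disagreement
\widehat{\mathcal{E}}_\theta
  &\approx \mathbb{E}_{(z,a)}
     \Big[ \mathrm{Var}_{i = 1, \dots, N} \; \mu_i(z, a) \Big]
\end{align*}
```

The arrow in the third line marks the substitution the summary describes: because rewards are unavailable, the maximum world model error stands in for the maximum regret.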

Key Hyperparameters:

ensemble_size: N (Not explicitly specified in text snippet)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Domain Randomization: WAKER actively biases sampling based on model error rather than sampling uniformly.
vs. Standard UED: WAKER operates in the reward-free setting using model error as a proxy for regret, whereas UED requires task rewards.

Limitations

Relies on Assumption 1: once a reward function is supplied, the optimal policy for that reward can be recovered from the world model.
Requires training an ensemble of dynamics models, which increases computational cost compared to a single model.
The error estimate is a proxy (ensemble disagreement) and may not perfectly reflect true Total Variation error.

Reproducibility

The paper defines the algorithm WAKER and the problem setting formally. It specifies the base architecture (DreamerV2) and the domains (Distorted Pointmass, Pendulum, HalfCheetah). Hyperparameters like ensemble size N and domain randomization probability p_DR are mentioned as variables but specific values are not in the provided text snippet. Code availability is not mentioned.

📊 Experiments & Results

Evaluation Setup

Reward-free pre-training followed by evaluation on downstream tasks with specific reward functions in various environment instances.

Benchmarks:

Distorted Pointmass (Continuous Control (Pixel-based)) [New]
Distorted Pendulum (Continuous Control (Pixel-based)) [New]
Distorted HalfCheetah (Continuous Control (Pixel-based)) [New]

Metrics:

Minimax Regret (approximated)
Downstream Task Performance (Returns)
Robustness to Out-of-Distribution (OOD) environments
Statistical methodology: Not explicitly reported in the paper

Experiment Figures

Conceptual illustration of the World Model representing the UPOMDP.

Main Takeaways

WAKER successfully generates curricula in the reward-free setting, a capability previously limited to reward-aware UED methods.
Biasing environment sampling towards high-error instances leads to more robust world models compared to uniform sampling (Domain Randomization).
The method improves generalization to out-of-distribution environments, suggesting the active curriculum captures underlying dynamics better than random baselines.
Minimizing maximum model error is a theoretically sound proxy for minimizing minimax regret in the reward-free context.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL)
World Models (Model-Based RL)
Unsupervised Environment Design (UED)
Minimax Regret

Key Terms

UED: Unsupervised Environment Design—methods that generate curricula of environments to train robust agents

UPOMDP: Underspecified POMDP—a POMDP with a set of free parameters (like friction or gravity) that vary between episodes

World Model: A predictive model (often RNN-based) that learns environment dynamics in a compressed latent space

Minimax Regret: A robustness objective that seeks a policy minimizing the maximum difference between its performance and the optimal performance across all possible environments

Domain Randomization: Training an agent on a variety of simulated environments with randomized properties to improve generalization

DreamerV2: A model-based RL agent whose world model combines a recurrent neural network with discrete (categorical) latent states

Latent Dynamics: The transition rules governing how the compressed state of the world model evolves over time

WAKER: Weighted Acquisition of Knowledge across Environments for Robustness—the proposed algorithm
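
As a worked example of the Minimax Regret term above, the sketch below picks the policy with the smallest worst-case regret from a table of returns. The policy names, environment names, and numbers are invented for illustration, and the optimal return in each environment is approximated by the best candidate in the table.

```python
def minimax_regret_policy(returns):
    """returns[policy][env] -> expected return.

    Regret in an environment is the gap to the best return achieved there
    by any candidate policy (an approximation of the true optimum).
    """
    envs = list(next(iter(returns.values())).keys())
    best = {env: max(r[env] for r in returns.values()) for env in envs}
    worst_regret = {
        policy: max(best[env] - r[env] for env in envs)
        for policy, r in returns.items()
    }
    chosen = min(worst_regret, key=worst_regret.get)
    return chosen, worst_regret

# Hypothetical returns for two policies in two environments.
returns = {
    "specialist": {"grass": 10, "ice": 2},
    "generalist": {"grass": 8, "ice": 7},
}
chosen, worst_regret = minimax_regret_policy(returns)
print(chosen, worst_regret)  # generalist {'specialist': 5, 'generalist': 2}
```

The specialist is best on grass but loses 5 on ice, while the generalist never loses more than 2 anywhere, so the minimax regret criterion prefers the generalist.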