Evaluation Setup
Agents are pre-trained without rewards, then evaluated on downstream tasks with specific reward functions across varied environment instances.
Benchmarks:
- Distorted Pointmass (pixel-based continuous control) [New]
- Distorted Pendulum (pixel-based continuous control) [New]
- Distorted HalfCheetah (pixel-based continuous control) [New]
Metrics:
- Minimax Regret (approximated)
- Downstream Task Performance (Returns)
- Robustness to Out-of-Distribution (OOD) environments
- Note: statistical methodology is not explicitly reported in the paper
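The approximated minimax-regret metric above can be sketched concretely: for each evaluation environment, regret is the gap between a reference return (e.g. an oracle or the strongest baseline; the reference values and environment names below are illustrative assumptions, not numbers from the paper) and the achieved return, and the metric takes the worst case over environments.

```python
def approx_minimax_regret(achieved, best_known):
    """Approximate minimax regret over environments.

    achieved, best_known: dicts mapping env id -> mean return.
    Regret per env is the shortfall versus the reference return;
    the metric is the worst (largest) such shortfall.
    """
    regrets = {env: best_known[env] - achieved[env] for env in achieved}
    return max(regrets.values())

# Illustrative returns (hypothetical, not from the paper):
achieved = {"pointmass": 820.0, "pendulum": 150.0, "cheetah": 410.0}
best_known = {"pointmass": 900.0, "pendulum": 175.0, "cheetah": 520.0}
print(approx_minimax_regret(achieved, best_known))  # → 110.0 (cheetah gap)
```

In practice the reference return must itself be approximated, which is why the metric is reported as an approximation.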
Main Takeaways
- WAKER successfully generates curricula in the reward-free setting, a capability previously limited to reward-aware UED methods.
- Biasing environment sampling towards high-error instances leads to more robust world models compared to uniform sampling (Domain Randomization).
- The method improves generalization to out-of-distribution environments, suggesting the active curriculum captures underlying dynamics better than random baselines.
- Minimizing maximum model error is a theoretically sound proxy for minimizing minimax regret in the reward-free context.
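The error-biased sampling idea in the takeaways can be sketched as follows. This is a minimal illustration, assuming per-environment world-model error estimates (e.g. from ensemble disagreement) are already available; the Boltzmann weighting and names are assumptions for illustration, not the paper's exact sampling rule.

```python
import math
import random

def sample_environment(error_estimates, temperature=1.0):
    """Sample an env id with probability proportional to exp(error / temperature).

    High-error environments are sampled more often, biasing data collection
    toward instances where the world model is currently weakest.
    """
    envs = list(error_estimates)
    weights = [math.exp(error_estimates[e] / temperature) for e in envs]
    return random.choices(envs, weights=weights, k=1)[0]

# Uniform sampling (domain randomization) is recovered as temperature -> infinity.
random.seed(0)
errors = {"pointmass": 0.2, "pendulum": 0.1, "cheetah": 1.5}  # hypothetical estimates
counts = {e: 0 for e in errors}
for _ in range(1000):
    counts[sample_environment(errors)] += 1
# The high-error environment ("cheetah") dominates the sample counts.
```

The contrast with uniform sampling is the whole point: under the weighting above, training data concentrates on the environments the model predicts worst, which is the mechanism behind the robustness gains reported over domain randomization.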