DreamSAC: Learning Hamiltonian World Models via Symmetry Exploration

📝 Paper Summary

Model-Based Reinforcement Learning, Physics-Informed World Models, Unsupervised Exploration

DreamSAC enables world models to extrapolate to new physical scenarios by actively exploring to discover symmetry-based conservation laws and enforcing these invariances via a Hamiltonian dynamics prior.

Core Problem

Standard world models learn statistical pixel correlations rather than underlying physical laws, causing them to fail when extrapolating to novel viewpoints or physical parameters (e.g., changed gravity or friction).

Why it matters:

Agents in open worlds face scenarios with dynamics different from training data (e.g., handling objects with unknown mass)
Passive learning from visual data often captures spurious correlations rather than causal physical rules like conservation of energy
Existing physics-structured models (like HNNs) struggle to learn directly from high-dimensional pixels in end-to-end RL settings

Concrete Example: A world model trained under standard gravity might learn that 'objects fall at rate X'. If gravity increases to 1.5x, the model fails to predict the faster fall because it memorized the rate rather than the dynamics that produce it. DreamSAC instead detects the parameter shift as a rise in Hamiltonian (energy-conservation) error and adapts.
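A toy sketch of why an energy residual exposes a gravity shift. It assumes a point mass with a hand-written H(q, p) = p^2/(2m) + m*g*q fit at the training gravity. It is not the paper's detector, which uses a learned Hamiltonian over latent slots, and all function names are hypothetical. Along an exact trajectory at the training gravity, H stays constant. Under 1.5x gravity, the same stale H drifts from step to step.

```python
G_TRAIN = 9.81  # gravity seen during training (m/s^2)
MASS = 1.0

def hamiltonian(q, p, g=G_TRAIN, m=MASS):
    """Energy of a point mass at height q with momentum p: kinetic + potential."""
    return p * p / (2.0 * m) + m * g * q

def exact_fall(q0, g, m=MASS, dt=0.05, steps=20):
    """Exact free-fall trajectory (q, p) starting at rest from height q0."""
    traj = []
    for k in range(steps + 1):
        t = k * dt
        traj.append((q0 - 0.5 * g * t * t, -m * g * t))
    return traj

def energy_residual(traj, g_model=G_TRAIN):
    """Mean |H(z_{t+1}) - H(z_t)| under the gravity the model assumes."""
    diffs = [abs(hamiltonian(*b, g=g_model) - hamiltonian(*a, g=g_model))
             for a, b in zip(traj, traj[1:])]
    return sum(diffs) / len(diffs)

in_dist = energy_residual(exact_fall(10.0, g=G_TRAIN))
shifted = energy_residual(exact_fall(10.0, g=1.5 * G_TRAIN))
print(f"in-distribution residual: {in_dist:.2e}")
print(f"1.5x gravity residual:    {shifted:.2e}")
```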

Key Novelty

Active Symmetry Discovery via Hamiltonian Dynamics

**Symmetry Exploration:** An intrinsic motivation strategy that rewards the agent for 'doing work' (changing the system's energy), effectively probing the environment to identify where conservation laws hold and where they break.
**Hamiltonian World Model:** Replaces the standard recurrent dynamics with a physics-constrained Hamiltonian prior. It uses a contrastive loss to strip viewpoint details from the latent state, ensuring the learned dynamics depend only on physical invariants.

Architecture

Figure: The DreamSAC pipeline, split into Symmetry Exploration (left) and world-model learning (right).

Evaluation Highlights

163% higher mean reward than DreamerV3 on Cheetah-run with unseen gravity (1.5x): 499.91 vs. 189.76, demonstrating robust parametric extrapolation.
Reduces image prediction error (MSE) by ~3.7x compared to DreamerV3 on Acrobot (H=16; 0.7723 → 0.2064), indicating stronger long-horizon dynamics modeling.
Achieves ~97% success on FetchReach with unseen goals (967.64 vs. 919.73 in the results table, which appears to use a 0-1000 scale), outperforming the baseline (~92%) by generalizing better to structural changes.

Breakthrough Assessment

8/10

Significantly advances physical grounding in MBRL by combining active exploration with structured Hamiltonian learning from pixels, addressing major extrapolation failure modes of standard world models.

⚙️ Technical Details

Problem Definition

Setting: Unsupervised Reinforcement Learning followed by downstream task adaptation (In-Distribution and Out-of-Distribution)

Inputs: High-dimensional pixel observations x_t

Outputs: Actions a_t and predicted future observations/rewards

Pipeline Flow

Observation x_t → Object Encoder (SAVi) → Latent State Z_t
Latent State Z_t → Hamiltonian Dynamics (Lie Transformer + Integrator) → Next Latent State Z_{t+1}
Latent State Z_{t+1} → Decoder/Policy → Reconstruction/Action

System Modules

Object Encoder

Maps pixels to object-centric latent slots structured as position and momentum

Model or implementation: SAVi (Slot Attention)

Hamiltonian Dynamics Prior

Predicts future states by integrating physics equations derived from the learned Hamiltonian

Model or implementation: Lie Transformer (G-invariant network) + Symplectic Integrator
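The summary does not say which symplectic integrator DreamSAC uses. The sketch below assumes leapfrog (Störmer-Verlet), a standard choice for separable Hamiltonians H(q, p) = T(p) + V(q). A hand-written pendulum Hamiltonian stands in for the learned Lie Transformer, and central finite differences stand in for autodiff gradients. The sketch compares long-rollout energy drift against explicit Euler.

```python
import math

def grad(f, x, eps=1e-6):
    """Central finite-difference derivative of a scalar function (stand-in for autodiff)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

# Hypothetical separable Hamiltonian H(q, p) = T(p) + V(q): a unit pendulum.
def kinetic(p):
    return 0.5 * p * p

def potential(q):
    return 1.0 - math.cos(q)

def energy(q, p):
    return kinetic(p) + potential(q)

def leapfrog_step(q, p, dt):
    """One Stormer-Verlet step: half kick, full drift, half kick."""
    p = p - 0.5 * dt * grad(potential, q)   # dp/dt = -dH/dq
    q = q + dt * grad(kinetic, p)           # dq/dt = +dH/dp
    p = p - 0.5 * dt * grad(potential, q)
    return q, p

def euler_step(q, p, dt):
    """Explicit Euler step (not symplectic), for comparison."""
    return q + dt * grad(kinetic, p), p - dt * grad(potential, q)

def max_energy_drift(step, q0=1.0, p0=0.0, dt=0.1, steps=2000):
    """Largest |H(z_t) - H(z_0)| seen along a rollout."""
    q, p, e0, worst = q0, p0, energy(q0, p0), 0.0
    for _ in range(steps):
        q, p = step(q, p, dt)
        worst = max(worst, abs(energy(q, p) - e0))
    return worst

print("leapfrog drift:", max_energy_drift(leapfrog_step))
print("euler drift:   ", max_energy_drift(euler_step))
```

Leapfrog's energy error stays small and bounded over 2000 steps. Explicit Euler's error keeps growing, and that growth is the failure a symplectic update avoids during long imagined rollouts.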

Exploration Policy

Selects actions to maximize symmetry-aware curiosity (work done on the system)

Model or implementation: Actor-Critic (MLP)

Novel Architectural Elements

Replacement of the standard RSSM transition function with a Symplectic Integrator driven by a learned G-invariant Hamiltonian
Structuring of latent slots explicitly into generalized coordinates (q) and canonical momenta (p)
Integration of a self-supervised contrastive loss within the world model objective to enforce SE(3) invariance
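The following is a hand-written illustration of what a G-invariant Hamiltonian over (q, p) slots means. It assumes 6-dimensional slots whose first three entries are position q and last three are momentum p. The paper's actual slot layout is not given here. The energy depends only on momentum norms and pairwise distances, so rotating and translating the whole scene leaves it unchanged. Only rotations about the z-axis are exercised, which is a subset of SE(3). `split_slot`, `pair_hamiltonian`, and the spring potential are hypothetical stand-ins for what the Lie Transformer learns.

```python
import math

def split_slot(slot):
    """Split a 6-dim slot into position q (first 3) and momentum p (last 3)."""
    return slot[:3], slot[3:]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def norm2(v):
    return sum(x * x for x in v)

def pair_hamiltonian(slots, mass=1.0, k=2.0, rest=1.0):
    """Kinetic energy of each slot + a spring potential on every pairwise distance."""
    qs, ps = zip(*(split_slot(s) for s in slots))
    kinetic = sum(norm2(p) / (2.0 * mass) for p in ps)
    potential = 0.0
    for i in range(len(qs)):
        for j in range(i + 1, len(qs)):
            d = math.sqrt(norm2(sub(qs[i], qs[j])))
            potential += 0.5 * k * (d - rest) ** 2
    return kinetic + potential

def rot_z(v, theta):
    """Rotate a 3-vector about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1], v[2]]

def transform(slot, theta, shift):
    """Apply a rigid motion (rotation about z, then translation) to one slot."""
    q, p = split_slot(slot)
    q_new = [x + t for x, t in zip(rot_z(q, theta), shift)]
    return q_new + rot_z(p, theta)   # momenta rotate but do not translate

slots = [[0.0, 0.0, 0.0, 0.5, -0.2, 0.1],
         [1.5, 0.3, -0.4, -0.1, 0.3, 0.0]]
moved = [transform(s, theta=0.7, shift=[2.0, -1.0, 3.0]) for s in slots]
print(pair_hamiltonian(slots), pair_hamiltonian(moved))
```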

Modeling

Base Model: DreamerV3 backbone

Training Method: Unsupervised Pretraining + Differentiated Fine-tuning

Objective Functions:

Purpose: Reconstruction.

Formally: L_pred = -E_q[log p(x_t | Z_t, h_t)] (negative log-likelihood of the observation, minimized)
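The decoder distribution is not specified here. Assuming a fixed-variance Gaussian decoder, the reconstruction term -log p(x_t | Z_t, h_t) reduces to a scaled squared error plus a constant. `gaussian_nll` is a hypothetical helper operating on flattened toy pixels.

```python
import math

def gaussian_nll(x, x_hat, sigma=1.0):
    """-log N(x | x_hat, sigma^2 I), summed over pixels."""
    n = len(x)
    sq = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return 0.5 * sq / sigma**2 + n * math.log(sigma) + 0.5 * n * math.log(2 * math.pi)

x = [0.2, 0.8, 0.5, 0.1]        # toy "pixels" of x_t
x_hat = [0.25, 0.7, 0.5, 0.0]   # decoder mean predicted from (Z_t, h_t)
print(gaussian_nll(x, x_hat))
```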
Purpose: Dynamics Consistency (KL Divergence).

Formally: L_dyn = KL(q_phi(Z_t | x_t) || p_phi(Z_t | Z_{t-1}, a_{t-1}))
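DreamerV3 itself uses categorical latents with KL balancing and free bits. Because the summary describes continuous (q, p) slots, this sketch instead assumes diagonal Gaussians for both the posterior q_phi(Z_t | x_t) and the prior p_phi(Z_t | Z_{t-1}, a_{t-1}). It computes their KL in closed form. The numbers are toy values.

```python
import math

def kl_diag_gaussian(mu_q, std_q, mu_p, std_p):
    """KL( N(mu_q, diag std_q^2) || N(mu_p, diag std_p^2) ), closed form."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p):
        total += (math.log(sp / sq)
                  + (sq**2 + (mq - mp) ** 2) / (2.0 * sp**2)
                  - 0.5)
    return total

post_mu, post_std = [0.1, -0.3], [0.5, 0.4]     # posterior from the encoder
prior_mu, prior_std = [0.0, 0.0], [1.0, 1.0]    # prior from the dynamics model
print(kl_diag_gaussian(post_mu, post_std, prior_mu, prior_std))
```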
Purpose: Viewpoint Invariance (Contrastive).

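Formally: L_vr = -E[log(exp(sim(Z_A, Z_B)/tau) / Sum_j exp(sim(Z_A, Z_j)/tau))], where Z_A and Z_B encode the same state under two viewpoints (the positive pair) and j ranges over the batch.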
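Here is a minimal InfoNCE computation for one anchor. It assumes cosine similarity for sim(·,·) and a denominator that sums over the positive plus all in-batch negatives, following the SimCLR convention. tau = 0.1 is illustrative only, because the paper's temperature is not reported. The log-sum-exp shift keeps the softmax numerically stable.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce(z_a, candidates, pos_index, tau=0.1):
    """-log softmax over sim(z_a, z_j)/tau, evaluated at the positive z_B."""
    logits = [cosine(z_a, z_j) / tau for z_j in candidates]
    m = max(logits)                                   # log-sum-exp for stability
    log_denom = m + math.log(sum(math.exp(logit - m) for logit in logits))
    return -(logits[pos_index] - log_denom)

z_a = [1.0, 0.2, 0.0]                 # latent of view A
batch = [[0.9, 0.25, 0.05],           # z_B: same state, other viewpoint (positive)
         [-0.5, 1.0, 0.3],            # negatives: other states in the batch
         [0.0, -1.0, 1.0]]
print(info_nce(z_a, batch, pos_index=0))
```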
Purpose: Symmetry Exploration Reward.

Formally: r_sym = |H(Z_{t+1}) - H(Z_t)| - lambda_s ||a_t - a_{t-1}||^2, where H is the learned Hamiltonian and lambda_s weights the action-smoothness penalty.
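Once an energy function is available, the reward can be computed from two consecutive latent states and actions. This sketch uses a hypothetical hand-written H (a unit harmonic oscillator on a single (q, p) pair) in place of the learned one. It also uses an illustrative lambda_s = 0.1, since the paper's value is not reported.

```python
def toy_hamiltonian(z):
    """Hypothetical stand-in for the learned Hamiltonian: unit harmonic oscillator."""
    q, p = z
    return 0.5 * (q * q + p * p)

def symmetry_reward(z_t, z_next, a_t, a_prev, lambda_s=0.1):
    """|H(Z_{t+1}) - H(Z_t)| minus a penalty on the change between consecutive actions."""
    work = abs(toy_hamiltonian(z_next) - toy_hamiltonian(z_t))
    smooth = sum((x - y) ** 2 for x, y in zip(a_t, a_prev))
    return work - lambda_s * smooth

# An action that pumps energy into the system earns a positive reward ...
print(symmetry_reward((1.0, 0.0), (1.0, 0.5), a_t=[0.5], a_prev=[0.4]))
# ... while an energy-conserving transition with a jerky action is penalised.
print(symmetry_reward((1.0, 0.0), (0.0, 1.0), a_t=[1.0], a_prev=[-1.0]))
```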

Training Data:

2M steps unsupervised pretraining
500K steps downstream task adaptation

Key Hyperparameters:

pretrain_steps: 2,000,000
finetune_steps: 500,000
batch_size: Not explicitly reported in the paper
lambda_s (smoothness): Not explicitly reported in the paper
tau (temperature): Not explicitly reported in the paper

Comparison to Prior Work

vs. DreamerV3: DreamSAC enforces physical structure (Hamiltonian) and uses contrastive learning for invariance, whereas DreamerV3 learns purely statistical dynamics.
vs. RND: DreamSAC uses physics-based curiosity (work done) to probe dynamics, whereas RND rewards statistical prediction error.
vs. HNN/LNN [not cited in paper]: DreamSAC learns directly from pixels via a structured encoder, whereas classical HNNs typically assume low-dimensional state inputs.

Limitations

Relies on the assumption that the environment follows Hamiltonian dynamics (conservation laws), which may not hold for all systems (e.g., strongly dissipative systems with friction, or non-physical games).
Requires effective learning of the object-centric encoder (SAVi); if object slots fail to capture entities, the dynamics model fails.
Computationally more intensive than standard RSSM due to the symplectic integrator and Lie Transformer operations.
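The first limitation can be illustrated with a toy oscillator integrated by semi-implicit Euler. Without friction, the energy (q^2 + p^2)/2 stays within a small bounded band. With linear friction, it decays steadily, so a model that assumes conservation is misspecified. The friction coefficient and step size are arbitrary illustrative choices, and `damped_rollout` is a hypothetical name.

```python
def damped_rollout(q=1.0, p=0.0, gamma=0.3, dt=0.01, steps=1000):
    """Semi-implicit Euler for q'' = -q - gamma*q'; returns the energy after each step."""
    energies = [0.5 * (q * q + p * p)]
    for _ in range(steps):
        p = p - dt * (q + gamma * p)   # force: unit spring + linear friction
        q = q + dt * p
        energies.append(0.5 * (q * q + p * p))
    return energies

frictionless = damped_rollout(gamma=0.0)
damped = damped_rollout(gamma=0.3)
print(f"no friction: H(0)={frictionless[0]:.3f}, H(T)={frictionless[-1]:.3f}")
print(f"friction:    H(0)={damped[0]:.3f}, H(T)={damped[-1]:.3f}")
```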

Reproducibility

No public code URL provided in the paper. Implementation details (architectures, hyperparameters) are referenced as being in 'Supp. 7', but the supplement text is not fully contained in the main paper PDF.

📊 Experiments & Results

Evaluation Setup

3D Physics Simulations (DeepMind Control Suite, GymFetch) with OOD perturbations.

Benchmarks:

DeepMind Control Suite (DMCS) (Continuous Control)
GymFetch (Robotic Manipulation)

Metrics:

Mean Squared Error (MSE) of image prediction
Task Reward / Success Rate
Generalization Gap (OOD performance)
Statistical methodology: Mean ± standard deviation over 5 seeds reported.

Key Results

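Predictive accuracy results showing DreamSAC significantly reduces long-term prediction error compared to DreamerV3 (H = prediction horizon in steps).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Acrobot (H=16) | MSE | 0.7723 | 0.2064 | -0.5659 |
| FetchPush (H=8) | MSE | 1.048 | 0.302 | -0.746 |

Out-of-Distribution (OOD) generalization results demonstrating robustness to physical parameter changes and novel viewpoints.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Cheetah-run (Unseen Gravity 1.5x) | Mean Reward | 189.76 | 499.91 | +310.15 |
| Reacher-hard (Unseen View) | Mean Reward | 265.33 | 321.90 | +56.57 |
| FetchReach (Unseen Goal) | Success Rate | 919.73 | 967.64 | +47.91 |
| Walker (1.5x Gravity) | MSE | 4.9673 | 1.0044 | -3.9629 |

The FetchReach success-rate values appear to be reported on a 0-1000 scale (967.64 ≈ 97%).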
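Here is a quick arithmetic check of the Δ column and of the relative figures quoted in Evaluation Highlights, using only numbers from the table. The "improvement factor" is an illustrative summary introduced here, not a metric from the paper. It is baseline/ours for error metrics and ours/baseline for rewards, so values above 1 favour DreamSAC.

```python
# (benchmark, metric, baseline, DreamSAC, lower_is_better), values copied from the table
ROWS = [
    ("Acrobot (H=16)", "MSE", 0.7723, 0.2064, True),
    ("FetchPush (H=8)", "MSE", 1.048, 0.302, True),
    ("Cheetah-run (1.5x gravity)", "Mean Reward", 189.76, 499.91, False),
    ("Reacher-hard (unseen view)", "Mean Reward", 265.33, 321.90, False),
    ("FetchReach (unseen goal)", "Success Rate", 919.73, 967.64, False),
    ("Walker (1.5x gravity)", "MSE", 4.9673, 1.0044, True),
]

def delta(base, ours):
    """The table's delta column: DreamSAC minus baseline."""
    return ours - base

def relative_change(base, ours):
    """Change as a fraction of the baseline (0.5 means +50%)."""
    return (ours - base) / base

def improvement_factor(base, ours, lower_is_better):
    """How many times better DreamSAC is; values above 1 favour DreamSAC."""
    return base / ours if lower_is_better else ours / base

for name, metric, base, ours, lower in ROWS:
    print(f"{name:28s} {metric:12s} "
          f"delta={delta(base, ours):+.4f} "
          f"rel={relative_change(base, ours):+.1%} "
          f"factor={improvement_factor(base, ours, lower):.2f}x")
```

On these numbers, the MSE reductions range from about 3.5x (FetchPush) to about 4.9x (Walker).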

Experiment Figures

Qualitative analysis including t-SNE of latent states, Hamiltonian conservation plots, and reward curves.

Main Takeaways

DreamSAC consistently outperforms DreamerV3 on OOD tasks, especially those involving changes in physical parameters like gravity and friction.
The Hamiltonian prior is critical for parameter generalization; ablating it (w/o H_phi) leads to severe performance drops in altered environments.
The Viewpoint-Robustness Loss (L_vr) is essential for structural generalization (Unseen View); removing it degrades performance significantly.
Symmetry Exploration (active curiosity) gathers more informative data than random or RND exploration, leading to better-converged world models.

📚 Prerequisite Knowledge

Prerequisites

Hamiltonian Mechanics (generalized coordinates and momenta)
Variational Inference (ELBO)
Contrastive Learning (SimCLR)
Model-Based RL (Dreamer architecture)

Key Terms

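Hamiltonian: A function describing the total energy of a system (kinetic + potential for typical mechanical systems). By Noether's theorem, continuous symmetries correspond to conserved quantities; for example, invariance under time translation implies conservation of energy.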

Symplectic Integrator: A numerical integration scheme that preserves the symplectic (phase-space) structure of Hamiltonian flow. Energy is not conserved exactly, but its error stays bounded over long rollouts instead of drifting.

Lie Transformer: A neural network architecture designed to be equivariant to Lie group transformations (e.g., rotation, translation), used here to parameterize the invariant Hamiltonian.

SAVi: Slot Attention for Video—an object-centric encoder that decomposes images into discrete 'slots' representing objects.

ELBO: Evidence Lower Bound—the objective function used to train variational autoencoders and world models, balancing reconstruction accuracy with latent space regularity.

RND: Random Network Distillation, an exploration method that rewards the agent for visiting states where a trained predictor network fails to match the output of a fixed, randomly initialized target network (a proxy for novelty).
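Here is a minimal RND sketch. It assumes linear target and predictor networks trained by plain SGD, whereas RND as published uses deep networks on observations, and all names are hypothetical. The predictor is trained on one repeatedly visited state. Its error, which serves as the intrinsic reward, collapses there but stays high on an unseen state.

```python
import random

random.seed(0)
DIM, OUT = 4, 3

def linear(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

target = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(OUT)]  # frozen
predictor = [[0.0] * DIM for _ in range(OUT)]                            # trained

def novelty(x):
    """RND intrinsic reward: squared error between predictor and frozen target."""
    return sum((a - b) ** 2 for a, b in zip(linear(predictor, x), linear(target, x)))

def train_step(x, lr=0.05):
    """One SGD step on the predictor's squared error for state x."""
    err = [a - b for a, b in zip(linear(predictor, x), linear(target, x))]
    for i in range(OUT):
        for j in range(DIM):
            predictor[i][j] -= lr * 2.0 * err[i] * x[j]

familiar = [1.0, 0.0, 0.5, -0.5]
before = novelty(familiar)
for _ in range(200):
    train_step(familiar)
after = novelty(familiar)
unseen = novelty([-1.0, 2.0, 0.0, 1.0])
print(f"familiar before: {before:.3f}, after: {after:.2e}, unseen: {unseen:.3f}")
```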

RSSM: Recurrent State-Space Model—the probabilistic dynamics model used in Dreamer, combining deterministic recurrent states with stochastic latent variables.