DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

📝 Paper Summary

Model-based Control, World Models, Visual Representation Learning

DINO-WM learns a world model on pre-trained DINOv2 patch features from offline trajectories, enabling zero-shot planning for visual goals without reconstruction or task-specific rewards.

Core Problem

Existing world models often require online interaction, task-specific rewards, or computationally expensive pixel reconstruction, limiting their ability to generalize to new tasks zero-shot from offline data.

Why it matters:

Feed-forward policies require training on all possible scenarios to generalize, which is infeasible
Online world models require retraining for every new task, limiting efficiency
Current offline world models rely on strong auxiliary info like expert demos or dense rewards, reducing generality

Concrete Example: In a maze navigation task, a standard policy trained on specific routes fails when the goal location changes. DINO-WM, trained on random offline trajectories, can plan a path to any visible goal location at test time without retraining.

Key Novelty

Latent Dynamics on Pre-trained Patch Features

Uses frozen DINOv2 patch embeddings as the state space, leveraging their strong spatial and object-centric priors without learning an observation model from scratch
Trains a decoder-only Transformer to predict future patch features autoregressively, conditioned on actions
Performs planning via Model Predictive Control (MPC) in the latent space by optimizing actions to minimize distance to a goal embedding

Architecture

The DINO-WM architecture including the frozen observation model, the causal transition model, and the optional decoder.

Evaluation Highlights

Improves success rate by 45% on average over prior state-of-the-art (IRIS) on the hardest navigation and manipulation tasks
Achieves 56% improvement in visual reconstruction metrics (LPIPS) compared to IRIS, indicating higher fidelity future prediction
Demonstrates zero-shot generalization to new maze layouts and object shapes not seen during specific task training, outperforming baselines that require task-specific learning

Breakthrough Assessment

8/10

Significant step toward decoupling world models from task-specific rewards and online data. Shows that general-purpose visual features (DINOv2) are sufficient for precise physical control via simple latent dynamics.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) with offline dataset of trajectories

Inputs: Current observation image o_0, Goal observation image o_g

Outputs: Sequence of actions a_0, ..., a_T to reach o_g
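
The inputs and outputs above can be read as a trajectory optimization in latent space. The following is a sketch in notation assumed here rather than quoted from the paper: \mathrm{enc} is the frozen observation model, p_\theta is the transition model, and \hat{z} denotes predicted patch embeddings.

```latex
\begin{aligned}
z_0 &= \mathrm{enc}(o_0), \qquad z_g = \mathrm{enc}(o_g), \\
\hat{z}_{t+1} &= p_\theta\!\left(\hat{z}_{\le t},\, a_{\le t}\right), \qquad \hat{z}_0 = z_0, \\
a_{0:T}^{\star} &= \arg\min_{a_{0:T}} \left\lVert \hat{z}_{T+1} - z_g \right\rVert_2^2 .
\end{aligned}
```

Only the final predicted latent is compared with the goal embedding, which is why no reward model is needed.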

Pipeline Flow

Input Processing: Image -> DINOv2 Encoder -> Patch Embeddings
Dynamics Modeling: History of Patch Embeddings + Actions -> Transformer -> Predicted Future Embeddings
Planning (Test Time): Current + Goal Embeddings -> MPC (CEM) -> Optimal Action Sequence
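
The test-time planning step runs as a receding-horizon loop: plan a short action sequence, execute only its first action, then replan. The sketch below is illustrative only. It assumes a hypothetical `toy_dynamics` stand-in (z' = z + a) for the learned transition model, and it uses random shooting instead of CEM to stay short.

```python
import numpy as np

def toy_dynamics(z, a):
    """Hypothetical stand-in for the learned transition model: z' = z + a."""
    return z + a

def rollout_cost(z0, actions, z_goal):
    """Roll the model forward; return squared distance of the final latent to the goal."""
    z = z0
    for a in actions:
        z = toy_dynamics(z, a)
    return float(np.sum((z - z_goal) ** 2))

def plan_random_shooting(z0, z_goal, horizon, n_samples, rng):
    """Sample action sequences and keep the lowest-cost one (random shooting, not CEM)."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_samples, horizon, z0.shape[0]))
    costs = [rollout_cost(z0, seq, z_goal) for seq in candidates]
    return candidates[int(np.argmin(costs))]

def mpc(z0, z_goal, steps=20, horizon=3, n_samples=256, seed=0):
    """Receding-horizon loop: plan, execute only the first action, replan."""
    rng = np.random.default_rng(seed)
    z = z0.copy()
    for _ in range(steps):
        best = plan_random_shooting(z, z_goal, horizon, n_samples, rng)
        z = toy_dynamics(z, best[0])  # in this toy, the environment equals the model
    return z

z_start = np.zeros(2)
z_goal = np.array([3.0, -2.0])
z_final = mpc(z_start, z_goal)
```

In the real system the executed action would go to the environment, and the next observation would be re-encoded before replanning.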

System Modules

Observation Model

Encodes raw RGB images into compact patch embeddings

Model or implementation: Frozen DINOv2 (ViT-Small/14)
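
To make the size of the latent state concrete, the sketch below computes the patch grid for a ViT-S/14 encoder, which has a patch size of 14 and an embedding width of 384. The 224x224 input resolution is an assumption for illustration, not a value taken from the paper.

```python
PATCH_SIZE = 14   # the "/14" in ViT-S/14
EMBED_DIM = 384   # embedding width of ViT-Small

def patch_grid(height, width, patch_size=PATCH_SIZE):
    """Return (rows, cols) of non-overlapping patches for an image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch_size, width // patch_size

# Assumed input resolution (illustrative only).
rows, cols = patch_grid(224, 224)
num_patches = rows * cols
latent_shape = (num_patches, EMBED_DIM)  # one embedding per patch
```

Under this assumption each frame becomes a 256 x 384 array of patch embeddings, which is the state the transition model operates on.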

Transition Model

Predicts future latent states based on history and actions

Model or implementation: Decoder-only Vision Transformer (ViT)

Planner

Optimizes action sequence to reach goal state

Model or implementation: Cross-Entropy Method (CEM)
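
A minimal sketch of the Cross-Entropy Method for action-sequence optimization follows. The sample count, elite count, iteration count, and the toy cost (an additive latent model with a squared-distance goal cost) are all assumptions for illustration, not the paper's settings.

```python
import numpy as np

def cem_plan(cost_fn, horizon, action_dim,
             n_samples=200, n_elite=20, n_iters=10, seed=0):
    """Iteratively refit a diagonal Gaussian over action sequences to the elite samples."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        noise = rng.standard_normal((n_samples, horizon, action_dim))
        samples = mean + std * noise
        costs = np.array([cost_fn(s) for s in samples])
        elite = samples[np.argsort(costs)[:n_elite]]  # lowest-cost sequences
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6  # small floor avoids a collapsed distribution
    return mean

# Toy problem (hypothetical): the final latent is the start plus the sum of actions.
z0 = np.zeros(2)
z_goal = np.array([1.0, -0.5])

def toy_cost(actions):
    z_final = z0 + actions.sum(axis=0)
    return float(np.sum((z_final - z_goal) ** 2))

plan = cem_plan(toy_cost, horizon=4, action_dim=2)
final_cost = toy_cost(plan)
```

With a learned world model, `cost_fn` would roll the transition model forward over the candidate actions and score the final predicted embedding against the goal embedding.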

Novel Architectural Elements

Use of frozen, pre-trained DINOv2 patch features as the sole state representation for world modeling
Frame-level autoregressive prediction for patch embeddings (treating all patches of a frame as a block) rather than token-level prediction
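
Frame-level prediction implies an attention mask that is causal across frames but unrestricted within a frame. The sketch below shows one way to build such a block-causal mask. The function name and the tiny sizes are hypothetical, and the paper's exact masking code may differ.

```python
import numpy as np

def frame_causal_mask(num_frames, patches_per_frame):
    """Boolean mask where mask[i, j] is True if token i may attend to token j.

    Tokens attend to every patch in their own frame and in earlier frames.
    """
    frame_ids = np.repeat(np.arange(num_frames), patches_per_frame)
    return frame_ids[:, None] >= frame_ids[None, :]

# Three frames of two patches each (illustrative sizes).
mask = frame_causal_mask(num_frames=3, patches_per_frame=2)

# For contrast, a token-level causal mask is strictly lower-triangular plus the diagonal.
token_mask = np.tril(np.ones((6, 6), dtype=bool))
```

The two masks differ inside each frame: the frame-level mask lets patch 0 see patch 1 of the same frame, while the token-level mask does not.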

Modeling

Base Model: DINOv2 (ViT-S/14) for encoding; Custom ViT for dynamics

Training Method: Supervised learning on offline trajectories (Teacher Forcing)

Objective Functions:

Purpose: Minimize difference between predicted and actual future patch embeddings.

Formally: MSE(z_t, TransitionModel(z_{<t}, a_{<t}))

Purpose: (Optional/Diagnostic) Reconstruct pixels from latents.

Formally: MSE(o_t, Decoder(z_t))
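
The latent prediction objective with teacher forcing can be sketched as follows. This is a NumPy illustration with a hypothetical `predict_next` callable standing in for the transition model. The copy-last predictor and the synthetic latents are assumptions chosen so the loss is easy to check by hand.

```python
import numpy as np

def teacher_forced_mse(latents, actions, predict_next):
    """Average MSE between predicted and actual next-frame latents.

    latents: (T + 1, N, D) ground-truth patch embeddings per frame.
    actions: (T, A) actions taken between consecutive frames.
    predict_next(z_hist, a_hist) -> (N, D) prediction for the next frame.
    """
    losses = []
    for t in range(1, latents.shape[0]):
        # Teacher forcing: condition on ground-truth history z_{<t}, a_{<t}.
        pred = predict_next(latents[:t], actions[:t])
        losses.append(np.mean((pred - latents[t]) ** 2))
    return float(np.mean(losses))

def copy_last(z_hist, a_hist):
    """Hypothetical baseline predictor: repeat the most recent frame."""
    return z_hist[-1]

# Synthetic latents where every entry of frame t equals t.
latents = np.arange(4)[:, None, None] * np.ones((4, 3, 2))
actions = np.zeros((3, 1))
loss = teacher_forced_mse(latents, actions, copy_last)
```

Because each frame differs from the previous one by exactly 1 in every entry, the copy-last baseline yields a loss of 1.0, which makes it a simple sanity check for the loop.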

Adaptation: Trains Transition Model from scratch; Observation model is frozen

Trainable Parameters: Transition model weights

Training Data:

Offline datasets from 6 environments (Maze, Push, Franka Kitchen, etc.)
Data collected via random policies or scripted policies depending on domain

Key Hyperparameters:

context_length_H: Not explicitly reported in the paper body
optimizer: Adam (implied)
batch_size: Not reported in the paper
learning_rate: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. IRIS: DINO-WM uses continuous pre-trained features instead of learning a discrete codebook; predicts frame-level instead of token-level.
vs. DreamerV3: DINO-WM does not learn the encoder; relies on pre-trained foundation model features.
vs. JEPA [not cited in paper]: JEPA also predicts in latent space but typically uses a context encoder and predictor trained jointly; DINO-WM uses a fixed pre-trained encoder and trains only the predictor.

Limitations

Relies on the quality of DINOv2 features; if features miss task-critical details, the world model fails
No explicit mechanism to handle stochasticity (assumes deterministic or unimodal dynamics via MSE loss)
Planning cost is simple L2 distance in latent space, which may not always correspond to functional distance
Does not model rewards or termination signals, purely visual goal reaching

Reproducibility

Code: https://dino-wm.github.io

Code and models are open-sourced at https://dino-wm.github.io. The paper mentions using DINOv2-ViT-S/14. Specific training hyperparameters (LR, batch size) are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Goal-conditioned control tasks across navigation and manipulation domains

Benchmarks:

Maze Navigation (Navigation)
Push Manipulation (Robotic Manipulation)
Franka Kitchen (Robotic Manipulation)
Deformable Object Manipulation (Robotic Manipulation)

Metrics:

Success Rate (Task completion)
LPIPS (Visual reconstruction quality of world model predictions)
Statistical methodology: Not explicitly reported in the paper

Key Results

DINO-WM consistently outperforms the IRIS baseline on visual reconstruction quality (lower LPIPS is better) across diverse environments.

Benchmark	Metric	Baseline (IRIS)	This Paper	Δ
Push T	LPIPS	0.198	0.088	-0.110
RoboYoga	LPIPS	0.158	0.063	-0.095
Franka Kitchen	LPIPS	0.076	0.040	-0.036

DINO-WM achieves significantly higher success rates in zero-shot planning tasks compared to IRIS.

Benchmark	Metric	Baseline (IRIS)	This Paper	Δ
Push T	Success Rate	0.14	0.45	+0.31
RoboYoga	Success Rate	0.33	0.74	+0.41
Franka Kitchen	Success Rate	0.10	0.38	+0.28
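
The Δ column and the per-task relative changes can be recomputed directly from the reported numbers. The short script below is plain arithmetic on the rows above; the dictionary layout and variable names are illustrative.

```python
# (baseline, this paper) pairs copied from the results rows above.
results = {
    "Push T":         {"lpips": (0.198, 0.088), "success": (0.14, 0.45)},
    "RoboYoga":       {"lpips": (0.158, 0.063), "success": (0.33, 0.74)},
    "Franka Kitchen": {"lpips": (0.076, 0.040), "success": (0.10, 0.38)},
}

def relative_reduction(baseline, ours):
    """Fractional decrease relative to the baseline (for lower-is-better metrics)."""
    return (baseline - ours) / baseline

lpips_reductions = {
    name: relative_reduction(*row["lpips"]) for name, row in results.items()
}
success_gains = {
    name: row["success"][1] - row["success"][0] for name, row in results.items()
}

mean_lpips_reduction = sum(lpips_reductions.values()) / len(lpips_reductions)
mean_success_gain = sum(success_gains.values()) / len(success_gains)
```

Over these three rows the mean relative LPIPS reduction is about 54% and the mean absolute success-rate gain is about 0.33. These averages cover only the rows shown and need not match headline figures that may aggregate over a different set of tasks.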

Experiment Figures

Visualizations of future predictions (hallucinations) by DINO-WM vs IRIS over long horizons.

Main Takeaways

Using pre-trained DINOv2 features is superior to learning observation models from scratch for world modeling on offline data.
Latent space planning with simple MSE distance to goal features is effective for zero-shot control without task-specific rewards.
The method generalizes well to variations (e.g., maze layouts, object shapes) that were not seen during training.
CEM optimization outperforms Gradient Descent for planning in this latent space.

📚 Prerequisite Knowledge

Prerequisites

World Models / Dynamics Models
Model Predictive Control (MPC)
Vision Transformers (ViT)
Self-supervised learning (DINO/DINOv2)

Key Terms

DINOv2: A self-supervised vision transformer model pre-trained on large-scale data, providing robust visual features

MPC: Model Predictive Control, a control method that optimizes a sequence of actions by using a model to predict future outcomes

CEM: Cross-Entropy Method, an optimization algorithm used here to search for the best action sequence by iteratively sampling and refining distributions

LPIPS: Learned Perceptual Image Patch Similarity, a metric used to evaluate how perceptually similar two images are

POMDP: Partially Observable Markov Decision Process, a mathematical framework for modeling decision-making where the agent cannot directly observe the full state of the environment

teacher forcing: A training technique where the model is fed the actual previous ground truth (rather than its own prediction) as input for the next step

ViT: Vision Transformer, a model architecture that processes images as sequences of patch embeddings using self-attention