| Benchmark | Metric | Baseline (IRIS) | This Paper (DINO-WM) | Δ (This Paper − Baseline) |
|---|---|---|---|---|
| DINO-WM achieves lower LPIPS than the IRIS baseline on visual reconstruction in all three environments (lower LPIPS is better), with reductions of 0.036 to 0.110. | ||||
| Push T | LPIPS | 0.198 | 0.088 | -0.110 |
| RoboYoga | LPIPS | 0.158 | 0.063 | -0.095 |
| Franka Kitchen | LPIPS | 0.076 | 0.040 | -0.036 |
| DINO-WM achieves higher success rates than IRIS on zero-shot planning tasks in all three environments (higher is better), with gains of 0.28 to 0.41. | ||||
| Push T | Success Rate | 0.14 | 0.45 | +0.31 |
| RoboYoga | Success Rate | 0.33 | 0.74 | +0.41 |
| Franka Kitchen | Success Rate | 0.10 | 0.38 | +0.28 |
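
The Δ column is the DINO-WM value minus the IRIS baseline value. As a rough sketch of that arithmetic, the snippet below recomputes the differences from the numbers in the table and also derives a relative change, which the table itself does not report. The names `RESULTS`, `delta`, `relative_change`, and `summarize` are illustrative assumptions, not part of any released code.

```python
# Values copied from the table above; each tuple is (IRIS baseline, DINO-WM).
# The dictionary layout and all function names are hypothetical.
RESULTS = {
    "LPIPS": {
        "Push T": (0.198, 0.088),
        "RoboYoga": (0.158, 0.063),
        "Franka Kitchen": (0.076, 0.040),
    },
    "Success Rate": {
        "Push T": (0.14, 0.45),
        "RoboYoga": (0.33, 0.74),
        "Franka Kitchen": (0.10, 0.38),
    },
}


def delta(baseline, ours, ndigits=3):
    """Absolute change (ours - baseline), rounded to hide floating-point noise."""
    return round(ours - baseline, ndigits)


def relative_change(baseline, ours):
    """Change expressed as a fraction of the baseline value."""
    return (ours - baseline) / baseline


def summarize(results):
    """Flatten the nested results into (environment, metric, delta, relative) rows."""
    rows = []
    for metric, envs in results.items():
        for env, (baseline, ours) in envs.items():
            rows.append(
                (env, metric, delta(baseline, ours), relative_change(baseline, ours))
            )
    return rows


if __name__ == "__main__":
    for env, metric, d, rel in summarize(RESULTS):
        print(f"{env:15s} {metric:13s} delta={d:+.3f} relative={rel:+.1%}")
```

Note that the sign of Δ has opposite meanings for the two metrics: a negative Δ is an improvement for LPIPS, while a positive Δ is an improvement for success rate.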