HiP: Compositional Foundation Models for Hierarchical Planning, the proposed system that chains language, vision, and action models
Inverse Dynamics Model: A model that predicts the action required to transition between two observed states (frames)
Video Diffusion Model: A generative model that creates video sequences from noise, conditioned on text or images, used here for visual planning
Ego4D: A large-scale egocentric (first-person view) video dataset used for pre-training the video model
VC-1: A pre-trained visual representation model designed for robotics, used to initialize the inverse dynamics model
Iterative Refinement: A feedback process where the feasibility of a plan at a lower level (e.g., action) updates the probability of the plan at a higher level (e.g., vision)
Density Ratio Estimation: A technique that estimates the ratio between two probability densities by training a classifier to tell their samples apart; used here to score how likely a sample is under a conditional distribution by distinguishing feasible from infeasible samples
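
The classifier idea behind density ratio estimation can be illustrated with a minimal sketch. This is not the HiP implementation: the one-dimensional Gaussians standing in for "feasible" and "infeasible" samples, the logistic-regression classifier, and all function names (`fit_logistic`, `density_ratio`) are assumptions chosen for illustration. With equal numbers of samples from two densities p and q, a well-trained classifier output d(x) approximates p(x) / (p(x) + q(x)), so d(x) / (1 - d(x)) approximates the ratio p(x) / q(x).

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, steps=300):
    """Fit a 1-D logistic regression by full-batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # gradient of the log loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def density_ratio(x, w, b):
    """Turn the classifier output d(x) into an estimate of p(x) / q(x)."""
    d = sigmoid(w * x + b)
    return d / (1.0 - d)


# Toy stand-ins (assumption): "feasible" ~ N(+1, 1), "infeasible" ~ N(-1, 1).
rng = random.Random(0)
feasible = [rng.gauss(1.0, 1.0) for _ in range(1000)]
infeasible = [rng.gauss(-1.0, 1.0) for _ in range(1000)]

xs = feasible + infeasible
ys = [1.0] * len(feasible) + [0.0] * len(infeasible)

w, b = fit_logistic(xs, ys)
ratio_at_half = density_ratio(0.5, w, b)

# For these two Gaussians the true ratio is exp(2x), so exp(1) at x = 0.5.
print(f"w={w:.2f}, b={b:.2f}, estimated ratio at x=0.5: {ratio_at_half:.2f}")
```

For this toy setup the true log-ratio is 2x, so the learned weight should land near 2 and the bias near 0, up to sampling noise. In a planning setting the same ratio would serve as a feasibility score for a candidate sample.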