Compositional Foundation Models for Hierarchical Planning

📝 Paper Summary

Hierarchical Planning, Robotic Manipulation, Foundation Models

HiP solves long-horizon robotic tasks by chaining independently trained language, vision, and action foundation models, using iterative refinement classifiers to ensure the output of one model is feasible for the next.

Core Problem

Solving long-horizon tasks in novel environments requires hierarchical reasoning across language, vision, and control, but collecting paired data across all three modalities is expensive and unscalable.

Why it matters:

End-to-end training requires massive, expensive paired language-vision-action datasets that are hard to scale
Existing methods that fine-tune large language models (LLMs) on robot data face barriers because top-tier model weights (e.g., GPT-4) are often closed-source
Naïve composition of independent models fails because abstract language plans may contain subgoals that are physically impossible in the current visual environment

Concrete Example: If a language model suggests 'pick up the kettle from the cabinet', but the video model sees no cabinet in the current room, a naïve chain will fail. HiP uses feedback to reject this plan before attempting execution.

Key Novelty

Iterative Refinement for Compositional Foundation Models (HiP)

Decomposes planning into three independently trained experts: Language (Task), Video Diffusion (Visual), and Inverse Dynamics (Action), avoiding the need for paired tri-modal data
Introduces 'iterative refinement' where lightweight classifiers act as critics, using feedback from downstream models (e.g., 'is this video executable?') to guide the sampling of upstream models (e.g., 'generate a better video')

Architecture

The hierarchical architecture of HiP. It shows the flow from Language Goal -> Task Planner -> Visual Planner -> Action Planner, with backward feedback loops for refinement.

Evaluation Highlights

Demonstrates efficacy on three distinct long-horizon table-top manipulation tasks involving multi-step reasoning
Approach works with models that offer only API access, as the refinement mechanism trains separate classifiers rather than fine-tuning the large foundation models themselves

Breakthrough Assessment

7/10

Offers a pragmatic, modular solution to the data scarcity problem in robotics by leveraging Internet-scale pre-training without requiring expensive end-to-end paired data.

⚙️ Technical Details

Problem Definition

Setting: Hierarchical decision-making for long-horizon goals specified in language

Inputs: Natural language goal g and current observation image x_{i,1}

Outputs: Sequence of actions a_{i, 1:T-1} to achieve the goal

Pipeline Flow

Task Planning: the LLM generates subgoals
Visual Planning: the diffusion model generates a video plan
Action Planning: the inverse dynamics model (IDM) generates motor commands
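
The three stages above can be sketched as a simple chain. This is a minimal illustration under stated assumptions, not the authors' implementation: `hip_plan` and the three planner callables are hypothetical names, observations are represented as strings rather than images, and the feedback-based refinement is omitted for brevity.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in types; the summary does not specify concrete interfaces.
Observation = str
Subgoal = str
Action = str
Video = List[Observation]  # sequence of frames, starting at the current one


def hip_plan(
    goal: str,
    observation: Observation,
    task_planner: Callable[[str, Observation], Sequence[Subgoal]],
    visual_planner: Callable[[Subgoal, Observation], Video],
    action_planner: Callable[[Observation, Observation], Action],
) -> List[Action]:
    """Chain language -> video -> action planning (no refinement feedback)."""
    actions: List[Action] = []
    current = observation
    for subgoal in task_planner(goal, current):
        # Assumes the video plan contains at least one frame.
        video = visual_planner(subgoal, current)
        # Inverse dynamics: one action per pair of consecutive frames,
        # so T frames yield T-1 actions.
        for frame, next_frame in zip(video, video[1:]):
            actions.append(action_planner(frame, next_frame))
        # Simplifying assumption: execution reaches the last planned frame.
        current = video[-1]
    return actions


# Toy stand-ins so the sketch runs end to end.
def toy_task_planner(goal: str, obs: Observation) -> Sequence[Subgoal]:
    return ["open drawer", "place block in drawer"]


def toy_visual_planner(subgoal: Subgoal, obs: Observation) -> Video:
    return [obs, f"{obs} | after: {subgoal}"]


def toy_action_planner(frame: Observation, next_frame: Observation) -> Action:
    return f"act({frame!r} -> {next_frame!r})"


plan = hip_plan(
    "put the block in the drawer",
    "initial scene",
    toy_task_planner,
    toy_visual_planner,
    toy_action_planner,
)
```

Each toy video has two frames, so every subgoal contributes exactly one action to `plan`.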

System Modules

Task Planner

Decompose high-level language goal into a sequence of subgoals

Model or implementation: Pretrained Large Language Model (LLM)

Visual Planner

Generate a physically plausible observation trajectory (video) visualizing the subgoal execution

Model or implementation: Video Diffusion Model (pretrained on Ego4D)

Action Planner

Infer specific motor actions to execute the transitions seen in the video plan

Model or implementation: Inverse Dynamics Model (initialized with VC-1)

Novel Architectural Elements

Feedback hierarchy: Downstream models (action/vision) provide likelihood signals to refine upstream sampling (vision/language) without end-to-end gradient propagation
Classifier-based density estimation: Using auxiliary classifiers to approximate the conditional likelihood of plans instead of expensive Monte Carlo sampling
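
At the language level, one way such classifier feedback could be used is to score a set of LLM-proposed candidate subgoals and keep the highest-scoring one. The sketch below assumes a hypothetical multi-class classifier has already produced one logit per candidate; `select_subgoal`, the logit values, and the second candidate string are illustrative, not taken from the paper.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_subgoal(candidates: List[str], logits: List[float]) -> str:
    """Return the candidate the classifier rates as most consistent.

    `logits` are assumed outputs of a classifier that scores each candidate
    subgoal given the current observation and the language goal.
    """
    if len(candidates) != len(logits):
        raise ValueError("one logit per candidate is required")
    probs = softmax(logits)
    best = max(range(len(candidates)), key=probs.__getitem__)
    return candidates[best]


# Illustrative case: no cabinet is visible, so that subgoal scores low.
candidates = [
    "pick up the kettle from the cabinet",
    "pick up the kettle from the table",
]
chosen = select_subgoal(candidates, logits=[-2.0, 1.5])
```

Because only classifier scores are needed, the LLM itself is never updated, which is what makes this style of feedback compatible with API-only models.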

Modeling

Base Model: Composite of LLM (Task), Video Diffusion (Visual), and Inverse Dynamics (Action)

Training Method: Modular training/finetuning of individual components + training lightweight classifiers for consistency

Objective Functions:

Purpose: Ensure selected subgoal maximizes consistency with visual observation.

Formally: Maximize p(x_{i,1}|w_i, g)/p(x_{i,1}|g) estimated via a multi-class classifier

Purpose: Ensure generated video is actionable.

Formally: Bias video denoising using log-likelihood from a binary classifier g_ψ(τ_x) that distinguishes feasible/infeasible trajectories

Purpose: Train inverse dynamics to map visual changes to actions.

Formally: Maximize p_ψ(a_{i,t} | x_{i,t}, x_{i,t+1})
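
The first two objectives above can be written out more explicitly. The first line is Bayes' rule, which shows why a classifier over subgoals can stand in for the likelihood ratio. The second and third lines are an assumed, standard classifier-guidance form for the video refinement: the guidance weight \omega and the tilted distribution \tilde{p} are notation introduced here for illustration and do not appear in the summary.

```latex
\begin{align*}
% Subgoal selection: the likelihood ratio equals a posterior-to-prior ratio.
\frac{p(x_{i,1} \mid w_i, g)}{p(x_{i,1} \mid g)}
  &= \frac{p(w_i \mid x_{i,1}, g)}{p(w_i \mid g)} \\[4pt]
% Video refinement: reweight the video model by the feasibility classifier.
\tilde{p}(\tau_x \mid x_{i,1}, w_i)
  &\propto p(\tau_x \mid x_{i,1}, w_i)\, g_\psi(\tau_x)^{\omega} \\[4pt]
% Resulting score used to bias each denoising step.
\nabla_{\tau_x} \log \tilde{p}(\tau_x \mid x_{i,1}, w_i)
  &= \nabla_{\tau_x} \log p(\tau_x \mid x_{i,1}, w_i)
   + \omega\, \nabla_{\tau_x} \log g_\psi(\tau_x)
\end{align*}
```

If the candidate subgoals are treated as equally likely a priori, p(w_i | g) is constant across candidates, so maximizing the ratio reduces to picking the subgoal with the highest classifier posterior p(w_i | x_{i,1}, g).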

Training Data:

D_classify: {observation, goal, candidate_subgoals, correct_label}
D_video: {observation_trajectory, subgoal} (Finetuning)
D_inv: {observation_trajectory, action_trajectory} (Finetuning)

Compute: Refinement is computationally efficient as it does not require finetuning the large foundation models themselves

Comparison to Prior Work

vs. Gato/RT-1: HiP uses separate experts trained on disparate data sources rather than requiring a single massive paired dataset
vs. PaLM-E: HiP does not require access to LLM weights or expensive LLM finetuning; it uses lightweight external classifiers for grounding

Limitations

Relies on the availability of task-specific data for finetuning the video and inverse dynamics models (though less than end-to-end approaches)
Inference latency may be higher due to the multi-stage generation (text -> video -> action)
The iterative refinement relies on the accuracy of the auxiliary classifiers; if they misjudge feasibility, the resulting plan may be inconsistent across levels
No quantitative results or tables were present in the provided text snippet to verify performance gains

Reproducibility

The paper outlines the method and the pretrained resources it builds on (the Ego4D dataset and the VC-1 model) but does not explicitly provide a code repository URL in the text. Task-specific datasets (D_video, D_inv) are implied to be collected for the specific tabletop environments.

📊 Experiments & Results

Evaluation Setup

Long-horizon tabletop manipulation tasks requiring multi-step reasoning

Benchmarks:

Table-top manipulation tasks (Robotic Control) [New]

Metrics:

Task Success Rate
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

The paper qualitatively asserts that HiP enables agents to solve novel long-horizon tasks by effectively decomposing them.
The compositional approach allows leveraging diverse internet data (text, video) without needing them to be paired with actions.
Iterative refinement is crucial for preventing 'hallucinated' plans that look plausible in text but are impossible to execute physically.
Note: Specific numerical results (success rates, baselines comparisons) were not available in the provided text snippet.

📚 Prerequisite Knowledge

Prerequisites

Understanding of diffusion models for video generation
Basics of Inverse Dynamics Models (IDM)
Knowledge of Large Language Models (LLMs) for planning
Bayesian inference (conditional probability and density estimation)

Key Terms

HiP: The system proposed in Compositional Foundation Models for Hierarchical Planning, which chains language, vision, and action models

Inverse Dynamics Model: A model that predicts the action required to transition between two observed states (frames)

Video Diffusion Model: A generative model that creates video sequences from noise, conditioned on text or images, used here for visual planning

Ego4D: A large-scale ego-centric (first-person view) video dataset used for pre-training the video model

VC-1: A pre-trained visual representation model designed for robotics, used to initialize the inverse dynamics model

Iterative Refinement: A feedback process where the feasibility of a plan at a lower level (e.g., action) updates the probability of the plan at a higher level (e.g., vision)

Density Ratio Estimation: A technique that estimates the ratio of two probability densities, here p(x_{i,1}|w_i, g)/p(x_{i,1}|g), by training a classifier to distinguish samples from the two distributions rather than modeling either density directly
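
The classifier-based view of density ratio estimation can be illustrated on a toy problem. The sketch below is not from the paper: it assumes two 1-D Gaussians, N(1, 1) and N(0, 1), as stand-ins for the two distributions, and uses a hand-written logistic regression as the classifier. With balanced classes, the fitted logit approximates the log density ratio, which for this pair has the closed form x - 0.5.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, steps=200):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by full-batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # gradient of the log loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


random.seed(0)
n_per_class = 1000
# Label 1: samples from N(1, 1). Label 0: samples from N(0, 1).
xs = [random.gauss(1.0, 1.0) for _ in range(n_per_class)]
xs += [random.gauss(0.0, 1.0) for _ in range(n_per_class)]
ys = [1.0] * n_per_class + [0.0] * n_per_class

w, b = fit_logistic(xs, ys)


def log_ratio_estimate(x: float) -> float:
    """Classifier logit; with balanced classes it estimates log p1(x)/p0(x)."""
    return w * x + b


def log_ratio_exact(x: float) -> float:
    """Closed form for N(1, 1) versus N(0, 1)."""
    return x - 0.5
```

Comparing `log_ratio_estimate(x)` with `log_ratio_exact(x)` at a few points shows how closely the classifier recovers the ratio without ever modeling either density.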