V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

📝 Paper Summary

Video Understanding · World Models · Robot Learning

V-JEPA 2 scales masked video modeling to a billion parameters to create a generalist world model that enables fine-grained motion understanding and zero-shot robotic planning without generating pixels.

Core Problem

Learning physical world models usually requires massive interaction data (which is scarce) or generative video modeling (which wastes compute predicting unpredictable pixel details like leaves blowing in the wind).

Why it matters:

Robot interaction data is expensive and difficult to scale compared to passive internet video
Generative approaches focus on high-entropy visual details irrelevant to planning, making them computationally inefficient for real-time control
Previous methods struggle to generalize zero-shot to new robotic environments without task-specific fine-tuning

Concrete Example: In a generative approach, a model might spend capacity predicting the exact texture of grass (high entropy) rather than the trajectory of a ball. V-JEPA 2 ignores these unpredictable pixel details by predicting in an abstract latent space, focusing only on dynamics relevant for planning.

Key Novelty

Scaled Latent Video World Model (V-JEPA 2)

Trains a massive video encoder by masking parts of videos and predicting their abstract features (not pixels), forcing the model to learn semantic scene dynamics
Post-trains a lightweight 'world model' module that predicts future latent states conditioned on actions, enabling planning in the abstract space without video generation

Architecture

The V-JEPA meta-architecture showing the masking and prediction mechanism.

Evaluation Highlights

77.3% top-1 accuracy on Something-Something v2 motion understanding task using an attentive probe
39.7 Recall@5 on Epic-Kitchens-100 human action anticipation, a 44% relative improvement over previous state-of-the-art
88.2% average accuracy across 6 video understanding tasks when scaling to 64-frame inputs (+4.0 points over ViT-L baseline)
Zero-shot success on Franka robot manipulation (Pick and Place) using only 62 hours of unlabeled Droid data for world model training
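As a rough consistency check on the Epic-Kitchens-100 figure above, the prior result implied by a 44% relative improvement can be back-computed. The baseline value below is derived from the two numbers in this summary under the usual definition of relative improvement; it is not quoted from the paper.

```latex
\[
\frac{R_{\text{new}} - R_{\text{prev}}}{R_{\text{prev}}} = 0.44
\quad\Longrightarrow\quad
R_{\text{prev}} = \frac{R_{\text{new}}}{1.44} = \frac{39.7}{1.44} \approx 27.6
\]
```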

Breakthrough Assessment

9/10

Demonstrates that self-supervised video learning scales effectively to 1B parameters and transfers directly to robotic planning without task-specific rewards, addressing a major bottleneck in robot learning.

⚙️ Technical Details

Problem Definition

Setting: Self-supervised pre-training on video followed by action-conditioned prediction for planning

Inputs: Sequence of video frames (pixel space)

Outputs: Latent representations of future states or planned action sequences

Pipeline Flow

Pre-training: Video Encoder (ViT-g) → Masking → Predictor (Masked Denoising)
Post-training: Frozen Encoder → Action-Conditioned Predictor (World Model)
Inference/Planning: Current State → World Model (Simulate Futures) → Planner (Select Actions)
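The inference step in the pipeline above (simulate futures, then select actions) can be illustrated with a minimal sampling-based planner. This is a sketch under stated assumptions, not the paper's planner: `toy_world_model` is a hypothetical stand-in for the learned latent predictor, the goal is assumed to be given as a latent vector, the cost is assumed to be an L1 distance in latent space, and the cost-weighted averaging is borrowed from MPPI-style control.

```python
import numpy as np

def toy_world_model(state, action):
    """Hypothetical stand-in for a learned latent predictor: next = state + action."""
    return state + action

def plan_first_action(state, goal, world_model, horizon=3, num_samples=256,
                      action_dim=2, temperature=0.1, seed=0):
    """Sample action sequences, roll them out in latent space, and return a
    cost-weighted average of the first actions (MPPI-style weighting)."""
    rng = np.random.default_rng(seed)
    # Candidate action sequences with shape (num_samples, horizon, action_dim).
    actions = rng.uniform(-1.0, 1.0, size=(num_samples, horizon, action_dim))
    states = np.repeat(state[None, :], num_samples, axis=0)
    for t in range(horizon):
        # Simulate one step for every candidate sequence at once.
        states = world_model(states, actions[:, t])
    # Assumed cost: L1 distance between the final predicted latent and the goal.
    costs = np.abs(states - goal[None, :]).sum(axis=1)
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    return (weights[:, None] * actions[:, 0]).sum(axis=0)

if __name__ == "__main__":
    first = plan_first_action(np.zeros(2), np.array([0.5, -0.5]),
                              toy_world_model, horizon=1, num_samples=4096)
    print(first)  # close to the goal direction for this toy model
```

In a receding-horizon setup, only the returned first action would be executed before re-planning from the newly observed state.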

System Modules

Video Encoder

Extracts abstract latent representations from raw video frames

Model or implementation: ViT-g (1 billion parameters) with 3D-RoPE

Predictor

Predicts representations of masked video regions during pre-training

Model or implementation: Vision Transformer (smaller than encoder)

Action-Conditioned World Model

Predicts the next latent state given the current state and a proposed action

Model or implementation: 300M parameter Transformer with block-causal attention
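Block-causal attention can be illustrated with a small mask builder. This is a sketch that assumes tokens are ordered frame by frame with a fixed number of tokens per frame; the actual token layout of the world model is not described in this summary. Tokens attend freely within their own frame and to all earlier frames, but never to later frames.

```python
import numpy as np

def block_causal_mask(num_frames, tokens_per_frame):
    """Boolean mask: entry [i, j] is True when query token i may attend to key token j."""
    # Frame index of each token, e.g. [0, 0, 1, 1, 2, 2] for 3 frames of 2 tokens.
    frame_index = np.repeat(np.arange(num_frames), tokens_per_frame)
    # A token may see any token from the same frame or an earlier frame.
    return frame_index[:, None] >= frame_index[None, :]

if __name__ == "__main__":
    print(block_causal_mask(3, 2).astype(int))
```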

Novel Architectural Elements

Use of 3D-RoPE (Rotary Position Embeddings) partitioned across temporal, height, and width axes for stability in 1B+ parameter video models
Two-stage training: action-free large-scale video pre-training followed by lightweight action-conditioned post-training in latent space
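The idea of partitioning rotary embeddings across the temporal, height, and width axes can be sketched as follows. The equal three-way split of the feature dimension, the frequency base of 10000, and the function names are assumptions made for illustration; the summary does not give these details.

```python
import numpy as np

def rope_1d(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x (N, D) by angles set by positions (N,)."""
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)      # (D/2,) per-pair frequencies
    angles = positions[:, None] * freqs[None, :]   # (N, D/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def rope_3d(x, t_pos, h_pos, w_pos):
    """Split features into three equal segments; rotate each by one axis position."""
    d = x.shape[-1]
    assert d % 6 == 0, "feature dim must split into three even-sized segments"
    seg = d // 3
    parts = [rope_1d(x[:, i * seg:(i + 1) * seg], pos)
             for i, pos in enumerate((t_pos, h_pos, w_pos))]
    return np.concatenate(parts, axis=-1)
```

Because each segment is a pure rotation, the dot product between a rotated query and a rotated key depends only on the relative offset along each axis, which is the property that makes rotary embeddings attractive for attention.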

Modeling

Base Model: ViT-g (1 Billion parameters) encoder

Training Method: Stage 1: V-JEPA Masked Denoising. Stage 2: Latent Action-Conditioned Prediction.

Objective Functions:

Purpose: Train encoder to capture semantic structure.

Formally: L1 distance between the predicted representation of a masked region and its actual representation from the target encoder (an exponential moving average, EMA, of the encoder weights)

Purpose: Train world model to predict dynamics.

Formally: Autoregressive prediction of the next latent frame given the history and action
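The first objective above can be sketched with two small helpers: an L1 loss restricted to masked tokens, and an EMA update for the target encoder. The function names, the dictionary-of-arrays parameter format, and the momentum value are assumptions for illustration only.

```python
import numpy as np

def l1_feature_loss(predicted, target, mask):
    """Mean absolute error over masked tokens only.

    predicted, target: (N, D) token features; mask: (N,) boolean, True = masked.
    """
    return np.abs(predicted[mask] - target[mask]).mean()

def ema_update(target_params, online_params, momentum=0.99):
    """Move each target-encoder parameter a small step toward the online encoder."""
    return {name: momentum * target_params[name]
                  + (1.0 - momentum) * online_params[name]
            for name in target_params}
```

In a real training loop, gradients would flow only through the predictor and online encoder; the target features would be treated as constants.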

Training Data:

VideoMix22M: 22 million samples drawn from SSv2, Kinetics, HowTo100M, YT-Temporal-1B, and ImageNet (an image dataset)
Droid Dataset: 62 hours of unlabeled robot interaction data

Key Hyperparameters:

encoder_parameters: 1 Billion (ViT-g)
pretraining_iterations: 252,000
clip_duration: Up to 64 frames (progressive scaling)
tubelet_size: 2x16x16
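To see how the tubelet size and clip duration above translate into sequence length, the token count can be computed directly. The sketch reads 2x16x16 as frames × height × width, and the 256×256 resolution in the usage example is an illustrative assumption; the summary does not state the input resolution.

```python
def num_tokens(frames, height, width, tubelet=(2, 16, 16)):
    """Number of tubelet tokens for a clip, with tubelet = (frames, height, width)."""
    t, h, w = tubelet
    if frames % t or height % h or width % w:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    return (frames // t) * (height // h) * (width // w)

if __name__ == "__main__":
    # Assumed 256x256 input: 16-frame versus 64-frame clips.
    print(num_tokens(16, 256, 256))  # 2048
    print(num_tokens(64, 256, 256))  # 8192
```

Since self-attention cost grows roughly quadratically with token count, a fourfold longer clip is considerably more than four times as expensive, which helps motivate training on short clips first.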

Compute: About 60 GPU-years for a full-resolution training baseline; progressive resolution training is used to reduce this cost

Comparison to Prior Work

vs. Gen-2/Sora: V-JEPA 2 predicts in latent space, avoiding computationally expensive and unpredictable pixel generation [not cited in paper]
vs. VideoMAE: Predicts high-level features rather than low-level pixel reconstruction, focusing on semantics over texture
vs. V-JEPA (v1): Scales from 300M to 1B parameters, 2M to 22M videos, and adds action-conditioned planning capabilities

Limitations

Training large-scale video models is computationally intensive (approximately 60 GPU-years for the full-resolution baseline)
Requires a separate post-training stage on interaction data to enable planning capabilities
World model capabilities are bounded by the information captured in the frozen encoder's representation space
Planning evaluation is zero-shot; performance might improve further with in-domain fine-tuning (not explored)

Reproducibility

Code: https://github.com/facebookresearch/vjepa2

The model is pre-trained on a mix of public datasets (Kinetics, SSv2, ImageNet, etc.) and YT-Temporal-1B, which was curated using a retrieval pipeline described in the paper. Robot experiments use the public Droid dataset.

📊 Experiments & Results

Evaluation Setup

Frozen encoder evaluation using attentive probes and zero-shot robotic planning

Benchmarks:

Something-Something v2 (SSv2) (Motion Understanding / Classification)
Epic-Kitchens-100 (Action Anticipation)
PerceptionTest (Video Question Answering)
Franka Robot Manipulation (Robotic Planning: Grasp, Pick & Place) [New]

Metrics:

Top-1 Accuracy
Recall@5
Task Success Rate
Statistical methodology: Not explicitly reported in the paper
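The accuracy-style metrics listed above can be sketched with one helper. This is a plain instance-level top-k hit rate; benchmark-specific variants (for example, averaging recall per class before averaging across classes) would need extra steps that are not shown here.

```python
import numpy as np

def top_k_hit_rate(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: (N, C) class scores; labels: (N,) integer class ids.
    """
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

if __name__ == "__main__":
    scores = np.array([[0.1, 0.7, 0.2],
                       [0.5, 0.3, 0.2],
                       [0.2, 0.3, 0.5]])
    labels = np.array([1, 1, 0])
    print(top_k_hit_rate(scores, labels, k=1))  # top-1 accuracy
    print(top_k_hit_rate(scores, labels, k=2))  # top-2 hit rate
```

With `k=1` this is top-1 accuracy; with `k=5` it gives an instance-level Recall@5. Task success rate is simply the mean of per-trial success flags.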

Key Results

Scaling analysis showing the cumulative benefits of increasing data, model size, training duration, and resolution.

Benchmark	Metric	Baseline	This Paper	Δ
Average of 6 Classification Tasks	Average Accuracy	84.2	88.2	+4.0
Average of 6 Classification Tasks	Average Accuracy	See Note	See Note	+1.4
PerceptionTest	Test Set Accuracy	62.3	84.0	+21.7
TempCompass	Multi-choice Accuracy	56.3	76.9	+20.6

Experiment Figures

A stair-step plot showing the cumulative improvement in average accuracy on understanding tasks as different scaling factors (Data, Model, Schedule, Resolution) are added.

Main Takeaways

Scaling self-supervised video pre-training (data, model size, resolution) yields consistent improvements in downstream understanding tasks.
A model trained purely on passive video can be successfully adapted for robotic planning using a very small amount (62h) of unlabeled interaction data.
The 'feature prediction' objective (JEPA) captures planning-relevant dynamics better than pixel-generative approaches, as evidenced by zero-shot robot manipulation success.
Progressive resolution training (starting low-res/short, ending high-res/long) reduces compute by 8.4x while maintaining performance.
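For a sense of scale, the compute saving in the last takeaway can be worked out. This assumes the 8.4x reduction is measured against the roughly 60 GPU-year full-resolution figure listed under Compute; the resulting number is derived here, not quoted from the paper.

```latex
\[
\frac{60\ \text{GPU-years}}{8.4} \approx 7.1\ \text{GPU-years}
\]
```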

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision Transformers (ViT)
Concept of Self-Supervised Learning (Masked Autoencoders)
Model Predictive Control (MPC) for robotics

Key Terms

JEPA: Joint-Embedding Predictive Architecture—a learning framework where a model predicts the representation of one part of the data from another part, avoiding pixel-level generation

World Model: An internal simulation of the environment's dynamics, allowing an agent to predict the consequences of its actions before executing them

Latent Space: An abstract, compressed representation of data (e.g., video frames) where semantically similar states are close together, ignoring pixel-level noise

RoPE: Rotary Position Embedding—a method for encoding positional information in transformers by rotating the query and key vectors

ViT: Vision Transformer—a neural network architecture that processes images or video as sequences of patches using self-attention mechanisms

Tubelet: A 3D patch of video data (time × height × width, e.g., 2×16×16) used as the input token for video transformers

Zero-shot: The ability of a model to perform a task it was not explicitly trained for, typically by leveraging generalized knowledge

MPPI: Model Predictive Path Integral, a control algorithm that samples many random action sequences, simulates their outcomes using a model, and combines them into an updated plan by weighting each sequence according to its cost

Probe: A small, simple classifier trained on top of a frozen pre-trained model to evaluate the quality of its learned representations
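An attentive probe differs from a linear probe in that it pools the frozen tokens with learned attention before classifying. A minimal single-head, single-layer sketch is shown below; the layer count, head count, and all parameter names are assumptions, and the probe used in the paper may be larger.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_probe(features, query, w_key, w_value, w_out):
    """Cross-attention pooling followed by a linear classifier.

    features: (N, D) frozen encoder tokens; query: (D,) learnable vector;
    w_key, w_value: (D, D); w_out: (D, C). Returns class logits of shape (C,).
    """
    keys = features @ w_key                                  # (N, D)
    values = features @ w_value                              # (N, D)
    attn = softmax(keys @ query / np.sqrt(query.shape[0]))   # (N,) weights
    pooled = attn @ values                                   # (D,) pooled feature
    return pooled @ w_out                                    # (C,) logits
```

Only `query`, `w_key`, `w_value`, and `w_out` would be trained; the encoder that produces `features` stays frozen.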