JEPA: Joint-Embedding Predictive Architecture—a learning framework where a model predicts the representation of one part of the data from another part, avoiding pixel-level generation
World Model: An internal simulation of the environment's dynamics, allowing an agent to predict the consequences of its actions before executing them
Latent Space: An abstract, compressed representation of data (e.g., video frames) in which semantically similar states lie close together and pixel-level noise is discarded
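RoPE: Rotary Position Embedding, a method for encoding positional information in transformers by rotating the query and key vectors through position-dependent angles, so that attention scores depend on the relative positions of tokens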
ViT: Vision Transformer—a neural network architecture that processes images or video as sequences of patches using self-attention mechanisms
Tubelet: A 3D patch of video data (height × width × time) used as the input token for video transformers
Zero-shot: The ability of a model to perform a task it was not explicitly trained for, typically by leveraging generalized knowledge
MPPI: Model Predictive Path Integral, a sampling-based control algorithm that draws many random action sequences, simulates their outcomes using a model, and forms its plan as a cost-weighted average of the samples, so that low-cost sequences dominate
Probe: A small, simple classifier trained on top of a frozen pre-trained model to evaluate the quality of its learned representations
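
The MPPI entry above can be made concrete with a short sketch. This is a minimal illustration in NumPy, not the implementation of any particular system: the names `mppi_plan`, `toy_dynamics`, `toy_cost`, and `rollout_cost`, the 1-D toy problem, and all hyperparameter values are assumptions chosen for the example. In the standard formulation the sampled sequences are combined by an exponentially weighted average rather than by keeping a single best sample, and that is what the sketch does.

```python
import numpy as np

GOAL = 1.0  # hypothetical target position for the toy problem


def toy_dynamics(state, action):
    """Hypothetical world model: a 1-D point that moves by the action."""
    return state + action


def toy_cost(state):
    """Hypothetical cost: squared distance of the state from GOAL."""
    return float(np.sum((state - GOAL) ** 2))


def rollout_cost(dynamics, cost, state, actions):
    """Simulate one action sequence and return its accumulated cost."""
    total = 0.0
    for action in actions:
        state = dynamics(state, action)
        total += cost(state)
    return total


def mppi_plan(dynamics, cost, state, horizon=10, action_dim=1,
              num_samples=256, num_iters=5, noise_std=0.5,
              temperature=1.0, seed=0):
    """Return a (horizon, action_dim) plan found by MPPI-style sampling."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))  # nominal action sequence
    for _ in range(num_iters):
        # Sample perturbed action sequences around the nominal plan.
        noise = rng.normal(0.0, noise_std,
                           size=(num_samples, horizon, action_dim))
        actions = mean[None] + noise
        # Simulate every sequence with the model and score it.
        costs = np.array([rollout_cost(dynamics, cost, state, seq)
                          for seq in actions])
        # Exponential weights: low-cost sequences get most of the mass.
        # Subtracting the minimum only improves numerical stability.
        weights = np.exp(-(costs - costs.min()) / temperature)
        weights /= weights.sum()
        # New nominal plan is the cost-weighted average of the samples.
        mean = np.einsum("k,kha->ha", weights, actions)
    return mean


start = np.array([0.0])
plan = mppi_plan(toy_dynamics, toy_cost, start)
planned_cost = rollout_cost(toy_dynamics, toy_cost, start, plan)
baseline_cost = rollout_cost(toy_dynamics, toy_cost, start,
                             np.zeros_like(plan))
```

In a receding-horizon controller only the first action of `plan` would typically be executed before replanning from the new state. The `temperature` parameter controls how sharply the average concentrates on the lowest-cost samples: as it approaches zero the update approaches picking the single best sequence.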