World Model: A predictive model that simulates future states of the environment based on current states and actions, often used for planning
FID: Fréchet Inception Distance, a metric for generated-image quality that compares the feature distributions of generated and real images, with features taken from an Inception network; lower is better
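FVD: Fréchet Video Distance, an extension of FID to video that compares feature distributions of generated and real clips, capturing temporal coherence as well as per-frame quality; lower is better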
OOD: Out-of-Distribution, describing scenarios that differ significantly from the training data (e.g., the ego vehicle driving off the lane center)
UniAD: Unified Autonomous Driving—a state-of-the-art end-to-end autonomous driving model used as a baseline planner
Reference Views: A subset of camera views (e.g., Front, Back-Left, Back-Right) generated first in the factorization scheme
Stitched Views: Intermediate camera views generated conditioned on adjacent reference views to ensure spatial consistency
End-to-End Planning: An approach in which a single model takes raw sensor data and directly outputs control or trajectory commands, rather than chaining separate perception, prediction, and planning modules
BEV: Bird's Eye View—a top-down perspective of the driving scene
CLIP: Contrastive Language-Image Pre-training, a vision-language model trained to align images and text in a shared embedding space; used to encode text descriptions into embeddings
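
For reference, the FID and FVD entries above both rest on the Fréchet distance between two Gaussians fitted to feature statistics. The standard form is given below. The symbols are notation chosen here, not taken from the source: \mu_r, \Sigma_r are the mean and covariance of features from real samples, and \mu_g, \Sigma_g are those from generated samples.

```latex
d^2\big((\mu_r, \Sigma_r), (\mu_g, \Sigma_g)\big)
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

FID evaluates this on image features from an Inception network. FVD evaluates the same expression on features from a video network, which is how temporal coherence enters the score. Identical distributions give a distance of 0.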