MM-LLM: Multi-Modal Large Language Model—an AI system capable of processing and reasoning across multiple data types (e.g., text, images, video)
Long-tail events: Rare, low-probability scenarios in data distributions (e.g., construction sites, jaywalkers) that are difficult for models to learn due to scarcity
PARA-Drive: A parallelized modular end-to-end autonomous driving model used here as the scene tokenizer
BEV: Bird's-Eye View—a top-down perspective of the driving scene, commonly used in autonomous driving perception
LoRA: Low-Rank Adaptation—a technique to fine-tune large language models efficiently by freezing the pretrained weights and training only small low-rank matrices added alongside them
Object-centric tokenization: Converting a scene into discrete tokens where each token represents a specific entity (car, pedestrian) rather than a patch of pixels
L2 error: Euclidean distance error—a standard metric for measuring the difference between predicted and ground-truth trajectories
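
As an illustration of the L2 error entry, the following is a minimal sketch of how the metric could be computed for a single planned trajectory. It assumes 2D (x, y) waypoints sampled at matching timesteps; the function names and sample coordinates are hypothetical, and evaluation protocols differ in how they average over horizons, so this is not necessarily the exact procedure used in any given benchmark.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # assumed (x, y) position, e.g. in meters


def l2_errors(pred: Sequence[Point], gt: Sequence[Point]) -> List[float]:
    """Per-waypoint Euclidean distance between two equal-length trajectories."""
    if len(pred) != len(gt):
        raise ValueError("trajectories must have the same number of waypoints")
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]


def mean_l2_error(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Average of the per-waypoint L2 errors over the whole trajectory."""
    errors = l2_errors(pred, gt)
    if not errors:
        raise ValueError("trajectories must contain at least one waypoint")
    return sum(errors) / len(errors)


# Hypothetical waypoints, chosen only to make the arithmetic easy to follow.
predicted = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
ground_truth = [(0.0, 0.0), (0.0, 0.0), (6.0, 7.0)]

print(l2_errors(predicted, ground_truth))      # [0.0, 5.0, 1.0]
print(mean_l2_error(predicted, ground_truth))  # 2.0
```

Lower values mean the predicted trajectory stays closer to the ground truth; an error of 0 at a waypoint means the two positions coincide.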