BEV: Bird's-Eye-View—a top-down cartographic representation of a scene, commonly used in autonomous driving to unify multi-view camera inputs into a single spatial coordinate system
MLLM: Multimodal Large Language Model—an AI model capable of processing both text and visual inputs (images/video) to generate text responses
Q-Former: Querying Transformer—a module that acts as a bridge between frozen image encoders and frozen LLMs, using learnable queries to extract relevant visual features
nuScenes: A large-scale autonomous driving dataset containing multi-view camera, lidar, and radar data with 3D annotations
SQL: Structured Query Language, used here to programmatically query scene metadata (e.g., 'SELECT distance FROM objects WHERE object_id = X', where the table name is illustrative) to generate QA pairs automatically
MAE: Mean Absolute Error—a metric measuring the average magnitude of errors in a set of predictions, used here for distance and speed estimation
BLEU: Bilingual Evaluation Understudy, a metric for evaluating machine-generated text by measuring its n-gram overlap with reference text
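
To make the SQL and MAE entries concrete, the following is a minimal sketch of how a distance question could be generated from scene metadata and how numeric answers could then be scored. The `objects` table, its columns, the question template, and the function names are all assumptions made for illustration; they are not taken from the source or from the nuScenes schema.

```python
import sqlite3


def mean_absolute_error(predictions, targets):
    """Average of |prediction - target| over paired values."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


def build_distance_qa(conn, object_id):
    """Turn one metadata row into a (question, answer, value) triple."""
    row = conn.execute(
        "SELECT category, distance FROM objects WHERE object_id = ?",
        (object_id,),
    ).fetchone()
    if row is None:
        return None  # no such object in the scene metadata
    category, distance = row
    question = f"How far away is the {category} (object {object_id})?"
    answer = f"{distance:.1f} m"
    return question, answer, distance


# Hypothetical in-memory scene metadata (schema assumed for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects (object_id INTEGER PRIMARY KEY, category TEXT, distance REAL)"
)
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?)",
    [(1, "car", 12.5), (2, "pedestrian", 4.0)],
)

if __name__ == "__main__":
    qa_pairs = [build_distance_qa(conn, oid) for oid in (1, 2)]
    ground_truth = [value for _, _, value in qa_pairs]
    model_estimates = [10.0, 5.0]  # made-up model outputs, in metres
    print(qa_pairs[0][0], "->", qa_pairs[0][1])
    print("MAE:", mean_absolute_error(model_estimates, ground_truth))
```

The query uses a bound parameter rather than string formatting, so the same template can be run once per object ID to produce one QA pair per object.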