VLLM: Video Large Language Model—a multimodal model trained on internet-scale video-text pairs to understand and reason about video content
MoE: Mixture of Experts, a machine learning technique in which several sub-models (experts) specialize in different parts of the input space and a gating network decides which experts process each input
KV Cache: Key-Value Cache, a technique that speeds up autoregressive transformer inference by storing the attention keys and values computed for earlier tokens so they are not recomputed at each generation step
Language Bottleneck: The loss of information that occurs when rich visual data is compressed into a text summary or caption before being processed by downstream systems
Store-and-Retrieve: An architecture where expensive feature extraction is done offline and stored in a database, allowing the online system to simply look up embeddings for fast inference
Old vs. New Tokens: In this paper, 'old' tokens refer to the original input (video/text) tokens, while 'new' tokens refer to those generated autoregressively by the model
Cold-start: The difficulty of recommending items that have little or no interaction history (e.g., new videos uploaded to a platform)
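
Example (KV Cache): the caching idea defined above can be illustrated with a minimal pure-Python sketch. It is not taken from the paper. The class name KVCache, the helper attend, and the toy query/key/value vectors are all hypothetical, and a real model would obtain these vectors from learned projections. The sketch only shows that appending the newest token's key and value to a cache gives the same attention output as rebuilding every key and value at each step.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class KVCache:
    """Hypothetical cache holding the keys and values of all tokens seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Store only the newest token's key and value, then attend over the cache.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


# Toy (query, key, value) triples, one per token; assumed values for illustration.
tokens = [
    ([1.0, 0.0], [0.5, 0.1], [1.0, 2.0]),
    ([0.0, 1.0], [0.2, 0.9], [3.0, 4.0]),
    ([1.0, 1.0], [0.7, 0.3], [5.0, 6.0]),
]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in tokens]

# Reference path: rebuild the full key and value lists from scratch at every step.
recomputed_outputs = [
    attend(
        tokens[t][0],
        [tok[1] for tok in tokens[: t + 1]],
        [tok[2] for tok in tokens[: t + 1]],
    )
    for t in range(len(tokens))
]
```

Both paths produce the same outputs. The cached path adds one key and one value per step instead of reconstructing the whole history, which is where the inference speedup comes from.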