Semantic IDs (SIDs): Discrete, hierarchical token sequences representing items, derived from content embeddings via quantization
LEM: Large Embedding Model—traditional recommendation architecture relying on massive embedding tables for categorical features (e.g., item IDs)
RQ-VAE: Residual-Quantized Variational AutoEncoder—a model used to compress dense embeddings into discrete codes (SIDs) by recursively quantizing residuals
CPT: Continued Pre-Training—an intermediate training stage where an LLM is trained on domain-specific data (user history, item metadata) to align SIDs with text
SFT: Supervised Fine-Tuning—the final training stage optimizing the model for the specific recommendation objective (predicting the next clicked video SID)
MoE: Mixture-of-Experts, a neural network architecture in which only a subset of sub-networks (experts) is activated for each input, so total parameter count can grow without a proportional increase in per-input compute
Generative Retrieval: A paradigm where the model directly generates item identifiers (like SIDs) rather than selecting them via dot-product similarity search
Co-occurrence contrastive loss: A loss function used during SID training that pulls the representations of co-watched items closer together, injecting collaborative filtering signals
Progressive Masking: A technique in SID training that randomly masks deeper codebook levels to enforce hierarchical structure and robustness
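
To make the SID and RQ-VAE entries above concrete, here is a minimal sketch of the residual quantization step alone: each level picks the nearest codebook entry, subtracts it, and passes the remainder to the next level. It is an illustration under stated assumptions, not the training procedure of any particular system. The codebooks are hand-picked toy values (a real RQ-VAE learns them jointly with an encoder and decoder), and the names `nearest_code`, `residual_quantize`, `reconstruct`, and `CODEBOOKS` are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def residual_quantize(embedding: Vector,
                      codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Map a dense embedding to one code index per level (a toy semantic ID)."""
    residual = list(embedding)
    sid = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        sid.append(idx)
        # The next level only sees what this level failed to explain.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return sid


def reconstruct(sid: Sequence[int],
                codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Approximate the original embedding by summing the selected codes."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(sid, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Assumed toy codebooks: level 1 is coarse, level 2 refines the residual.
CODEBOOKS = [
    [(0.0, 0.0), (4.0, 4.0)],
    [(-1.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
]

if __name__ == "__main__":
    sid = residual_quantize((5.0, 4.0), CODEBOOKS)
    print(sid, reconstruct(sid, CODEBOOKS))
```

Because the first index is chosen before any residual is taken, items that share a leading code lie in the same coarse region of embedding space. That is the sense in which SIDs are hierarchical, and it is why truncating or masking deeper levels still leaves a usable, coarser identifier.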