Generative Recommendation (GR): A paradigm where an LLM directly generates the identifier (Semantic ID) of the recommended item token-by-token, rather than classifying or ranking existing items
Semantic ID: A multi-token discrete code (e.g., derived from RQ-VAE) used to represent a specific item in the LLM's vocabulary
Speculative Decoding (SD): An inference acceleration technique where a cheaper 'draft' model guesses future tokens, which are then verified in parallel by the main 'target' model
Self-drafting: A variant of SD where the main model itself (via an auxiliary head) generates draft tokens, avoiding the need for a separate draft model
Hallucination: In this context, the generation of a Semantic ID sequence that does not correspond to any valid item in the catalog
Model-free verification: NEZHA's technique of checking drafted tokens against a pre-computed set of valid IDs, rather than running the LLM to verify them
Recall@K: A metric measuring the proportion of relevant (ground-truth) items that appear in the top-K recommendations
KV-Caching: Optimization that stores Key and Value states of attention mechanisms to avoid recomputing past context during autoregressive generation
RQ-VAE: Residual-Quantized Variational AutoEncoder, a method often used to create discrete Semantic IDs for items
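
To make the definitions of Hallucination, Model-free verification, and Recall@K concrete, here is a minimal Python sketch. It is not NEZHA's actual implementation: the toy catalog, the representation of a Semantic ID as a tuple of integers, and the helper names build_prefix_set, accepted_length, and recall_at_k are all assumptions made for illustration.

```python
from typing import Iterable, Sequence, Set, Tuple

# Assumption: a Semantic ID is a fixed-length tuple of integer codes.
SemanticID = Tuple[int, ...]


def build_prefix_set(catalog: Iterable[SemanticID]) -> Set[SemanticID]:
    """Pre-compute every non-empty prefix of every valid Semantic ID."""
    prefixes: Set[SemanticID] = set()
    for sid in catalog:
        for end in range(1, len(sid) + 1):
            prefixes.add(sid[:end])
    return prefixes


def accepted_length(draft: Sequence[int], prefixes: Set[SemanticID]) -> int:
    """Count the leading draft tokens that still match some catalog item.

    This is a lookup only; no model forward pass is involved.
    """
    accepted = 0
    for end in range(1, len(draft) + 1):
        if tuple(draft[:end]) not in prefixes:
            break  # from here on the draft cannot name a real item
        accepted = end
    return accepted


def recall_at_k(recommended: Sequence[SemanticID],
                relevant: Iterable[SemanticID], k: int) -> float:
    """Fraction of relevant items that appear in the first k recommendations."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = len(set(recommended[:k]) & relevant_set)
    return hits / len(relevant_set)


# Hypothetical three-item catalog with three-token Semantic IDs.
catalog = [(3, 1, 4), (3, 1, 5), (2, 7, 1)]
prefixes = build_prefix_set(catalog)

valid_draft = accepted_length((3, 1, 4), prefixes)         # whole ID is valid
hallucinated_draft = accepted_length((3, 1, 9), prefixes)  # last token is invalid
score = recall_at_k([(3, 1, 4), (2, 7, 1), (3, 1, 5)],
                    [(3, 1, 4), (3, 1, 5)], k=2)
```

In this sketch a draft is a hallucination whenever accepted_length returns fewer tokens than the full Semantic ID length. A prefix set is used here for simplicity; a trie would serve the same purpose with less memory for large catalogs.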