Semantic IDs: Discrete token sequences that represent documents, constructed so that semantically similar documents share common prefixes (the codes are derived from hierarchical clustering or quantization)
RQ-VAE: Residual Quantized Variational AutoEncoder, a method that compresses vectors into discrete codes by quantizing, at each level, the residual left over from the previous levels
Conflict Index: A non-semantic integer appended to a semantic ID to distinguish multiple documents that map to the same semantic prefix
ECM: Exhaustive Candidate Matching, a proposed global search algorithm that finds the optimal unique ID assignment by evaluating all combinations of top-k candidates
RRS: Recursive Residual Searching, a proposed greedy algorithm that builds unique IDs level-by-level, backtracking if conflicts occur
Cold-start: The scenario of recommending or retrieving items that have little to no prior interaction history
Centroid: The center point of a cluster in the quantization codebook
Residual: The difference between the original vector and the sum of the centroids selected at the quantization levels processed so far
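
To make the terms above concrete, here is a minimal sketch of how centroids, residuals, semantic IDs, and conflict indices could fit together. It assumes fixed, hand-picked toy codebooks and plain nearest-centroid residual quantization; it does not train an RQ-VAE (there is no encoder or decoder), and it implements neither ECM nor RRS. The names `nearest`, `residual_quantize`, `assign_ids`, `CODEBOOKS`, and `DOCS` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def nearest(vec, codebook):
    """Return the index of the centroid closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )


def residual_quantize(vec, codebooks):
    """Quantize vec level by level; each level encodes the residual of the previous one."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        codes.append(idx)
        # Residual: what remains after subtracting the selected centroid.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(codes), residual


def assign_ids(vectors, codebooks):
    """Append a conflict index so documents with the same codes still get unique IDs."""
    seen = defaultdict(int)  # semantic prefix -> number of documents seen so far
    ids = []
    for vec in vectors:
        codes, _ = residual_quantize(vec, codebooks)
        ids.append(codes + (seen[codes],))  # last element is the conflict index
        seen[codes] += 1
    return ids


# Toy codebooks (assumed values): two levels, two centroids per level.
CODEBOOKS = [
    [[0.0, 0.0], [10.0, 10.0]],  # level 1: coarse clusters
    [[-1.0, -1.0], [1.0, 1.0]],  # level 2: refinement of the level-1 residual
]
DOCS = [[1.1, 0.9], [0.9, 1.2], [9.0, 9.2]]
IDS = assign_ids(DOCS, CODEBOOKS)
print(IDS)  # [(0, 1, 0), (0, 1, 1), (1, 0, 0)]
```

The first two documents are close to each other, so they receive the same semantic prefix `(0, 1)` and are separated only by the conflict index (0 and 1). The third document falls in a different coarse cluster and gets its own prefix, so its conflict index stays at 0.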