Semantic IDs: Discrete tokens (codes) used to represent items in an LLM, generated by quantizing continuous embeddings
Embedding Collapse: A phenomenon where an embedding matrix uses only a small subspace of its available dimensions (low rank), limiting expressiveness
Catastrophic Forgetting: The loss of previously learned patterns (here, distance relationships) when a model is trained on a new task or with new initializations
RQ-VAE: Residual Quantized Variational Autoencoder—a model that compresses embeddings into a sequence of discrete codes hierarchically
MMD: Maximum Mean Discrepancy—a statistical distance metric used here to match the distribution of reconstructed embeddings to original ones
Kendall's tau: A statistic used to measure the ordinal association between two measured quantities (here, the preservation of distance rankings)
LoRA: Low-Rank Adaptation, an efficient fine-tuning method that freezes the pretrained weights and trains only small low-rank matrices added to them
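
As a rough illustration of two of the terms above, the sketch below shows the residual quantization step that turns an embedding into a short code sequence (the idea behind Semantic IDs from an RQ-VAE) and a plain Kendall's tau for checking whether a ranking is preserved. The function names `residual_quantize` and `kendall_tau`, and the tiny hand-written codebooks, are hypothetical and chosen only for this example. A real RQ-VAE learns its codebooks together with an encoder and decoder, which is not shown here, and the tau variant is the simple tau-a without tie correction.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def residual_quantize(vec, codebooks):
    """Quantize `vec` level by level (illustrative sketch only).

    At each level, pick the nearest codeword to the current residual,
    record its index, and subtract it. The list of indices plays the
    role of a Semantic ID; the final residual is the quantization error.
    """
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = min(range(len(codebook)),
                  key=lambda k: sq_dist(residual, codebook[k]))
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy two-level codebooks (assumed values, not learned).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],       # coarse level
    [[0.0, 0.0], [0.25, -0.25]],    # fine level, applied to the residual
]
codes, residual = residual_quantize([1.2, 0.8], codebooks)

# Rankings of the same three item pairs by distance, before and after
# some transformation: one swap lowers tau below 1.
tau_same = kendall_tau([1, 2, 3], [1, 2, 3])
tau_swap = kendall_tau([1, 2, 3], [1, 3, 2])
```

In this toy setup the code sequence in `codes` would serve as the item's discrete identifier, and a tau close to 1 between original and reconstructed distance rankings would indicate that the distance relationships survived.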