SCID: Semantic-Collaborative ID—discrete tokens that encode both the semantic meaning (text) and collaborative patterns (interactions) of users or items
DPO: Direct Preference Optimization—an algorithm for aligning language models to preferences without a separate reward model, used here to tune recommendations
SFT: Supervised Fine-Tuning—training the model on labeled data (user interaction sequences) to establish initial capabilities
Collaborative signals: Information derived from the history of user-item interactions (e.g., who bought what) rather than just the content of the items
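RQ-VAE: Residual-Quantized Variational AutoEncoder—a method that compresses a continuous embedding into a short sequence of discrete codes (tokens) for the LLM by quantizing the embedding, then repeatedly quantizing the residual left over from the previous level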
Self-Play: A training strategy in which the model generates its own data, interacting with itself rather than with real users, to create diverse training examples for preference learning
NTP: Next Token Prediction—the standard training objective for language models
SP-DPO: Self-Play Direct Preference Optimization—using self-generated data for preference alignment
RF-DPO: Real-world Feedback Direct Preference Optimization—using actual user feedback (clicks, likes) for preference alignment
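
Two of the entries above (RQ-VAE and DPO) describe mechanisms that are easier to follow with a small example. The sketch below is illustrative only and is not taken from the source: `residual_quantize` shows just the quantization step of a residual quantizer (no encoder, decoder, or training), and `dpo_loss` shows the standard per-pair DPO objective. The function names, the toy codebooks, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def residual_quantize(vector, codebooks):
    """Map a continuous vector to one code index per codebook level.

    Hypothetical helper: each level picks the nearest codeword to the
    current residual, then subtracts it before moving to the next level.
    """
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        # Nearest codeword by squared Euclidean distance.
        best = min(
            range(len(codebook)),
            key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, codebook[i])),
        )
        codes.append(best)
        residual = [r - c for r, c in zip(residual, codebook[best])]
    return codes, residual


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    The margin compares how much the policy prefers the chosen response
    over the rejected one, relative to a frozen reference model.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable softplus(-margin) = -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy two-level codebooks (assumed values, for illustration only).
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
codes, leftover = residual_quantize([1.2, 0.8], toy_codebooks)
```

In this sketch, `codes` plays the role of the discrete token sequence that stands in for an embedding, and `leftover` is the quantization error that remains after the last level. For `dpo_loss`, the only difference between SP-DPO and RF-DPO as defined above is where the chosen/rejected pairs come from (self-generated data versus real user feedback); the loss itself is the same.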