DPO: Direct Preference Optimization—a method that aligns a model with preference data directly, without training a separate reward model for the policy update. OneRec still uses a reward model, but only to *select* the preference data that DPO trains on
MoE: Mixture-of-Experts—a neural network architecture in which a router sends each input to a small subset of specialized sub-networks ('experts'), so the total parameter count can grow very large while per-input inference cost stays low
RQ-VAE: Residual Quantized Variational AutoEncoder—a method used to compress high-dimensional vectors (like item embeddings) into discrete codes (semantic IDs) for generation
Cascade Ranking: The traditional industry-standard pipeline consisting of multiple stages (recall, pre-ranking, ranking, re-ranking) that filter millions of items down to a few dozen
Self-hard negative sampling: A strategy where the model's own high-probability but low-reward generations are used as negative examples during training to force it to distinguish fine-grained differences
Session-wise generation: Generating a complete list of items (a session) in one go, rather than predicting just the single next item
Semantic IDs: Discrete tokens representing items (videos) derived from their content embeddings, allowing a language model to 'generate' items
IPA: Iterative Preference Alignment—repeatedly generating samples, scoring them with a reward model, and retraining the generator using DPO on the new data
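
To make the DPO, self-hard negative sampling, and IPA entries concrete, here is a minimal sketch of one preference-alignment step in plain Python. It is an illustration under stated assumptions, not OneRec's implementation: the function names `select_preference_pair` and `dpo_loss`, the choice of the highest- and lowest-reward candidates as the pair, and the value `beta=0.1` are all hypothetical. The loss itself is the standard DPO objective for a single pair, computed from the summed log-probabilities that the current policy and a frozen reference model assign to each candidate.

```python
import math


def select_preference_pair(candidates, reward_fn):
    """Pick a (chosen, rejected) pair from the model's own generations.

    Assumption for illustration: the highest-reward candidate is 'chosen'
    and the lowest-reward candidate is 'rejected' (a self-hard negative,
    since the model itself produced it).
    """
    ranked = sorted(candidates, key=reward_fn)
    return ranked[-1], ranked[0]


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    margin compares how much more the policy prefers the chosen sample
    over the rejected one, relative to the frozen reference model.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    x = beta * margin
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Hypothetical reward-model scores for three generated sessions.
rewards = {"session_a": 0.2, "session_b": 0.9, "session_c": 0.5}
chosen, rejected = select_preference_pair(list(rewards), rewards.get)

# Hypothetical log-probabilities from the policy and the reference model.
loss = dpo_loss(logp_chosen=-1.0, logp_rejected=-3.0,
                ref_logp_chosen=-2.0, ref_logp_rejected=-2.0)
```

In an iterative setup such as IPA, this select-then-train step would be repeated: the updated generator produces new candidates, the reward model scores them, and the next round of DPO uses the fresh pairs.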