OneRec: Unifying Retrieve and Rank with Generative Recommender and Preference Alignment

📝 Paper Summary

Generative Recommendation; Large Language Models (LLMs) for Recommendation; Preference Alignment

OneRec replaces traditional cascade ranking systems with a single-stage generative model that produces session-wise video lists, optimized via Direct Preference Optimization using self-hard negative sampling.

Core Problem

Traditional recommender systems rely on a complex cascade of independent rankers (recall, pre-ranking, ranking), where errors propagate and isolated optimization limits overall performance.

Why it matters:

The effectiveness of each isolated stage limits the upper bound of subsequent stages in cascade systems
Current generative retrieval models act only as selectors in the retrieval stage and fail to match the accuracy of well-designed multi-stage rankers
Point-by-point generation lacks context awareness, requiring hand-crafted rules to ensure diversity and coherence within a recommendation session

Concrete Example: In a standard cascade system, if the 'recall' stage fails to retrieve a relevant niche video, the subsequent 'ranking' stage never sees it, making recovery impossible. OneRec generates the final list directly from the full item space, avoiding this bottleneck.

Key Novelty

Unified Single-Stage Generative Session Recommendation

Replaces the multi-stage retrieve-and-rank pipeline with a single encoder-decoder model that generates a full list (session) of items directly from user history
Uses a session-wise generation approach rather than next-item prediction, allowing the model to implicitly learn list-level context, coherence, and diversity
Employs Iterative Preference Alignment with a personalized reward model to select 'self-hard' negative samples from beam search results for DPO training

Architecture

The overall training pipeline of OneRec, including the session-wise generation task and the Iterative Preference Alignment (IPA) process.

Evaluation Highlights

Achieved a 1.6% increase in watch-time in online A/B testing on Kuaishou (a platform with hundreds of millions of DAUs)
Significantly outperforms strong baselines like SASRec and TIGER on offline metrics, particularly in session-based watch time (swt) and view probability (vtr)
Scaling the model using sparse Mixture-of-Experts (MoE) activates only 13% of parameters during inference while maintaining high model capacity

Breakthrough Assessment

8/10

One of the first successful industrial deployments of an end-to-end generative recommender that replaces, rather than augments, the traditional cascade ranking pipeline, with significant online gains.

⚙️ Technical Details

Problem Definition

Setting: Session-wise generative recommendation

Inputs: User historical behavior sequence H_u = {v_1, v_2, ..., v_n}

Outputs: A session list of videos S = {v_1, v_2, ..., v_m} generated auto-regressively

Pipeline Flow

Group: Offline Training -> Reward Model Training -> Iterative DPO
Group: Online Inference -> Encoder -> Decoder (MoE) -> Beam Search
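The final step of the online path, beam search over decoder outputs, can be sketched as follows. This is a minimal illustration, not the deployed system: `toy_decoder` is a hypothetical stand-in for the MoE decoder and returns a fixed distribution over three tokens, whereas the real decoder would condition on the encoded user history and the prefix.

```python
import math
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[int, ...]

def beam_search(
    step_logprobs: Callable[[Prefix], Dict[int, float]],
    length: int,
    beam_size: int,
) -> List[Tuple[Prefix, float]]:
    """Return the beam_size highest-scoring token sequences of the given length."""
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    for _ in range(length):
        expanded = [
            (prefix + (tok,), score + lp)
            for prefix, score in beams
            for tok, lp in step_logprobs(prefix).items()
        ]
        # Keep only the best partial sequences before the next decoding step.
        expanded.sort(key=lambda b: b[1], reverse=True)
        beams = expanded[:beam_size]
    return beams

def toy_decoder(prefix: Prefix) -> Dict[int, float]:
    # Hypothetical decoder stub: same next-token distribution for every prefix.
    return {0: math.log(0.5), 1: math.log(0.3), 2: math.log(0.2)}

beams = beam_search(toy_decoder, length=2, beam_size=3)
for tokens, logp in beams:
    print(tokens, round(math.exp(logp), 3))
```

In OneRec the same candidate set serves two purposes according to this summary: it yields the session shown to the user at inference time, and it supplies the candidates that the Reward Model scores during preference alignment.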

System Modules

Item Tokenizer

Convert video embeddings into discrete hierarchical semantic IDs for the generative model

Model or implementation: Hierarchical K-Means (Balanced)
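To make the tokenizer idea concrete, here is a toy sketch of balanced, multi-level K-Means producing one code per level for each item. It assumes a residual scheme (each level clusters what the previous level left unexplained) and a simple greedy balancing rule; the summary does not spell out either detail, and the function names `balanced_kmeans` and `semantic_ids` are illustrative only.

```python
import numpy as np

def balanced_kmeans(x, k, iters=10, seed=0):
    """Toy balanced K-Means: every cluster receives exactly len(x) // k points."""
    rng = np.random.default_rng(seed)
    n = len(x)
    assert n % k == 0, "toy version assumes n is divisible by k"
    cap = n // k
    centroids = x[rng.choice(n, size=k, replace=False)].copy()
    assign = np.zeros(n, dtype=int)
    for _ in range(iters):
        unassigned = np.arange(n)
        for c in range(k):
            # Greedily give cluster c its `cap` nearest still-unassigned points.
            dist = np.linalg.norm(x[unassigned] - centroids[c], axis=1)
            nearest = unassigned[np.argsort(dist)[:cap]]
            assign[nearest] = c
            centroids[c] = x[nearest].mean(axis=0)
            unassigned = np.setdiff1d(unassigned, nearest)
    return centroids, assign

def semantic_ids(x, k, layers):
    """Quantize each row of x into `layers` codes, one per residual level."""
    residual = x.copy()
    codes = []
    for layer in range(layers):
        centroids, assign = balanced_kmeans(residual, k, seed=layer)
        codes.append(assign)
        residual = residual - centroids[assign]  # pass the leftover to the next level
    return np.stack(codes, axis=1)

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(64, 8))          # stand-in for video embeddings
ids = semantic_ids(embeddings, k=4, layers=3)  # shape (64, 3)
print(ids[:3])
```

Balancing keeps every code equally used at each level, which is the usual motivation for preferring it over unconstrained clustering when the codes become a generation vocabulary.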

User Encoder (Generation Core)

Encode user historical behavior sequences into a latent representation

Model or implementation: Transformer Encoder (T5-based)

Session Decoder (MoE) (Generation Core)

Auto-regressively generate the semantic IDs of the target session videos

Model or implementation: Transformer Decoder with Sparse Mixture-of-Experts (MoE)
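A sparse MoE layer can be illustrated with top-k routing: a gate scores all experts per token, but only the k best experts are evaluated. The sketch below is a generic NumPy version under simplifying assumptions (linear experts, softmax over the selected gate logits, no load-balancing loss); it is not taken from the paper, and all names are hypothetical.

```python
import numpy as np

def moe_layer(x, gate_w, expert_w, top_k=2):
    """x: (tokens, d); gate_w: (d, E); expert_w: (E, d, d)."""
    logits = x @ gate_w                                    # (tokens, E)
    top = np.argsort(-logits, axis=1)[:, :top_k]           # chosen expert ids
    top_logits = np.take_along_axis(logits, top, axis=1)
    # Softmax over the selected experts only.
    w = np.exp(top_logits - top_logits.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(top_k):
            e = top[t, j]
            out[t] += w[t, j] * (x[t] @ expert_w[e])       # only top_k experts run
    return out, top

rng = np.random.default_rng(0)
d, n_experts, n_tokens = 16, 24, 5
x = rng.normal(size=(n_tokens, d))
gate_w = rng.normal(size=(d, n_experts))
expert_w = rng.normal(size=(n_experts, d, d)) / np.sqrt(d)
out, routed = moe_layer(x, gate_w, expert_w, top_k=2)
print(routed)  # two expert indices per token
```

Because each token touches only `top_k` expert matrices, adding experts grows capacity while per-token compute stays roughly constant.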

Reward Model

Score generated sessions to identify 'winner' and 'loser' samples for DPO training

Model or implementation: Target-aware attention + Multi-tower prediction network

Novel Architectural Elements

Replacement of the multi-stage cascade (recall->rank->rerank) with a single Encoder-Decoder generative model
Integration of Sparse MoE specifically within the decoder of a generative recommender to decouple capacity from inference cost
Iterative Preference Alignment loop where a Reward Model selects self-hard negatives from the generator's own Beam Search outputs

Modeling

Base Model: OneRec-1B (T5-like Encoder-Decoder with MoE)

Training Method: Two-stage training: (1) Next Token Prediction (NTP) on high-quality sessions, (2) Iterative Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Learn basic generation capability.

Formally: Cross-entropy loss on the next semantic token, L_NTP = - sum_t log P(s_t | s_<t, H_u)

Purpose: Train the Reward Model to predict user engagement.

Formally: Binary cross-entropy loss L_RM on user feedback (clicks/watches)

Purpose: Align the model with high-reward sessions.

Formally: DPO loss L_DPO = - log sigma( beta * log(pi_theta(winner | H_u) / pi_ref(winner | H_u)) - beta * log(pi_theta(loser | H_u) / pi_ref(loser | H_u)) )
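The NTP and DPO objectives can be evaluated numerically from sequence log-probabilities. The sketch below assumes those log-probabilities are already available as plain floats; `beta=0.1` is an illustrative value, since the summary does not report the one used.

```python
import math
from typing import Sequence

def ntp_loss(token_logprobs: Sequence[float]) -> float:
    """Negative log-likelihood of one session: -sum_t log P(s_t | s_<t, H_u)."""
    return -sum(token_logprobs)

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * [(log pi/pi_ref)(winner) - (log pi/pi_ref)(loser)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy equal to the reference: margin is 0, loss is log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy prefers the winner more than the reference does: loss drops.
print(dpo_loss(-3.0, -7.0, -5.0, -5.0))
```

The loss falls only when the policy raises the winner's likelihood relative to the reference more than it raises the loser's, which is what keeps the aligned model anchored to the NTP-trained one.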

Training Data:

Training uses 'high-quality sessions' defined by: >5 videos watched, high total duration, or explicit interactions (like/share)
Preference pairs constructed via self-hard negative sampling: Beam search generates N=128 candidates; RM scores them; Best = winner, Worst = loser
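The pair-construction rule above (score all beam candidates, take the best as winner and the worst as loser) reduces to an argmax and an argmin over reward scores. In this sketch `toy_reward` is a hypothetical placeholder for the personalized Reward Model, and the three hand-written candidates stand in for the 128 beam search outputs.

```python
from typing import Callable, Sequence, Tuple

Session = Tuple[int, ...]

def select_preference_pair(
    candidates: Sequence[Session],
    reward_fn: Callable[[Session], float],
) -> Tuple[Session, Session]:
    """Return (winner, loser): the highest- and lowest-reward candidates."""
    rewards = [reward_fn(c) for c in candidates]
    idx = range(len(candidates))
    winner = candidates[max(idx, key=rewards.__getitem__)]
    loser = candidates[min(idx, key=rewards.__getitem__)]
    return winner, loser

def toy_reward(session: Session) -> float:
    # Hypothetical reward: share of distinct items, so repetitive lists score low.
    return len(set(session)) / len(session)

candidates = [(1, 1, 1), (1, 2, 2), (1, 2, 3)]
winner, loser = select_preference_pair(candidates, toy_reward)
print(winner, loser)
```

The loser is "self-hard" because it still comes from the model's own beam, meaning the model considered it likely, yet the reward model ranks it last among those candidates.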

Key Hyperparameters:

learning_rate: 2e-4
DPO_sample_ratio: 1%
beam_size: 128
codebook_size_K: 8192
codebook_layers_L: 3
MoE_experts: 24
MoE_activated_experts: 2
session_length_m: 5
history_length_n: 256
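A few quantities follow directly from these hyperparameters. The arithmetic below assumes each video is emitted as exactly L semantic tokens and ignores any separator or special tokens, which the summary does not describe.

```python
codebook_size_K = 8192
codebook_layers_L = 3
session_length_m = 5
moe_experts = 24
moe_activated_experts = 2

# Number of distinct semantic-ID tuples addressable with L codes of size K.
id_space = codebook_size_K ** codebook_layers_L

# Decoder steps needed to emit one session of m videos.
tokens_per_session = session_length_m * codebook_layers_L

# Share of experts that run for a single token.
expert_fraction = moe_activated_experts / moe_experts

print(id_space, tokens_per_session, round(expert_fraction, 4))
```

Two of 24 experts is about 8% of the expert weights, so the 13% activated-parameter figure presumably also counts parameters shared by all tokens (attention, embeddings); the summary does not give that breakdown.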

Compute: Deployed on NVIDIA A800 GPUs. Inference uses float16 quantization and KV-cache. Only 13% parameters activated per token.

Comparison to Prior Work

vs. TIGER: OneRec uses session-wise generation instead of next-item prediction and integrates MoE and DPO
vs. SASRec: OneRec is a generative model predicting semantic IDs directly, rather than a discriminative ranking model
vs. S-DPO: OneRec uses a personalized Reward Model to select *self-hard* negatives (from the model's own beam search) rather than random negatives

Limitations

Relies on a separate Reward Model for constructing preference pairs, which adds complexity
Inference cost for autoregressive generation is generally higher than dot-product retrieval (mitigated here by MoE and caching)
Requires high-quality session data for initial training, which may be sparse in some domains
Specifics of the multimodal embedding model used for item quantization are not detailed

Reproducibility

No code is provided. The method relies on proprietary industrial datasets (Kuaishou logs) and a complex online deployment infrastructure, making exact reproduction difficult without access to similar large-scale user interaction data.

📊 Experiments & Results

Evaluation Setup

Large-scale industrial short-video recommendation (Kuaishou). Both Offline evaluation and Online A/B testing.

Benchmarks:

Kuaishou Production Data (Short video recommendation) [New]

Metrics:

Session Watch Time (swt)
View Probability (vtr)
Follow Probability (wtr)
Like Probability (ltr)

Statistical methodology: Not explicitly reported in the paper

Key Results

Online A/B testing results on Kuaishou main scene (values are relative change in %, with the control group as the 0.0 reference):

Benchmark	Metric	Baseline	This Paper	Δ
Kuaishou Online	Watch Time	0.0	1.6	+1.6%

Offline comparisons against baselines show OneRec's superiority in session-based metrics:

Benchmark	Metric	Baseline	This Paper	Δ
Kuaishou Offline	swt (Session Watch Time)	10.456	11.102	+0.646
Kuaishou Offline	vtr (View Probability)	0.485	0.523	+0.038
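The offline rows report absolute differences, while the online row is a percentage, so the two are not directly comparable. As a rough aid, the offline deltas can be converted to relative lifts from the table values alone; this is simple arithmetic on the reported numbers, not an additional result from the paper.

```python
def relative_lift(baseline: float, value: float) -> float:
    """Percentage change of value over baseline."""
    return (value - baseline) / baseline * 100.0

swt_lift = relative_lift(10.456, 11.102)  # Session Watch Time
vtr_lift = relative_lift(0.485, 0.523)    # View Probability

print(f"swt: {swt_lift:.2f}%  vtr: {vtr_lift:.2f}%")
```

This works out to roughly 6.2% for swt and 7.8% for vtr. Offline metrics here are scored against logged data, so these figures should not be read as predictions of online watch-time gains.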

Main Takeaways

Session-wise generation captures list-level context better than point-wise methods, leading to higher session watch times.
Iterative Preference Alignment (IPA) with self-hard negatives is crucial; standard DPO with random negatives yields smaller gains.
Scaling model capacity via MoE improves performance without a proportional increase in inference FLOPs, making large-scale generative recommendation feasible.
The method unifies recall and ranking into a single stage, simplifying the engineering pipeline while improving accuracy.

📚 Prerequisite Knowledge

Prerequisites

Recommender Systems (Recall/Ranking pipeline)
Transformer Architectures (Encoder-Decoder)
Generative Retrieval / Semantic Indexing
Reinforcement Learning from Human Feedback (RLHF) / DPO

Key Terms

DPO: Direct Preference Optimization, a method that aligns a model with preferences by training directly on (winner, loser) pairs, with no explicit reward model inside the policy update. OneRec still uses a reward model, but only to *select* the pairs used for DPO

MoE: Mixture-of-Experts, a neural network architecture in which sub-networks ('experts') specialize in different inputs and only a few are activated per token, so parameter count can grow without a proportional increase in inference cost

RQ-VAE: Residual Quantized Variational AutoEncoder, a method that compresses high-dimensional vectors (like item embeddings) into discrete codes (semantic IDs) for generation. OneRec's Item Tokenizer is listed above as balanced hierarchical K-Means rather than RQ-VAE

Cascade Ranking: The traditional industrial standard pipeline consisting of multiple stages (recall, pre-ranking, ranking, re-ranking) to filter millions of items down to a few dozen

Self-hard negative sampling: A strategy where the model's own high-probability but low-reward generations are used as negative examples during training to force it to distinguish fine-grained differences

Session-wise generation: Generating a complete list of items (a session) in one go, rather than predicting just the single next item

Semantic IDs: Discrete tokens representing items (videos) derived from their content embeddings, allowing a language model to 'generate' items

IPA: Iterative Preference Alignment, the loop of repeatedly generating samples, scoring them with a reward model, and retraining the generator using DPO on the new data