Generative Recommendation for Large-Scale Advertising

📝 Paper Summary

Generative Recommendation, Computational Advertising, Large-Scale Recommender Systems

GR4AD adapts generative recommendation for high-throughput advertising by introducing lazy autoregressive decoding for speed and a list-wise reinforcement learning objective for business value alignment.

Core Problem

Adapting LLM-style generative recommendation to real-time advertising fails for two reasons: standard autoregressive decoding is too slow for high-traffic multi-candidate generation, and standard next-token training ignores list-wise ranking quality (NDCG) measured on business value (eCPM).

Why it matters:

Advertising systems have strict latency budgets (<100ms) that interactive LLM serving techniques cannot meet when generating hundreds of candidates
Directly applying next-token prediction maximizes token likelihood but does not capture the ranked-list utility that drives ad revenue (eCPM)
Standard tokenization misses critical non-semantic business signals (e.g., conversion type, account ID) that drastically alter ad delivery logic

Concrete Example: Identical ad creatives (same video) might target completely different users based on conversion goals (e.g., 'app install' vs 'purchase'). A standard semantic ID model treats them as identical, causing collisions. GR4AD avoids this by hashing non-semantic business signals into the final token layer.
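The idea in this example can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the SHA-256 hash, the bucket count, and the sample code values are all assumptions chosen for the sketch. It shows two ads that share the same semantic prefix receiving a final token derived from their business signals.

```python
import hashlib

def business_token(conversion_type: str, account_id: str, num_buckets: int = 4096) -> int:
    """Hash non-semantic business signals into a stable bucket index.

    Hypothetical helper: the hash function and bucket count are assumptions.
    A modulo hash can still collide; a production system would need to size
    the bucket space or resolve collisions explicitly.
    """
    key = f"{conversion_type}|{account_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

def build_ua_sid(semantic_codes, conversion_type, account_id, num_buckets=4096):
    """Append the hashed business token to the semantic (content-derived) codes."""
    return tuple(semantic_codes) + (business_token(conversion_type, account_id, num_buckets),)

# Same video creative, so the content-derived codes are identical (made-up values).
same_video_codes = [17, 203, 5]
sid_install = build_ua_sid(same_video_codes, "app_install", "acct_001")
sid_purchase = build_ua_sid(same_video_codes, "purchase", "acct_001")
print(sid_install, sid_purchase)
```

The semantic prefix stays shared, so content similarity is preserved, while the last token carries the delivery-relevant distinction.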

Key Novelty

Production-Oriented Generative Ad Recommendation (GR4AD)

LazyAR Decoder: Delays autoregressive dependencies to later layers, allowing the first K layers to be computed once in parallel for all beams, drastically speeding up multi-candidate generation
RSPO (Ranking-Guided Softmax Preference Optimization): A list-wise reinforcement learning objective that directly optimizes ranking metrics (NDCG) derived from business values (eCPM) rather than just next-token likelihood
UA-SID (Unified Ad Semantic ID): Fuses multimodal content semantics with hash-based business signals (conversion type) to create collision-free, meaningful discrete identifiers

Architecture

Comparison of Standard Autoregressive, DeepSeek-MTP, and the proposed LazyAR decoder architectures.

Evaluation Highlights

+4.2% ad revenue improvement (RPM) in online A/B tests against a production DLRM-based stack serving 400M users
Achieves >500 QPS per L20 GPU with <100ms latency, enabling real-time generative serving at massive scale
+1.1% revenue gain specifically from the RSPO alignment component compared to standard supervised fine-tuning baselines

Breakthrough Assessment

9/10

Successfully deploys generative recommendation in a massive-scale, latency-critical ad system (400M users), solving critical bottlenecks in serving efficiency (LazyAR) and value alignment (RSPO) that previously hindered industrial adoption.

⚙️ Technical Details

Problem Definition

Setting: Generative retrieval and ranking for advertising, where a user context maps to a list of discrete item identifiers (SIDs)

Inputs: User interaction sequence and context features processed into dense embeddings X

Outputs: A ranked list of Unified Advertisement Semantic IDs (UA-SIDs) y representing recommended ads

Pipeline Flow

Input Processing: User history & context → Linear Context Processor → Context Embeddings X
Decoding: X → LazyAR Decoder → UA-SID Sequence Generation
Serving Optimization: Dynamic Beam Serving & Cache → Final Candidates

System Modules

Linear Context Processor

Efficiently encodes heterogeneous user behavior and context features into dense vectors

Model or implementation: Linear projection layers (following LazyDecoder/OneRecV2)

LazyAR Decoder

Generates Semantic IDs using a split architecture: parallel prefix layers (shared across beams) and autoregressive suffix layers

Model or implementation: Modified Transformer Decoder
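To make the split concrete, the following toy sketch separates a context-only prefix computation from a per-beam autoregressive suffix. It is an assumption-laden illustration: single matrix multiplications stand in for Transformer layers, all weights are random, and the exact form of the gated fusion is a guess, since the summary only says the two stacks are "fused via a gated projection".

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 16

# Hypothetical toy parameters; a real decoder would use Transformer blocks.
W_prefix = rng.normal(size=(d, d)) / np.sqrt(d)
W_gate = rng.normal(size=(2 * d, d)) / np.sqrt(2 * d)
W_proj = rng.normal(size=(2 * d, d)) / np.sqrt(2 * d)
W_suffix = rng.normal(size=(d, d)) / np.sqrt(d)
W_out = rng.normal(size=(d, vocab)) / np.sqrt(d)
token_emb = rng.normal(size=(vocab, d))

def prefix_stack(x):
    """Prefix layers: depend only on the context, never on previous tokens."""
    return np.tanh(x @ W_prefix)

def suffix_stack(h_prefix, prev_tokens):
    """Suffix layers: fuse the shared prefix state with each beam's last token."""
    prev = token_emb[prev_tokens]                           # (beams, d)
    shared = np.broadcast_to(h_prefix, prev.shape)          # one prefix result, reused
    both = np.concatenate([shared, prev], axis=-1)          # (beams, 2d)
    gate = 1.0 / (1.0 + np.exp(-(both @ W_gate)))           # sigmoid gate (assumed form)
    fused = gate * (both @ W_proj) + (1.0 - gate) * shared  # gated projection (assumed form)
    return np.tanh(fused @ W_suffix) @ W_out                # (beams, vocab) logits

x = rng.normal(size=(d,))              # context embedding for one request
h = prefix_stack(x)                    # computed once, shared by every beam
prev_tokens = np.array([3, 7, 7, 11])  # last token of each of four beams
logits = suffix_stack(h, prev_tokens)
print(logits.shape)
```

Only `suffix_stack` scales with the number of beams, which is where the claimed saving for wide beam search would come from.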

Dynamic Beam Serving

Adjusts beam width dynamically based on generation depth and real-time system load

Model or implementation: Heuristic controller
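The summary does not specify the controller's rule, so the following is only one plausible shape for such a heuristic. The function name, the default widths, the growth factor, and the direction of each adjustment (narrower at shallow depths, narrower under load) are all assumptions.

```python
def beam_width(depth: int, load: float, max_width: int = 512,
               min_width: int = 32, growth: float = 2.0) -> int:
    """Hypothetical beam-width rule driven by generation depth and system load.

    depth: 0-based index of the token level being generated.
    load:  normalized system load, clamped to [0, 1].
    """
    load = min(max(load, 0.0), 1.0)
    depth_width = min_width * growth ** depth  # assumed: widen as decoding goes deeper
    budget = max_width * (1.0 - 0.5 * load)    # assumed: give up half the width at full load
    return int(max(min_width, min(depth_width, budget)))

for depth in range(6):
    print(depth, beam_width(depth, load=0.0), beam_width(depth, load=1.0))
```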

Novel Architectural Elements

LazyAR Decoder: Splits decoder into a parallel prefix stack (independent of previous token) and an autoregressive suffix stack, fused via a gated projection
UA-SID Construction: Hybrid quantization replacing the final VQ layer with hash-based mapping of non-semantic business signals (conversion type, account ID)

Modeling

Base Model: Transformer Decoder (custom LazyAR architecture)

Training Method: Joint Value-Aware Supervised Learning (VSL) and Ranking-Guided Softmax Preference Optimization (RSPO)

Objective Functions:

Purpose: Learn basic user interest distribution weighted by user value.
Formally: VSL loss is weighted cross-entropy of next-token prediction + auxiliary MTP loss for prefix layers.

Purpose: Align generation with high-value (eCPM) rankings.
Formally: RSPO loss minimizes KL-divergence dependent on ranking-based weights M_ij derived from LambdaRank/NDCG deltas.

Purpose: Balance imitation and exploration dynamically.
Formally: Combined loss uses an alignment score A(i) to dynamically weight VSL vs. RSPO per sample.
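The exact RSPO formula is not given in this summary, so the sketch below shows only the LambdaRank-style ingredient it refers to: pair weights M_ij equal to the NDCG change from swapping items i and j, with eCPM as the gain. The weighted pairwise logistic loss is a stand-in for illustration, not the paper's KL-based objective, and the function names and `beta` parameter are assumptions.

```python
import numpy as np

def ndcg_delta_weights(scores, values):
    """|delta NDCG| from swapping items i and j in the ranking induced by `scores`."""
    scores = np.asarray(scores, dtype=float)
    values = np.asarray(values, dtype=float)  # e.g. eCPM per candidate, assumed positive
    order = np.argsort(-scores)
    rank = np.empty_like(order)
    rank[order] = np.arange(len(scores))      # 0-based rank of each item
    discount = 1.0 / np.log2(rank + 2.0)
    ideal = np.sum(np.sort(values)[::-1] / np.log2(np.arange(len(values)) + 2.0))
    gain_diff = values[:, None] - values[None, :]
    disc_diff = discount[:, None] - discount[None, :]
    return np.abs(gain_diff * disc_diff) / ideal

def weighted_pairwise_loss(scores, values, beta=1.0):
    """Stand-in objective: logistic loss on preferred pairs, weighted by M_ij."""
    scores = np.asarray(scores, dtype=float)
    values = np.asarray(values, dtype=float)
    M = ndcg_delta_weights(scores, values)
    prefer = values[:, None] > values[None, :]  # item i should outrank item j
    margin = beta * (scores[:, None] - scores[None, :])
    return float(np.sum(M * prefer * np.logaddexp(0.0, -margin)))

ecpm = [3.0, 2.0, 1.0]
print(weighted_pairwise_loss([3.0, 2.0, 1.0], ecpm))  # scores agree with value order
print(weighted_pairwise_loss([1.0, 2.0, 3.0], ecpm))  # scores reverse the value order
```

Pairs whose swap would move NDCG the most receive the largest weight, which is how a list-wise metric can shape a pair-level gradient.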

Training Data:

Samples from Kuaishou advertising logs
Heterogeneous sources: model-generated lists + logs from other production pipelines

Key Hyperparameters:

lazy_layer_split_K: 2/3 of total layers (typically)
ranking_preference_beta: Not explicitly reported in the paper
mtp_loss_weight: Not explicitly reported in the paper

Compute: Inference: <100ms latency, >500 QPS per NVIDIA L20 GPU

Comparison to Prior Work

vs. TIGER/OneRec: GR4AD uses LazyAR for faster serving and hybrid semantic/hash IDs for business logic
vs. DeepSeek-MTP: LazyAR splits the *main* backbone into parallel/autoregressive parts to save compute, rather than adding speculative heads [not cited in paper]
vs. DPO: RSPO explicitly optimizes list-wise NDCG rather than pairwise preferences

Limitations

The LazyAR design targets beam-search-heavy recommendation workloads and may not benefit standard LLM chat serving, which relies on greedy decoding or sampling
Relies on proprietary business signals (conversion types) for tokenization, making it hard to replicate outside ad settings
No public release of code or datasets

Reproducibility

No replication artifacts are mentioned in the paper. Code is proprietary (Kuaishou Technology). Data consists of private industrial logs. Instructions for UA-SID generation are in the appendix.

📊 Experiments & Results

Evaluation Setup

Large-scale online A/B testing in Kuaishou advertising system and offline replay on industrial logs

Benchmarks:

Kuaishou Online Production (Ad Recommendation; CTR/CPM)
Offline Industrial Dataset (Sequential Recommendation) [New]

Metrics:

RPM (Revenue Per Mille)
CPM (Cost Per Mille)
CTR (Click-Through Rate)
CVR (Conversion Rate)
GAUC (Group AUC)
NDCG
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Online A/B tests show revenue gains over the existing production baseline. Online rows report relative lift, so the baseline is 0.0 by construction.
Kuaishou Online Production	Ad Revenue (RPM, % lift)	0.0	4.2	+4.2
Kuaishou Online Production	CTR	0.0	1.8	+1.8
Ablation studies attribute part of the revenue gain to the RSPO alignment method, measured against supervised fine-tuning.
Kuaishou Online Production	Ad Revenue (% lift)	0.0	1.1	+1.1
Inference results show LazyAR roughly doubles throughput (240 to 510 QPS, about 2.1×).
Inference Latency	Throughput (QPS)	240	510	+270

Experiment Figures

Ablation of K (split point) in LazyAR on model performance (Recall) and latency.

Main Takeaways

LazyAR doubles inference throughput (QPS) compared to standard autoregressive decoding by parallelizing the first 2/3 of layers, without accuracy loss.
RSPO effectively translates business value (eCPM) into gradient signals, improving revenue metrics (+1.1%) over standard supervised learning.
Hybrid Semantic/Hash IDs (UA-SID) are crucial for distinguishing identical creatives with different business targets, resolving collision issues inherent in pure semantic IDs.

📚 Prerequisite Knowledge

Prerequisites

Generative Recommendation (Semantic IDs)
Transformer Decoder Architectures
Reinforcement Learning (Policy Optimization)
Vector Quantization (RQ-VAE / K-means)

Key Terms

UA-SID: Unified Advertisement Semantic ID—a discrete identifier for ads created by hierarchically quantizing multimodal embeddings and hashing non-semantic business signals

LazyAR: Lazy AutoRegression—a decoder architecture that processes initial layers in parallel (ignoring previous tokens) and only introduces autoregressive dependencies in later layers to speed up beam search

RSPO: Ranking-Guided Softmax Preference Optimization—a reinforcement learning algorithm that aligns model probabilities with list-wise ranking metrics (NDCG) based on ad value

VSL: Value-Aware Supervised Learning—a training objective that weights standard next-token prediction loss by the user's long-term value and interaction depth
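A value-weighted next-token loss of this kind might look like the sketch below. How the per-sample weights are derived from user value and interaction depth is not stated in this summary, so `sample_weights` is simply taken as given, and the function name is hypothetical.

```python
import numpy as np

def value_weighted_nll(logits, targets, sample_weights):
    """Weighted mean of per-sample next-token negative log-likelihood."""
    logits = np.asarray(logits, dtype=float)
    # Log-softmax computed in a numerically stable way.
    log_probs = logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)
    nll = -log_probs[np.arange(len(targets)), targets]
    w = np.asarray(sample_weights, dtype=float)
    return float(np.sum(w * nll) / np.sum(w))

logits = [[2.0, 0.0], [0.0, 2.0]]
targets = [0, 0]
print(value_weighted_nll(logits, targets, [1.0, 1.0]))  # both samples count equally
print(value_weighted_nll(logits, targets, [1.0, 0.0]))  # only the first sample counts
```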

eCPM: effective Cost Per Mille—a metric representing the revenue generated per 1,000 ad impressions, used here as the reward signal

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for position and item relevance
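For reference, a minimal NDCG computation with linear gains is sketched below. Using the raw value (for example eCPM) as the gain, rather than an exponential gain, is an assumption of this sketch.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: position i (0-based) is discounted by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains_in_ranked_order):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(gains_in_ranked_order, reverse=True))
    return dcg(gains_in_ranked_order) / ideal if ideal > 0 else 0.0

print(ndcg([3.0, 2.0, 1.0]))  # ideal ordering
print(ndcg([1.0, 2.0, 3.0]))  # reversed ordering scores lower
```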

SFT: Supervised Fine-Tuning—training the model on labeled data (user history) before applying reinforcement learning

Beam Search: A search algorithm that builds sequences step by step, keeping only a fixed number (the beam width) of the highest-scoring partial candidates at each step
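A generic beam search over token sequences can be sketched as follows. The `toy_model` distribution is made up for illustration, and the code is not tied to the paper's serving stack.

```python
import math

def beam_search(next_log_probs, beam_width, depth):
    """Keep the `beam_width` best partial sequences at every step.

    next_log_probs(prefix) returns a dict mapping token -> log-probability.
    """
    beams = [((), 0.0)]
    for _ in range(depth):
        candidates = []
        for prefix, score in beams:
            for token, lp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

def toy_model(prefix):
    """Made-up next-token distribution over two tokens."""
    if not prefix:
        return {0: math.log(0.6), 1: math.log(0.4)}
    if prefix[-1] == 0:
        return {0: math.log(0.5), 1: math.log(0.5)}
    return {0: math.log(0.9), 1: math.log(0.1)}

print(beam_search(toy_model, beam_width=2, depth=2)[0])  # best sequence with beam width 2
print(beam_search(toy_model, beam_width=1, depth=2)[0])  # greedy decoding (beam width 1)
```

With width 2 the search recovers the sequence (1, 0), which has probability 0.36, while greedy decoding commits to token 0 first and ends with probability 0.30.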

RQ-Kmeans: Residual Quantized K-means—a method to compress vectors into discrete codes by recursively clustering residuals
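The recursive clustering of residuals can be sketched as below. This is a toy version under stated assumptions: a plain Lloyd's k-means, random data, and hypothetical function names. It is not the paper's quantizer.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = x[assign == c].mean(0)
    return centers

def rq_kmeans_fit(x, levels, k):
    """Fit one codebook per level, each on the residuals left by the previous level."""
    codebooks, residual = [], x.copy()
    for level in range(levels):
        centers = kmeans(residual, k, seed=level)
        codes = ((residual[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        residual = residual - centers[codes]
        codebooks.append(centers)
    return codebooks

def rq_encode(v, codebooks):
    """Encode one vector as a list of codes, plus the final unexplained residual."""
    codes, residual = [], v.copy()
    for centers in codebooks:
        c = int(((residual - centers) ** 2).sum(-1).argmin())
        codes.append(c)
        residual = residual - centers[c]
    return codes, residual

x = np.random.default_rng(0).normal(size=(200, 4))
codebooks = rq_kmeans_fit(x, levels=3, k=8)
codes, leftover = rq_encode(x[0], codebooks)
print(codes)
```

Each code indexes one level's codebook, so the code list is the kind of hierarchical discrete identifier a semantic ID is built from.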

MTP: Multi-Token Prediction—a training technique (often auxiliary) where the model predicts multiple future tokens at once to improve representation learning

DLRM: Deep Learning Recommendation Model—the standard non-generative architecture for recommendation, typically using embedding tables and MLPs