LEADRE: Multi-Faceted Knowledge Enhanced LLM Empowered Display Advertisement Recommender System

📝 Paper Summary

Display Advertising, Generative Retrieval, LLM Recommendation

LEADRE integrates Large Language Models into a large-scale industrial ad retrieval system by using semantic IDs, intent-aware prompting, and a hybrid latency-tolerant deployment architecture.

Core Problem

Traditional ID-based ad retrieval methods underutilize rich ad content (text/descriptions) and struggle to capture implicit user intent or diverse interests in sparse behavior scenarios.

Why it matters:

ID-based methods create 'information cocoons': they reinforce existing preferences and surface little novel content.
Industrial display advertising lacks explicit queries, making it difficult to infer intent compared to search advertising.
Deploying LLMs at scale (tens of billions of requests per day) runs into tight latency and cost constraints.

Concrete Example: In ID-based systems, a user with sparse history might only see ads similar to past clicks, missing relevant long-tail ads. LEADRE uses LLMs to reason over user profiles and cross-domain behaviors (e.g., news reading) to generate semantically relevant ad candidates that traditional collaborative filtering misses.

Key Novelty

Multi-Faceted Knowledge Enhanced LLM Retrieval (LEADRE)

Uses Semantic IDs (S-IDs) derived from ad text (via RQ-VAE) to bridge the gap between natural language generation and fixed ad inventories.
Constructs intent-aware prompts incorporating long-term/short-term interests and cross-domain behaviors (news, video) to mitigate data sparsity.
Employs a hybrid deployment strategy where LLMs generate candidates asynchronously (latency-tolerant) while a lightweight retrieval service fetches them in real-time.
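To make the intent-aware prompt idea from the list above more concrete, here is a minimal sketch of how such a prompt might be assembled. The summary does not reproduce the paper's actual template, so the `UserContext` fields, the section wording, and the `build_prompt` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UserContext:
    """Hypothetical container for the signals named in the summary."""
    profile: str
    long_term_interests: List[str] = field(default_factory=list)
    short_term_ads: List[str] = field(default_factory=list)  # recent ads as S-ID strings
    cross_domain: List[str] = field(default_factory=list)    # e.g. news or video behaviors


def build_prompt(ctx: UserContext, max_recent: int = 5) -> str:
    """Assemble a text prompt; the wording of each section is illustrative only."""
    # Keep only the most recent ad interactions as the short-term signal.
    recent = ctx.short_term_ads[-max_recent:] if max_recent > 0 else []
    parts = [
        f"User profile: {ctx.profile}",
        "Long-term interests: " + ", ".join(ctx.long_term_interests),
        "Recent ads: " + " ".join(recent),
        "Content behaviors: " + "; ".join(ctx.cross_domain),
        "Predict the next ad as a Semantic ID sequence:",
    ]
    return "\n".join(parts)


if __name__ == "__main__":
    ctx = UserContext(
        profile="age 25-34",
        long_term_interests=["fitness", "travel"],
        short_term_ads=["<a_1><b_2>", "<a_3><b_4>"],
        cross_domain=["read news: marathon training"],
    )
    print(build_prompt(ctx))
```

The cross-domain line is what lets a user with few ad interactions still contribute some signal to the prompt, which is the sparsity argument made above.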

Architecture

The overall LEADRE framework illustrating the three core modules: Intent-Aware Prompt Engineering, Advertising-Specific Knowledge Alignment, and Latency-Aware Model Deployment.

Evaluation Highlights

+1.57% GMV (Gross Merchandise Value) lift on Tencent WeChat Channels in online A/B testing.
+1.17% GMV lift on Tencent WeChat Moments in online A/B testing.
Higher HitRatio and NDCG in offline experiments than SASRec and text-based baselines.

Breakthrough Assessment

8/10

High score for successful industrial deployment handling billions of requests. While LLM retrieval exists in research, deploying it online in high-throughput ad systems with a hybrid latency architecture is a significant engineering and practical contribution.

⚙️ Technical Details

Problem Definition

Setting: Next-item prediction (generative retrieval) in display advertising

Inputs: User behavior sequence S_u (ads and content items), user profile, and target ad inventory

Outputs: A generated ad a_{L+1} (represented as a sequence of Semantic IDs) relevant to the user

Pipeline Flow

Ad Indexing: Text -> Embedding -> RQ-VAE -> Semantic IDs
Offline/Near-line: Intent-Aware Prompting -> LLM Generation -> Semantic IDs
Online Serving: Hybrid retrieval of pre-generated LLM candidates

System Modules

Semantic ID Encoder

Converts ad text content into discrete Semantic IDs (S-IDs) for LLM processing

Model or implementation: Hunyuan (Encoder) + RQ-VAE
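The quantization step that turns an ad embedding into a Semantic ID can be sketched as follows. This covers only the residual lookup, not the full RQ-VAE (whose encoder, decoder, and codebooks are learned). The toy codebooks, the two-level depth, and the `<a_1><b_0>` token format are assumptions made for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: List[Vector], v: Vector) -> int:
    """Index of the codeword with the smallest squared Euclidean distance to v."""
    def dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, v))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def residual_quantize(x: Vector, codebooks: List[List[Vector]]) -> List[int]:
    """Return one code index per level; each level quantizes the previous residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen codeword so the next level encodes what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def to_semantic_id(codes: List[int]) -> str:
    """Render codes as level-tagged tokens (the token format is an assumption)."""
    return "".join(f"<{chr(ord('a') + level)}_{c}>" for level, c in enumerate(codes))


if __name__ == "__main__":
    # Hand-written toy codebooks; a trained RQ-VAE would learn these.
    codebooks = [
        [[0.0, 0.0], [1.0, 1.0]],   # level 1: coarse
        [[0.0, 0.0], [0.2, -0.1]],  # level 2: refinement of the residual
    ]
    print(to_semantic_id(residual_quantize([1.2, 0.9], codebooks)))  # <a_1><b_1>
```

Because the first level captures the coarse position and later levels refine it, ads with similar text tend to share S-ID prefixes, which is what makes the IDs hierarchical.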

LLM Generator

Generates relevant ad S-IDs based on user context

Model or implementation: Hunyuan-1B (fine-tuned)

Latency-Aware Deployer

Decouples generation from real-time serving to meet latency constraints

Model or implementation: Hybrid Architecture (Async Generation + Real-time Lookup)

Novel Architectural Elements

Hybrid service framework decoupling LLM inference (latency-tolerant) from ad retrieval (latency-sensitive) via asynchronous pre-generation and caching
Hierarchical Semantic ID indexing system using RQ-VAE to map ads to LLM-compatible tokens
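The decoupling of LLM inference from real-time retrieval can be pictured with a small in-memory sketch. The `CandidateStore` class, the TTL-based staleness rule, and the fallback function are hypothetical stand-ins; the summary states only that candidates are pre-generated asynchronously and cached.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class CandidateStore:
    """In-memory stand-in for the cache between the LLM path and the serving path."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, Tuple[float, List[str]]] = {}

    def put(self, user_id: str, ad_ids: List[str]) -> None:
        self._data[user_id] = (self._clock(), list(ad_ids))

    def get(self, user_id: str) -> Optional[List[str]]:
        entry = self._data.get(user_id)
        if entry is None:
            return None
        written_at, ad_ids = entry
        if self._clock() - written_at > self._ttl:
            return None  # treat entries older than the TTL as stale
        return ad_ids


def refresh_user(store: CandidateStore, user_id: str,
                 generate: Callable[[str], List[str]]) -> None:
    """Latency-tolerant path: run the slow generator and cache its output."""
    store.put(user_id, generate(user_id))


def retrieve(store: CandidateStore, user_id: str,
             fallback: Callable[[str], List[str]]) -> List[str]:
    """Latency-sensitive path: a cache lookup only; it never calls the LLM."""
    cached = store.get(user_id)
    return cached if cached is not None else fallback(user_id)
```

In this sketch, `refresh_user` would run in a background worker triggered by user activity, while `retrieve` sits on the request path. The TTL makes the trade-off explicit: a longer TTL means fewer LLM calls but staler candidates.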

Modeling

Base Model: Hunyuan-1B

Training Method: Supervised Fine-Tuning (SFT) + Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Reconstruction and quantization of ad embeddings.

Formally: RQ-VAE loss $\mathcal{L} = \|x - \hat{x}\|^2 + \|\mathrm{sg}[z_{enc}] - e\|^2 + \beta \|\mathrm{sg}[e] - z_{enc}\|^2$

Purpose: Next-token prediction for ad generation.

Formally: Autoregressive language modeling loss on S-ID sequences

Purpose: Align generation with business metrics (clicks/conversions).

Formally: DPO loss $\mathcal{L}_{DPO} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right]$
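As a worked illustration of the DPO objective, the per-pair loss can be computed directly from sequence log-probabilities. This is a minimal sketch: the function names and the default `beta=0.1` are assumptions, and how preferred and rejected ads are paired (for example, converted versus non-converted) is not specified in this summary.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (preferred, rejected) pair.

    logp_* are log pi(y|x) under the policy being trained;
    ref_logp_* are the same quantities under the frozen reference model.
    """
    chosen_ratio = logp_w - ref_logp_w    # log(pi(y_w|x) / pi_ref(y_w|x))
    rejected_ratio = logp_l - ref_logp_l  # log(pi(y_l|x) / pi_ref(y_l|x))
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is 0 and the loss is log(2).
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy raises the preferred ad and lowers the rejected one: the loss drops.
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The loss falls only when the policy favors the preferred ad more strongly than the reference model does, which keeps the fine-tuned model anchored to its SFT starting point.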

Adaptation: Full fine-tuning (implied by industrial scale context)

Training Data:

User behavior logs from Tencent advertising system
Enriched with cross-domain data (news/video interaction logs)

Compute: Deployed on clusters handling tens of billions of requests daily; inference optimized with TensorRT-LLM

Comparison to Prior Work

vs. SASRec/GRU4Rec: Uses generative LLM with semantic IDs instead of discriminative matching on arbitrary IDs
vs. TIGER: Integrates multi-faceted knowledge (long/short term, cross-domain) and uses DPO for business alignment
vs. LC-Rec: Deploys in a high-concurrency industrial hybrid architecture rather than pure offline or low-scale online settings
vs. P5 [not cited in paper]: Focuses on industrial display ad retrieval with DPO alignment, whereas P5 is a multi-task unified framework for recommendation

Limitations

Relies heavily on high-quality text metadata for ads; performance may degrade if ad descriptions are poor
Hybrid deployment introduces a freshness delay; real-time user actions might not immediately reflect in the async LLM generation path
Computational cost of maintaining 1B+ LLMs for billions of users is significant compared to lightweight ID embeddings

Reproducibility

Code not provided. Implementation relies on proprietary Tencent infrastructure and data (WeChat Channels/Moments logs). Models (Hunyuan) are proprietary.

📊 Experiments & Results

Evaluation Setup

Industrial Display Advertising (Offline dataset + Online A/B Test)

Benchmarks:

Industrial Ad Dataset (Sequential Recommendation / Ad Retrieval) [New]

Metrics:

HitRatio@K
NDCG@K
GMV (Gross Merchandise Value)
Stay Duration
Click-Through Rate (CTR)
Statistical methodology: Not explicitly reported in the paper
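For the two offline metrics listed above, a minimal sketch of the computation in the next-item setting is shown below. It assumes exactly one relevant (held-out) ad per test case, which is a common protocol for sequential recommendation; the paper's exact evaluation procedure is not given in this summary, and the function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def hit_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """1.0 if the held-out item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """With a single relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    for pos, item in enumerate(ranked[:k]):
        if item == target:
            return 1.0 / math.log2(pos + 2)  # pos is 0-based, rank is pos + 1
    return 0.0


def evaluate(cases: List[Tuple[Sequence[str], str]], k: int) -> Tuple[float, float]:
    """Average HitRatio@k and NDCG@k over (ranked_list, target) test cases."""
    n = len(cases)
    hr = sum(hit_at_k(r, t, k) for r, t in cases) / n
    ndcg = sum(ndcg_at_k(r, t, k) for r, t in cases) / n
    return hr, ndcg


if __name__ == "__main__":
    cases = [
        (["ad_a", "ad_b", "ad_c"], "ad_a"),  # hit at rank 1
        (["ad_a", "ad_b", "ad_c"], "ad_c"),  # hit at rank 3
        (["ad_a", "ad_b", "ad_c"], "ad_z"),  # miss
    ]
    print(evaluate(cases, k=3))
```

HitRatio only asks whether the target made the list, while NDCG also rewards placing it near the top, so the two can diverge when a model retrieves the right ads but ranks them low.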

Key Results

Online A/B tests demonstrate significant business value improvements across two major platforms. All values are relative lifts over the control group.

Platform	Metric	Lift
WeChat Channels	GMV	+1.57%
WeChat Moments	GMV	+1.17%
WeChat Channels	Stay Duration	+1.10%

Ranking enhancement experiments show that using retrieved ads as features in downstream ranking further improves performance.

Platform	Metric	Lift
WeChat Channels	GMV	+1.43%

Experiment Figures

Trie-Tree Construction and Constrained Decoding logic.
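The trie-constrained decoding idea named in this figure can be sketched in a few lines. Here the LLM is replaced by a toy `score` function, and greedy search stands in for whatever search strategy the production system uses (beam search is typical for generative retrieval); both are assumptions, as are the token names.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Trie = Dict[str, dict]


def build_trie(sequences: Iterable[Sequence[str]]) -> Trie:
    """Insert every valid S-ID token sequence into a nested-dict prefix tree."""
    root: Trie = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root


def allowed_next(trie: Trie, prefix: Sequence[str]) -> List[str]:
    """Tokens that keep the prefix on a path leading to at least one real ad."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []
    return sorted(node)


def constrained_greedy_decode(trie: Trie,
                              score: Callable[[Sequence[str], str], float],
                              length: int) -> Tuple[str, ...]:
    """At each step, pick the best-scoring token among the trie-allowed ones."""
    prefix: List[str] = []
    for _ in range(length):
        options = allowed_next(trie, prefix)
        if not options:
            break
        prefix.append(max(options, key=lambda tok: score(prefix, tok)))
    return tuple(prefix)


if __name__ == "__main__":
    ads = [("a_1", "b_2", "c_3"), ("a_1", "b_5", "c_0"), ("a_4", "b_2", "c_9")]
    trie = build_trie(ads)
    toy_scores = {"a_1": 0.6, "a_4": 0.4, "b_2": 0.1, "b_5": 0.9}
    # Stand-in for the LLM's next-token score; unknown tokens get 0.0.
    score = lambda prefix, tok: toy_scores.get(tok, 0.0)
    print(constrained_greedy_decode(trie, score, length=3))
```

Since every step is restricted to children of the current trie node, the decoded sequence always maps to an ad in the inventory, even if the unconstrained model would have preferred a token combination that no ad uses.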

Main Takeaways

Generative retrieval using LLMs significantly outperforms traditional ID-based methods (SASRec, etc.) in capturing user interest, especially for long-tail content.
Direct Preference Optimization (DPO) is crucial for aligning LLM outputs with commercial goals (GMV), not just semantic relevance.
The hybrid deployment architecture effectively balances the high latency of LLMs with the low latency requirements of online advertising.

📚 Prerequisite Knowledge

Prerequisites

Basics of Recommender Systems (Retrieval/Ranking)
Large Language Models (LLMs) and Prompt Engineering
Vector Quantization (VQ-VAE / RQ-VAE)

Key Terms

Semantic IDs (S-IDs): Discrete tokens representing an item (ad) derived from its semantic content (text) rather than an arbitrary integer ID

RQ-VAE: Residual-Quantized Variational AutoEncoder—a model that compresses high-dimensional embeddings into a sequence of discrete codes (tokens) used as Semantic IDs

DPO: Direct Preference Optimization—a method to align language models with human/business preferences without a separate reward model

Trie-Tree: A prefix tree data structure used here to constrain the LLM's generation, ensuring it only produces valid sequences of Semantic IDs corresponding to real ads

GMV: Gross Merchandise Value, the total monetary value of merchandise sold over a given period; here it is the business metric reported in the online A/B tests

HitRatio: A metric measuring the proportion of test cases in which the relevant item appears among the top-K recommended items

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items

TensorRT: A high-performance deep learning inference optimizer and runtime library developed by NVIDIA

information cocoon: A situation where users are exposed only to information that reinforces their existing views or interests, limiting diversity

constrained decoding: Forcing an LLM to generate tokens only from a valid set of options (e.g., valid ad IDs) rather than open-ended text