From IDs to Semantics: A Generative Framework for Cross-Domain Recommendation with Adaptive Semantic Tokenization

📝 Paper Summary

Cross-Domain Recommendation, Generative Recommendation

GenCDR replaces traditional item IDs with generative semantic IDs that dynamically fuse universal meaning and domain-specific traits, enabling effective cross-domain recommendation without shared identifiers.

Core Problem

Traditional cross-domain recommendation relies on shared IDs which are often unavailable, while current LLM-based methods suffer from tokenization gaps (vocabulary explosion) and fail to disentangle universal vs. domain-specific interests.

Why it matters:

Real-world platforms often lack aligned user/item IDs across domains (e.g., online content vs. offline services), rendering ID-based methods ineffective.
Existing LLM methods treat items as plain text or rigid indices, missing the nuanced evolution of user interests where the same concept (e.g., 'Apple') has different meanings across domains (tech vs. food).

Concrete Example: An 'Apple Watch' in a tech domain and a fresh 'Apple' in a lifestyle domain share the concept 'Apple'. Standard methods might confuse them or fail to capture that 'health' is key for the watch while 'sweet' is key for the fruit. GenCDR disentangles these by generating distinct but related semantic IDs.

Key Novelty

Generative Cross-Domain Recommendation with Domain-Adaptive Semantic IDs

Replaces arbitrary item IDs with discrete 'Semantic IDs' (SIDs) derived from text, ensuring items with similar meanings have similar codes even across domains.
Uses a 'Domain-Adaptive Tokenization' module that generates these SIDs by routing between a frozen universal encoder (capturing shared semantics) and domain-specific adapters (capturing unique traits).
Models user interests by dynamically fusing universal and domain-specific prediction distributions during the autoregressive generation process.

Architecture

The complete GenCDR framework, illustrating the two main parallel tracks: Item Tokenization (left) and User Recommendation (right), and their convergence during inference.

Evaluation Highlights

Reported to significantly outperform state-of-the-art baselines on multiple real-world datasets; the exact improvement figures are not given in the source text.
Effectively resolves the item tokenization dilemma by generating compact, transferable semantic IDs instead of expanding vocabulary.
Demonstrates superior generalization by effectively transferring knowledge even when user/item overlaps are sparse or non-existent.

Breakthrough Assessment

8/10

Novel approach to the 'ID problem' in cross-domain recommendation by moving to generative semantic IDs. The dual disentanglement (item-level and user-level) is theoretically sound and addresses a major bottleneck in LLM-based RecSys.

⚙️ Technical Details

Problem Definition

Setting: Cross-Domain Sequential Recommendation (CDSR) where user histories span multiple domains without necessarily sharing item IDs.

Inputs: A user's chronological interaction sequence across multiple domains, represented as semantic IDs.

Outputs: The next item (semantic ID sequence) the user is most likely to interact with in a target domain.

Pipeline Flow

Domain-adaptive Tokenization (Offline): Item Text → Universal Encoder + Domain Adapter → Router → Semantic IDs
Cross-Domain Autoregressive Recommendation (Training/Inference): User History (SIDs) → Universal LLM + Domain Adapter → Router → Next SID Prediction
Domain-aware Prefix-tree (Inference): Logits → Tree Constraint → Valid Item ID

System Modules

Universal Discrete Semantic Encoder (Tokenization)

Generate domain-agnostic semantic codes from item text

Model or implementation: RQ-VAE (Encoder-Decoder with Residual Quantization)
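
The residual quantization step at the core of an RQ-VAE can be sketched as follows. This is a minimal illustration, not the paper's implementation: the encoder and decoder are omitted, the codebooks are random rather than learned, and the names `residual_quantize`, `DIM`, `LEVELS`, and `CODEBOOK_SIZE` are hypothetical.

```python
import random

def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

def residual_quantize(vector, codebooks):
    """Quantize level by level; each level encodes what the previous levels left over."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual

def reconstruct(codes, codebooks):
    """Sum the selected code vectors to approximate the original embedding."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical toy setup: random codebooks stand in for learned ones.
rng = random.Random(0)
DIM, LEVELS, CODEBOOK_SIZE = 4, 3, 8
codebooks = [
    [[rng.uniform(-1, 1) / (2 ** level) for _ in range(DIM)]
     for _ in range(CODEBOOK_SIZE)]
    for level in range(LEVELS)
]
item_embedding = [rng.uniform(-1, 1) for _ in range(DIM)]

# The list of code indices plays the role of the item's semantic ID.
semantic_id, leftover = residual_quantize(item_embedding, codebooks)
print(semantic_id)
```

Because each level quantizes only the residual of the previous one, items with similar embeddings tend to share their leading codes, which is what makes such IDs usable as a compact shared vocabulary.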

Domain-specific Semantic Adapters (Tokenization)

Refine universal representations to capture domain-specific features (e.g., visual style vs. narrative style)

Model or implementation: LoRA modules attached to the Universal Encoder
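
To make the adapter idea concrete, here is a minimal sketch of a LoRA-style forward pass with one low-rank update per domain on top of a shared frozen weight. The matrices, the domain names, and the helper `lora_forward` are illustrative assumptions; the summary does not specify ranks, scaling, or which layers are adapted.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(x, W, A, B, alpha):
    """Compute y = W x + (alpha / r) * B (A x), where W is frozen and A, B are trainable.

    Shapes: W is (out, in), A is (r, in), B is (out, r).
    """
    r = len(A)
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

# Hypothetical frozen universal weight shared by all domains.
W = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical per-domain adapters; a zero B makes the adapter a no-op,
# which matches the usual LoRA initialization.
adapters = {
    "books": {"A": [[1.0, 1.0]], "B": [[1.0], [2.0]]},
    "movies": {"A": [[1.0, -1.0]], "B": [[0.0], [0.0]]},
}

x = [1.0, 2.0]
outputs = {
    domain: lora_forward(x, W, p["A"], p["B"], alpha=1.0)
    for domain, p in adapters.items()
}
print(outputs)
```

Only `A` and `B` would be updated during domain adaptation, so the universal encoder stays unchanged and each domain adds a small number of parameters.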

Item-level Dynamic Semantic Router (Tokenization)

Fuse universal and domain-specific representations based on item characteristics

Model or implementation: Lightweight MLP gate
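
One plausible form of such a gate is sketched below, assuming a scalar gate produced by a one-hidden-layer MLP over the concatenated representations and a convex combination of the two vectors. The exact gating function is not given in the summary, so `mlp_gate`, `fuse`, and all weights are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_gate(features, hidden_weights, hidden_bias, out_weights, out_bias):
    """One-hidden-layer MLP that maps features to a gate value in (0, 1)."""
    hidden = [
        math.tanh(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(hidden_weights, hidden_bias)
    ]
    return sigmoid(sum(w * h for w, h in zip(out_weights, hidden)) + out_bias)

def fuse(universal, domain_specific, gate):
    """Convex combination: gate=1 keeps only the universal representation."""
    return [gate * u + (1.0 - gate) * d for u, d in zip(universal, domain_specific)]

# Hypothetical item representations from the universal encoder and a domain adapter.
universal = [1.0, 3.0]
domain_specific = [3.0, 1.0]

# Hypothetical gate parameters (these would be learned in practice).
hidden_weights = [[0.5, -0.5, 0.1, 0.2], [0.0, 0.3, -0.2, 0.1]]
hidden_bias = [0.0, 0.0]
out_weights = [1.0, -1.0]
out_bias = 0.0

gate = mlp_gate(universal + domain_specific, hidden_weights, hidden_bias,
                out_weights, out_bias)
fused = fuse(universal, domain_specific, gate)
print(gate, fused)
```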

Universal Interest Modeling Network (Recommendation)

Capture transferable user interest patterns across domains

Model or implementation: LLM with Mixture-of-LoRA adapters

Domain-specific User Adapters (Recommendation)

Capture domain-specific user preferences

Model or implementation: Dedicated LoRA adapter per domain

User-level Dynamic Interest Router (Recommendation)

Dynamically weight the universal and domain-specific logits during generation

Model or implementation: Lightweight MLP gate
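
The effect of the router on a single decoding step can be illustrated with a small sketch. It assumes the gate is already computed and that fusion is a weighted sum of logits followed by a softmax; the actual fusion rule in GenCDR may differ, and the logit values are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_logits(universal_logits, domain_logits, gate):
    """Weight the two branches; gate=1 trusts only the universal branch."""
    return [gate * u + (1.0 - gate) * d
            for u, d in zip(universal_logits, domain_logits)]

# Hypothetical next-token logits over a 3-token SID vocabulary.
universal_logits = [2.0, 0.5, 0.1]
domain_logits = [0.2, 1.8, 0.1]

fused_probs = {
    gate: softmax(fuse_logits(universal_logits, domain_logits, gate))
    for gate in (0.0, 0.5, 1.0)
}
for gate, probs in fused_probs.items():
    print(gate, [round(p, 3) for p in probs])
```

In this toy case the preferred token switches as the gate moves from the domain-specific branch to the universal branch, which is the behavior a per-step router is meant to control.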

Novel Architectural Elements

Symmetric dynamic routing networks at both item-tokenization level and user-preference level to disentangle universal vs. specific knowledge.
Integration of generative semantic IDs (RQ-VAE) specifically into a cross-domain LLM framework.
Domain-aware prefix-tree decoding to constrain LLM generation to valid cross-domain item IDs.
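
Prefix-tree constrained decoding can be sketched in a few lines. The example below is an assumption-laden illustration: it keeps one tree per domain, uses greedy decoding instead of beam search, and replaces the LLM with a hypothetical `score_fn`; the catalog contents are invented.

```python
def build_prefix_tree(valid_sids):
    """Build a nested-dict trie from the semantic IDs of real items."""
    root = {}
    for sid in valid_sids:
        node = root
        for token in sid:
            node = node.setdefault(token, {})
    return root

def allowed_next_tokens(tree, prefix):
    """Return the tokens that keep `prefix` on a path to a real item."""
    node = tree
    for token in prefix:
        node = node.get(token)
        if node is None:
            return set()
    return set(node)

def constrained_greedy_decode(tree, score_fn, length):
    """Pick the best-scoring token at each step among the allowed ones only."""
    prefix = []
    for _ in range(length):
        candidates = allowed_next_tokens(tree, prefix)
        if not candidates:
            break
        prefix.append(max(candidates, key=lambda t: score_fn(prefix, t)))
    return tuple(prefix)

# Hypothetical catalogs: one prefix tree per domain.
catalog = {
    "books": [(1, 2, 3), (1, 4, 5), (2, 0, 1)],
    "movies": [(1, 2, 7), (3, 3, 3)],
}
trees = {domain: build_prefix_tree(sids) for domain, sids in catalog.items()}

# Stand-in for model logits: simply prefer larger token ids.
def score_fn(prefix, token):
    return token

decoded = constrained_greedy_decode(trees["books"], score_fn, length=3)
print(decoded)
```

Since every step is restricted to children of the current trie node, the decoded sequence always corresponds to an item in the target domain's catalog, regardless of what the unconstrained model would have preferred.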

Modeling

Base Model: Large Language Model (the source text refers only to a generic 'LLM' and does not name a specific architecture such as Llama or Baichuan)

Training Method: Two-phase parameter-efficient fine-tuning (Universal phase then Domain-specific phase)

Objective Functions:

Purpose: Train universal item tokenizer.
Formally: L_pretrain = L_REC (reconstruction) + mu * L_Q (quantization) + lambda * L_MTM (masked token modeling).

Purpose: Adapt tokenizer to domains.
Formally: Self-supervised reconstruction loss + VIB loss (KL divergence) on router.

Purpose: Train universal recommendation model.
Formally: Standard autoregressive language modeling loss on aggregated SIDs.

Purpose: Train domain-specific recommendation adapters.
Formally: Autoregressive loss on domain-specific sequences.
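
The two objectives described only in words above might be written as follows. This is a sketch under assumed notation, not the paper's equations: each item t is taken to be encoded as K codes c_{t,1}, ..., c_{t,K}, theta denotes the model parameters, q(z | x) and p(z) are the usual VIB posterior and prior, and the weight beta is hypothetical.

```latex
% Autoregressive loss over semantic-ID tokens (assumed notation).
\mathcal{L}_{\mathrm{AR}}
  = -\sum_{t=1}^{T} \sum_{k=1}^{K}
    \log p_{\theta}\!\left(c_{t,k} \mid c_{<t},\, c_{t,<k}\right)

% Tokenizer adaptation: reconstruction plus a VIB (KL) term on the router.
% beta is a hypothetical trade-off weight.
\mathcal{L}_{\mathrm{adapt}}
  = \mathcal{L}_{\mathrm{REC}}
  + \beta \, \mathrm{KL}\!\left(q(z \mid x) \,\|\, p(z)\right)
```

Under this reading, the universal and domain-specific recommendation phases share the same autoregressive form and differ only in which sequences (aggregated or single-domain) and which parameters (shared or per-domain adapters) are involved.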

Adaptation: LoRA (Low-Rank Adaptation) for both item encoding and user modeling

Key Hyperparameters:

Statistical methodology: Not explicitly reported in the paper

Comparison to Prior Work

vs. C2DSR/TriCDR: GenCDR does not require shared item IDs, using semantic text-derived IDs instead.
vs. TIGER/LC-Rec: GenCDR extends generative recommendation to multi-domain settings with explicit disentanglement of domain-shared and domain-specific semantics.
vs. LLM4CDSR: GenCDR uses discrete Semantic IDs to solve the vocabulary explosion problem and employs adaptive routing for better personalization.

Limitations

Dependency on rich textual descriptions for items to generate high-quality Semantic IDs.
Complexity of the two-stage training process (Tokenization training + Recommendation training).
Inference overhead from the dynamic routing and prefix-tree constraints, though tree search is optimized.

Reproducibility

Code: https://github.com/hupeiyu21/GenCDR

The code is publicly available, and the paper describes the full training pipeline, including loss functions and architectural components.

📊 Experiments & Results

Evaluation Setup

Cross-Domain Sequential Recommendation predicting next item in target domain given cross-domain history.

Benchmarks:

Real-world cross-domain datasets (Sequential Recommendation)

Metrics:

Not explicitly listed in the source text (Recall@K and NDCG@K are typical for this task)
Statistical methodology: Not explicitly reported in the paper
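
For reference, the two metrics commonly used for next-item prediction can be computed as in the sketch below. It assumes one ground-truth item per test sequence, which is the usual sequential-recommendation setup; whether the paper uses exactly these definitions is not confirmed by the source text, and the item names are invented.

```python
import math

def recall_at_k(ranked_items, target, k):
    """1.0 if the single ground-truth item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target)  # 0-based position in the ranking
    return 1.0 / math.log2(rank + 2)

# Hypothetical ranked recommendations and ground-truth next item.
ranked = ["item_a", "item_b", "item_c"]
target = "item_b"

print(recall_at_k(ranked, target, k=1))
print(recall_at_k(ranked, target, k=2))
print(ndcg_at_k(ranked, target, k=2))
```

Both values are averaged over all test users to obtain the reported Recall@K and NDCG@K.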

Main Takeaways

GenCDR is reported to significantly outperform state-of-the-art baselines in accuracy and generalization (quantitative values are not given in the source text).
The Domain-adaptive Tokenization effectively creates transferable IDs that capture semantic similarity across domains.
The dynamic routing mechanism successfully prevents negative transfer by filtering irrelevant domain-specific information.
The prefix-tree decoding strategy ensures 100% valid item generation, addressing a common failure mode in LLM-based recommendation.

📚 Prerequisite Knowledge

Prerequisites

Generative Recommendation (autoregressive prediction of item identifiers)
Vector Quantization (specifically RQ-VAE for discrete coding)
Parameter-Efficient Fine-Tuning (LoRA)
Variational Information Bottleneck (VIB) principle

Key Terms

Semantic IDs (SIDs): Discrete token sequences representing items, derived from their semantic content (e.g., text) rather than arbitrary integers.

RQ-VAE: Residual-Quantized Variational Autoencoder—a model that compresses high-dimensional vectors into a sequence of discrete codes.

LoRA: Low-Rank Adaptation—a technique to fine-tune large models by injecting small, trainable low-rank matrices while freezing the main weights.

VIB: Variational Information Bottleneck—a regularization method that forces a model to learn a compressed representation retaining only task-relevant information.

Prefix-tree: A data structure used during inference to constrain the LLM's output to only valid sequences of tokens that correspond to real items.