An Unified Search and Recommendation Foundation Model for Cold-Start Scenario

📝 Paper Summary

Multi-Domain Learning | Cross-Domain Recommendation | Large Language Models for Recommendation

A unified foundation model leverages LLM-extracted invariant text features, adaptive gating fusion, and domain-adaptive multi-task learning to transfer knowledge effectively to cold-start search and recommendation tasks.

Core Problem

Jointly modeling search and recommendation is difficult due to data imbalance, item heterogeneity across domains, and negative transfer, resulting in poor performance for cold-start scenarios.

Why it matters:

Single-domain models fail to capture user intent comprehensively because interactions are fragmented across apps
Cold-start scenarios (new services/products) lack sufficient interaction data, making traditional ID-based recommendation ineffective
Naive multi-task learning often suffers from negative transfer where dominant tasks degrade the performance of smaller tasks

Concrete Example: In a 'Super App' like Alipay, a user might click a service card in a recommendation feed. When they later search for that service, a standard model trained only on search data lacks the signal from the recommendation interaction, failing to rank the item correctly.

Key Novelty

S&R Multi-Domain Foundation Model

Uses LLMs to extract domain-invariant text features from queries and items, bridging the semantic gap between heterogeneous domains
Introduces Aspect Gating Fusion to dynamically weight the importance of ID, text, and sparse features based on domain contexts
Employs a Domain Adaptive Layer with Jensen-Shannon divergence regularization to align feature distributions across different domains in a shared vector space

Architecture

The complete S&R Multi-Domain Foundation Model architecture.

Evaluation Highlights

Achieves +17.54% relative gain in PVCTR (Page View Click-Through Rate) over single-domain DNN baseline in an online A/B test for Service Card Recommendation
Outperforms SOTA multi-task baselines (PLE, MMoE) on 4 out of 7 industrial datasets, with AUC gains up to +0.0404 on Content Query Recommendation
Fine-tuning the foundation model improves AUC by +0.0279 on cold-start Content Query Recommendation compared to training from scratch

Breakthrough Assessment

7/10

Strong industrial application combining LLMs with multi-domain learning for practical cold-start gains. While the architectural components (MMoE, Domain Adaptation) are known, their specific integration with LLM features for S&R unification is novel and effective.

⚙️ Technical Details

Problem Definition

Setting: Multi-domain multi-task learning across K domains, predicting click-through rate (CTR) and query-item relevance

Inputs: User history U, Query Q (explicit or empty), Item I

Outputs: Probability of click p(y_ctr=1) and probability of relevance p(y_sim=1)

Pipeline Flow

User-Query-Item Encoding (ID + Sparse + LLM Text)
Aspect Gating Fusion (Merges ID, Text, Sparse)
Domain Adaptive Layer (Aligns distributions via Regularization)
Multi-Task Prediction Heads (CTR + Relevance)

System Modules

Feature Encoder

Converts raw inputs into embeddings

Model or implementation: Hybrid (Embedding Tables + LLM)

Aspect Gating Fusion

Dynamically merges ID, Text, and Sparse representations

Model or implementation: Domain-Specific Gating Network
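
To make the gating idea concrete, here is a minimal sketch of one way a domain-specific gate could weight the three aspects. It is not the paper's implementation: the class name `AspectGate`, the single linear gate per domain, the softmax normalization, and the random initialization are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class AspectGate:
    """Hypothetical gate: one linear scorer per domain over the concatenated aspects."""

    def __init__(self, num_domains, dim, num_aspects=3, seed=0):
        rng = random.Random(seed)
        # gate_weights[d][a] scores aspect a in domain d from the concatenated input.
        self.gate_weights = [
            [[rng.uniform(-0.1, 0.1) for _ in range(num_aspects * dim)]
             for _ in range(num_aspects)]
            for _ in range(num_domains)
        ]

    def fuse(self, domain, aspects):
        """aspects: [id_emb, text_emb, sparse_emb], all vectors of the same length."""
        concat = [v for emb in aspects for v in emb]
        logits = [sum(w * x for w, x in zip(row, concat))
                  for row in self.gate_weights[domain]]
        weights = softmax(logits)  # one weight per aspect, summing to 1
        dim = len(aspects[0])
        fused = [sum(weights[a] * aspects[a][k] for a in range(len(aspects)))
                 for k in range(dim)]
        return fused, weights


gate = AspectGate(num_domains=2, dim=4)
id_emb, text_emb, sparse_emb = [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]
fused, weights = gate.fuse(domain=0, aspects=[id_emb, text_emb, sparse_emb])
```

Because each domain has its own gate parameters, a cold-start domain with sparse ID signals could in principle learn to put more weight on the text aspect than a data-rich domain does.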

Domain Adaptive Layer

Maps inputs from different domains to a common vector space

Model or implementation: Linear Transformation + Regularization
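
As a rough illustration of mapping domain inputs into a common space, the sketch below applies a separate linear map per domain. The domain names, matrix values, and dimensions are hypothetical placeholders; the summary only states that the layer is a linear transformation with regularization.

```python
def domain_adapt(x, weight, bias):
    """Apply x_hat = W x + b for one domain's linear map."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


# Hypothetical per-domain parameters mapping 3-d inputs into a shared 2-d space.
params = {
    "search": {"weight": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], "bias": [0.0, 0.0]},
    "rec": {"weight": [[0.5, 0.5, 0.0], [0.0, 0.0, 2.0]], "bias": [0.1, -0.1]},
}

x = [1.0, 2.0, 3.0]
shared = {name: domain_adapt(x, p["weight"], p["bias"])
          for name, p in params.items()}
```

The regularization term is what pushes the outputs of these per-domain maps toward similar distributions; the linear map alone does not guarantee alignment.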

Multi-Task Head

Predicts task-specific targets

Model or implementation: MMoE (Multi-gate Mixture-of-Experts)
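
The following is a small, untrained sketch of the general MMoE pattern (shared experts, one softmax gate and one tower per task), included to show how the CTR and relevance heads can share experts. The class name `TinyMMoE`, the single-layer experts, and all sizes are illustrative assumptions rather than the paper's configuration.

```python
import math
import random


def linear(x, weight):
    """Matrix-vector product with weight given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TinyMMoE:
    """Shared ReLU experts, plus a gate and a one-unit tower for each task."""

    def __init__(self, in_dim, expert_dim, num_experts, tasks, seed=0):
        rng = random.Random(seed)
        self.experts = [rand_matrix(rng, expert_dim, in_dim) for _ in range(num_experts)]
        self.gates = {t: rand_matrix(rng, num_experts, in_dim) for t in tasks}
        self.towers = {t: rand_matrix(rng, 1, expert_dim) for t in tasks}

    def forward(self, x):
        expert_out = [[max(0.0, a) for a in linear(x, w)] for w in self.experts]
        preds = {}
        for task, gate_w in self.gates.items():
            mix = softmax(linear(x, gate_w))  # task-specific weights over experts
            combined = [sum(m * e[k] for m, e in zip(mix, expert_out))
                        for k in range(len(expert_out[0]))]
            logit = linear(combined, self.towers[task])[0]
            preds[task] = 1.0 / (1.0 + math.exp(-logit))  # sigmoid probability
        return preds


model = TinyMMoE(in_dim=4, expert_dim=3, num_experts=2, tasks=["ctr", "sim"])
preds = model.forward([0.2, -0.1, 0.4, 0.3])
```

Each task mixes the same expert outputs with its own gate, which is the mechanism MMoE relies on to share knowledge while limiting interference between tasks.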

Novel Architectural Elements

Aspect Gating Fusion with Domain-Specific Gating to balance ID vs. LLM features dynamically per domain
Integration of Jensen-Shannon Divergence regularization within a Multi-Task Learning framework for S&R unification

Modeling

Base Model: Custom architecture combining Embedding Tables, LLM (BERT/ChatGLM), and MMoE

Training Method: Pretrain on multi-domain data, then Supervised Fine-Tuning (SFT) on target domain

Objective Functions:

Purpose: Predict Click-Through Rate.

Formally: L_ctr = Sum(CrossEntropy(f_theta(u,q,i), y_ctr))

Purpose: Predict Query-Item Relevance (Search only).

Formally: L_sim = Sum(CrossEntropy(f_phi(q,i), y_sim))

Purpose: Align feature distributions across domains.

Formally: L_reg = Sum(JS_Divergence(p(x_hat_i) || p(x_hat_j)))
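
The three objectives above can be sketched in a few lines. This is an illustrative reading, not the paper's code: it assumes each domain's distribution p(x_hat) is available as a discrete histogram, sums the Jensen-Shannon term over unordered domain pairs, and combines the terms with a hypothetical weight `reg_weight`, none of which is specified in the summary.

```python
import math
from itertools import combinations


def binary_cross_entropy(p, y, eps=1e-12):
    """Cross-entropy for one prediction p in (0, 1) and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """JS(p || q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m the mixture of p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def total_loss(ctr_preds, ctr_labels, sim_preds, sim_labels, domain_hists,
               reg_weight=0.1):
    """L_ctr + L_sim + reg_weight * L_reg (reg_weight is an assumed hyperparameter)."""
    l_ctr = sum(binary_cross_entropy(p, y) for p, y in zip(ctr_preds, ctr_labels))
    l_sim = sum(binary_cross_entropy(p, y) for p, y in zip(sim_preds, sim_labels))
    l_reg = sum(js_divergence(p, q) for p, q in combinations(domain_hists, 2))
    return l_ctr + l_sim + reg_weight * l_reg


loss = total_loss(
    ctr_preds=[0.9, 0.2], ctr_labels=[1, 0],
    sim_preds=[0.7], sim_labels=[1],
    domain_hists=[[0.5, 0.5], [0.6, 0.4], [0.4, 0.6]],
)
```

With natural logarithms the Jensen-Shannon divergence lies between 0 (identical distributions) and ln 2 (disjoint support), so the regularizer stays bounded even when two domains barely overlap.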

Adaptation: Fine-tuning specific layers (freezing L0/L1 works best)

Trainable Parameters: Not reported in the paper

Training Data:

7 Industrial datasets from Alipay (Search, Rec, S/R mixed)
Sizes range from 0.76M to 146M samples

Key Hyperparameters:

LLM_output_dimension: 4096 (reduced to 32)
aspects: 3 (ID, Text, Sparse)
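
The summary does not say how the 4096-dimensional LLM output is reduced to 32 dimensions. One common option is a linear projection, sketched below as an assumption; the function names, the random initialization, and the placeholder input vector are hypothetical, and in a trained model the projection weights would be learned rather than sampled.

```python
import random


def make_projection(in_dim=4096, out_dim=32, seed=0):
    """Random out_dim x in_dim matrix standing in for a learned projection."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5  # keeps output magnitudes comparable to the input's
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(text_embedding, projection):
    """Map a high-dimensional text embedding to the compact feature space."""
    return [sum(w * x for w, x in zip(row, text_embedding))
            for row in projection]


projection = make_projection()
llm_vector = [0.01] * 4096  # placeholder for a real LLM text embedding
compact = project(llm_vector, projection)
```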

Compute: Not reported in the paper

Comparison to Prior Work

vs. PLE/MMoE: Incorporates LLM-based invariant features and explicit domain-adaptive regularization (JS Divergence)
vs. STAR: Uses Aspect Gating to handle heterogeneous feature reliability (e.g., ID vs Text) rather than just topology adaptation
vs. JSR [not cited in paper]: Jointly models S&R via shared item sets but lacks the LLM-driven invariant feature extraction

Limitations

Relies on proprietary datasets, limiting reproducibility
Computation cost of LLM inference during serving/training is not detailed
Performance gain varies significantly across tasks (Task 2 shows minor regression vs MMoE)

Reproducibility

Code not provided. Datasets are proprietary industrial logs from Alipay. LLM backbones (BERT, ChatGLM) are open source, but the full S&R pipeline implementation is not.

📊 Experiments & Results

Evaluation Setup

Offline evaluation on 7 industrial datasets; Online A/B testing

Benchmarks:

Alipay Search & Rec Tasks (1-7) (CTR Prediction and Query Relevance) [New]

Metrics:

AUC (Offline)
PVCTR (Online)
Statistical methodology: Not explicitly reported in the paper
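
For readers who want to see how the two metrics are computed, here is a small sketch. AUC is computed by direct pairwise comparison (fine for small samples, quadratic in general), and PVCTR follows the clicks-per-page-view definition. The click and page-view counts in the example are made up for illustration and are not results from the paper.

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # A correctly ordered pair counts 1, a tie counts 0.5, a misordered pair counts 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pvctr(clicks, page_views):
    """Page View Click-Through Rate: clicks divided by page views."""
    return clicks / page_views


def relative_gain(treatment, control):
    """Relative improvement of treatment over control, e.g. 0.10 means +10%."""
    return (treatment - control) / control


offline_auc = auc(scores=[0.9, 0.8, 0.3, 0.1], labels=[1, 0, 1, 0])
# Hypothetical A/B counts, chosen only to illustrate the arithmetic.
lift = relative_gain(pvctr(55, 1000), pvctr(50, 1000))
```

Note that the online result is a relative gain in PVCTR, while the offline AUC improvements are reported as absolute differences, so the two kinds of numbers are not directly comparable.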

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison against multi-task baselines on diverse industrial tasks.
Task 4 (Content Query Rec)	AUC	0.6575	0.6979	+0.0404
Task 6 (Service Card Rec)	AUC	0.8015	0.8312	+0.0297
Task 6 (Service Card Rec)	AUC	0.7978	0.8312	+0.0334
Ablation studies validating feature fusion and cold-start fine-tuning strategies.
Task 4 (Content Query Rec)	AUC	0.7385	0.7524	+0.0139
Task 4 (Content Query Rec)	AUC	0.7295	0.7574	+0.0279
Task 6 (Service Card Rec)	AUC	0.8229	0.8446	+0.0217

Experiment Figures

t-SNE visualization of domain embeddings for different model variants.

Online A/B testing PVCTR trends over 7 days for Service Card Recommendation.

Main Takeaways

The proposed S&R Foundation model generalizes better than standard MTL approaches (MMoE, PLE) across most tasks, especially smaller/harder ones
Domain Adaptive Regularization using Jensen-Shannon Divergence is crucial for aligning heterogeneous domains
Supervised Fine-Tuning (SFT) of the foundation model provides significant lift in cold-start scenarios compared to training from scratch
LLM-based text features combined with Domain-Specific Gating effectively mitigate the cold-start problem where ID features are sparse

📚 Prerequisite Knowledge

Prerequisites

Multi-Task Learning (MTL) architectures (MMoE, PLE)
Domain Adaptation techniques
Deep Learning for Recommendation (Embeddings, MLPs)
Large Language Models (LLMs)

Key Terms

MMoE: Multi-gate Mixture-of-Experts—an MTL architecture using gating networks to combine shared experts for different tasks

PLE: Progressive Layered Extraction—an MTL architecture that separates shared and task-specific experts to avoid negative transfer

Cold Start: The challenge of recommending items or serving users with little to no prior interaction history

PVCTR: Page View Click-Through Rate—the number of clicks divided by the number of page views

Jensen-Shannon Divergence: A symmetric, bounded measure of how much two probability distributions differ (0 when they are identical), used here as a regularizer to align domain embeddings

LLM: Large Language Model—used here (e.g., BERT, ChatGLM) to extract semantic features from text

S&R: Search and Recommendation—distinct but related tasks often modeled separately in industrial systems

AUC: Area Under the ROC Curve—a performance metric for classification tasks indicating the model's ability to distinguish between classes