IDProxy: Cold-Start CTR Prediction for Ads and Recommendation at Xiaohongshu with Multimodal LLMs

📝 Paper Summary

Cold-Start Recommendation · Click-Through Rate (CTR) Prediction · Multimodal Representation Learning

IDProxy mitigates the item cold-start problem by using multimodal LLMs to generate proxy embeddings from content that are explicitly aligned with the collaborative ID embedding space of existing production CTR models.

Core Problem

Standard CTR models rely on item ID embeddings learned from interaction history, which fail for new items (cold-start) that lack this history.

Why it matters:

New items are continuously uploaded on platforms like Xiaohongshu and must be served immediately to ensure user experience, but lack the data needed for collaborative filtering
Existing solutions like simple MLP mappings fail to bridge the semantic gap between rich multimodal content and the irregular, non-clustered distribution of industrial ID embeddings
Retraining or significantly altering mature industrial ranking models to accommodate new content features is costly and operationally complex

Concrete Example: A newly uploaded post with an image and text has no click history, so its ID embedding is randomly initialized or poorly trained, causing the CTR model to rank it incorrectly. Existing methods might map its image features to an ID vector using a simple projection, but this vector often lands in a 'void' area of the ID space, disconnected from the collaborative patterns the ranker understands.

Key Novelty

Coarse-to-Fine Proxy Alignment with Multimodal LLMs (MLLMs)

Stage 1 (Coarse): Uses an MLLM to generate a global content embedding, aligned to the static ID space via contrastive learning against mature items
Stage 2 (Fine): Extracts multi-layer hidden states from the MLLM and refines them via an adaptor trained end-to-end with the frozen CTR ranker, allowing the proxy to learn ranker-specific structural priors

Architecture

Figure: The IDProxy framework, showing the two-stage training process: MLLM-based coarse proxy generation followed by CTR-aware fine-grained alignment.

Breakthrough Assessment

7/10

Offers a practical, production-proven method for aligning MLLM representations with legacy ID-based systems without requiring a full model redesign. High industrial value, though the core concept of 'content-to-ID' mapping is established.

⚙️ Technical Details

Problem Definition

Setting: Cold-start CTR prediction where new items lack interaction history but possess multimodal content (text/image)

Inputs: User ID u, Item ID i (new), Context features x_ui, Item content (image + text)

Outputs: Predicted click probability y_hat, via generated proxy embedding p_i substituting for ID embedding e_i

Pipeline Flow

Proxy Generation: MLLM Content Encoder → Coarse Projection → Fine Adaptor → Gating
Ranking: Proxy Embeddings + User Features → CTR Model → Click Probability
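
The ranking step of the flow above can be illustrated with a toy scorer. This is only a sketch under stated assumptions: the production CTR model is not described in the source, so `toy_ctr_ranker` is a hypothetical logistic stand-in, and the weights, dimensions, and random inputs are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def toy_ctr_ranker(item_emb, user_feats, w_item, w_user, bias):
    """Hypothetical stand-in for the production CTR model: a logistic scorer."""
    logit = item_emb @ w_item + user_feats @ w_user + bias
    return sigmoid(logit)

rng = np.random.default_rng(0)
d_item, d_user = 8, 4
w_item = rng.normal(size=d_item)   # frozen ranker weights (illustrative)
w_user = rng.normal(size=d_user)

proxy = rng.normal(size=d_item)        # p_i, produced by the proxy-generation stage
user_feats = rng.normal(size=d_user)   # user/context features x_ui

# The proxy takes the place of the missing ID embedding e_i at the ranker input.
y_hat = toy_ctr_ranker(proxy, user_feats, w_item, w_user, bias=0.0)
```

The point of the sketch is the interface: the ranker consumes the proxy exactly where it would consume an ID embedding, so the ranker itself does not need to change.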

System Modules

MLLM Content Encoder (Proxy Generation)

Extract rich semantic features from item text and images

Model or implementation: InternVL

Coarse Projector (Proxy Generation)

Map global MLLM representation to the ID embedding space

Model or implementation: MLP (phi)

Fine Adaptor (Proxy Generation)

Refine representations using multi-layer hidden states to capture fine-grained signals

Model or implementation: Lightweight MLP (phi_tilde)

Residual Gating (Proxy Generation)

Adaptively fuse coarse and fine proxies

Model or implementation: Gating weights W_c, W_g

CTR Ranker

Predict click probability using proxies and interaction features

Model or implementation: Production CTR model (architecture unspecified, includes feature interaction/attention)
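
One possible reading of the Residual Gating module is sketched below. The source names only the weights W_c and W_g, not the formula, so the functional form here (a sigmoid gate over the concatenated proxies, applied to the fine proxy as a residual on top of a projected coarse proxy) is an assumption, and `residual_gate` is a hypothetical name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def residual_gate(coarse, fine, W_c, W_g):
    """Fuse coarse and fine proxies (assumed form, not taken from the paper).

    gate  = sigmoid(W_g @ [coarse; fine])   element-wise gate in (0, 1)
    proxy = W_c @ coarse + gate * fine      coarse path plus gated fine residual
    """
    gate = sigmoid(W_g @ np.concatenate([coarse, fine]))
    return W_c @ coarse + gate * fine

d = 4
rng = np.random.default_rng(0)
coarse = rng.normal(size=d)
fine = rng.normal(size=d)
W_c = np.eye(d)               # identity keeps the coarse proxy unchanged
W_g = np.zeros((d, 2 * d))    # zero weights give gate = 0.5 everywhere

proxy = residual_gate(coarse, fine, W_c, W_g)
```

With these illustrative weights the output is the coarse proxy plus half of the fine proxy; in training, W_g would let the model decide per dimension how much fine-grained signal to admit.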

Novel Architectural Elements

Injection of MLLM-derived proxy embeddings directly into the feature interaction and target attention modules of the CTR ranker during end-to-end training
Layer-wise clustering and pooling of MLLM hidden states (not just the final layer) to feed a lightweight adaptor
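
The layer-wise pooling idea can be sketched as follows. The source does not say how layers are clustered or pooled, so this example assumes contiguous layer groups and mean pooling over both tokens and layers; `pool_layer_groups` and all shapes are hypothetical.

```python
import numpy as np

def pool_layer_groups(hidden_states, num_groups):
    """Pool MLLM hidden states into one vector per layer group.

    hidden_states: array of shape (num_layers, num_tokens, dim)
    returns:       array of shape (num_groups * dim,)
    """
    # Assumption: "clustering" is approximated by contiguous groups of layers.
    groups = np.array_split(hidden_states, num_groups, axis=0)
    # Mean over the layers in the group and over all tokens.
    pooled = [g.mean(axis=(0, 1)) for g in groups]
    return np.concatenate(pooled)

# Illustrative shapes: 12 layers, 5 tokens, hidden size 16.
hidden_states = np.random.default_rng(0).normal(size=(12, 5, 16))
features = pool_layer_groups(hidden_states, num_groups=3)
```

The resulting vector would be the input to the lightweight adaptor, giving it access to shallow and deep layers rather than only the final one.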

Modeling

Base Model: InternVL (MLLM)

Training Method: Two-stage training: (1) Contrastive alignment of MLLM to IDs, (2) End-to-end classification training with CTR ranker

Objective Functions:

Purpose: Align coarse proxy with ID embedding.

Formally: L_PAL = -log( exp(sim(h_i, e_i)/tau) / sum_j exp(sim(h_i, e_j)/tau) )

Purpose: Optimize CTR prediction accuracy.

Formally: L_CTR = -(1/|B|) sum_{(u,i) in B} [ y_ui log(y_hat_ui) + (1 - y_ui) log(1 - y_hat_ui) ]
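
Both objectives can be written in a few lines of NumPy. This is a minimal sketch, not the authors' code: it assumes sim is cosine similarity, that negatives are the other items in the batch, that L_PAL is averaged over the batch, and that tau = 0.07 (a common default, not a value from the source).

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def pal_loss(h, e, tau=0.07):
    """Contrastive alignment loss; row i of h is the positive for row i of e."""
    sim = l2_normalize(h) @ l2_normalize(e).T / tau   # (B, B) cosine sims / tau
    sim = sim - sim.max(axis=1, keepdims=True)        # shift for numerical stability
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))             # average over the batch

def ctr_loss(y, y_hat, eps=1e-7):
    """Binary cross-entropy averaged over the batch."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)            # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

e = np.eye(4)                      # four mutually orthogonal ID embeddings
aligned = pal_loss(e, e)           # each proxy equals its own target
shuffled = pal_loss(e[::-1], e)    # each proxy equals some other item's target
bce = ctr_loss(np.array([1.0, 0.0]), np.array([0.5, 0.5]))
```

The aligned case yields a loss near zero and the shuffled case a large one, which is the behaviour the alignment stage relies on to pull each proxy toward its own ID embedding.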

Training Data:

Items filtered by a frequency threshold (also written tau, but a separate quantity from the temperature tau in L_PAL) to ensure high-quality ID embedding targets
ID embeddings L2 normalized before alignment
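
The two preparation steps above might look like this in practice. The sketch assumes "frequency" means a per-item interaction count and that the threshold is inclusive; `build_alignment_targets`, `min_count`, and the sample values are hypothetical.

```python
import numpy as np

def build_alignment_targets(id_embeddings, interaction_counts, min_count):
    """Keep well-trained items and L2-normalize their ID embeddings.

    Returns the normalized targets and the boolean mask of kept items.
    """
    keep = interaction_counts >= min_count             # assumed inclusive threshold
    kept = id_embeddings[keep]
    norms = np.linalg.norm(kept, axis=1, keepdims=True)
    return kept / np.maximum(norms, 1e-12), keep       # guard against zero vectors

id_embeddings = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
interaction_counts = np.array([500, 3, 120])

targets, keep = build_alignment_targets(id_embeddings, interaction_counts, min_count=100)
```

Here the second item is dropped because its embedding has seen too few interactions to be a trustworthy alignment target, and the remaining rows have unit norm.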

Key Hyperparameters:

learning_rate: 1e-4
batch_size: 512
optimizer: AdamW

Comparison to Prior Work

vs. CB2CF/CLCRec: IDProxy uses MLLM hidden states (not just final output) and trains end-to-end with the ranker, rather than just mapping to a static ID space
vs. MOON: IDProxy explicitly reuses the ranker's feature interaction modules during alignment, whereas MOON relies on auxiliary objectives
vs. NoteLLM [not cited in paper]: NoteLLM generates hash-tags/categories; IDProxy generates dense vectors strictly aligned with a specific collaborative ID space

Limitations

Relies on the existence of a mature ID embedding space as a target; cannot function without a pre-trained ID-based CTR model
Two-stage training process is more complex than simple feature extraction
Inference latency for MLLM is high, requiring proxies to be pre-computed and stored (async update)
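
The pre-compute-and-store pattern in the last limitation can be sketched with an in-memory table. This is an illustration only: `ProxyStore` is a hypothetical class, a real deployment would use an external key-value store, and returning a zero vector when a proxy is not yet available is an assumption rather than something the source describes.

```python
import numpy as np

class ProxyStore:
    """In-memory stand-in for a key-value store of precomputed proxies."""

    def __init__(self, dim):
        self._table = {}
        self._default = np.zeros(dim)   # assumed fallback for missing proxies

    def put(self, item_id, proxy):
        # Called by an offline or asynchronous job after running the MLLM.
        self._table[item_id] = np.asarray(proxy, dtype=float)

    def get(self, item_id):
        # Called at serving time; a lookup only, the MLLM is never run here.
        return self._table.get(item_id, self._default)

store = ProxyStore(dim=3)
store.put("note_1", [0.1, 0.2, 0.3])

hit = store.get("note_1")    # proxy already written by the async job
miss = store.get("note_2")   # proxy not computed yet, falls back to default
```

Separating `put` from `get` keeps the expensive MLLM forward pass off the serving path, at the cost of a window in which a brand-new item has no proxy.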

Reproducibility

Code not provided. Implementation details for the 'Base' industrial CTR system are omitted for confidentiality. InternVL is an open-source model.

📊 Experiments & Results

Evaluation Setup

Large-scale offline experiments and online A/B testing on Xiaohongshu (RedNote) platform

Benchmarks:

Xiaohongshu Explore Feed (Content Feed) (Production Recommendation)
Xiaohongshu Display Ads (Production Advertising)

Metrics:

Time Spent
Reads (Clicks)
Engagements (Likes, Comments)
Ads Metrics (Adv)
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

The system has been successfully deployed in production, serving hundreds of millions of users daily.
Quantitative results are not reported in this summary; the paper claims the method outperforms the production baseline (Base) in both offline and online tests.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Click-Through Rate (CTR) prediction models
Familiarity with Embedding-based recommendation (Collaborative Filtering)
Basics of Multimodal Large Language Models (MLLMs)

Key Terms

CTR: Click-Through Rate—the probability that a user will click on a recommended item

Cold-Start: The scenario where a system must recommend new items that have little or no prior interaction data

ID Embedding: A dense vector representation of a specific item ID, learned from historical user interactions

MLLM: Multimodal Large Language Model—an AI model capable of processing and understanding both text and image inputs

Proxy Embedding: A synthetic embedding generated from content features designed to mimic the properties of a learned ID embedding

InternVL: A specific open-source Multimodal Large Language Model used as the backbone in this paper

Contrastive Learning: A training method that pulls representations of related pairs (e.g., content and ID of the same item) together while pushing unrelated pairs apart