IDProxy mitigates the item cold-start problem by using multimodal LLMs to generate proxy embeddings from content that are explicitly aligned with the collaborative ID embedding space of existing production CTR models.
Core Problem
Standard CTR models rely on item ID embeddings learned from interaction history, which fail for new items (cold-start) that lack this history.
Why it matters:
New items are continuously uploaded on platforms like Xiaohongshu and must be served immediately to preserve the user experience, yet they lack the interaction data needed for collaborative filtering
Existing solutions like simple MLP mappings fail to bridge the semantic gap between rich multimodal content and the irregular, non-clustered distribution of industrial ID embeddings
Retraining or significantly altering mature industrial ranking models to accommodate new content features is costly and operationally complex
Concrete Example: A newly uploaded post with an image and text has no click history, so its ID embedding is randomly initialized or poorly trained, causing the CTR model to rank it incorrectly. Existing methods might map its image features to an ID vector using a simple projection, but this vector often lands in a 'void' area of the ID space, disconnected from the collaborative patterns the ranker understands.
Key Novelty
Coarse-to-Fine Proxy Alignment with Multimodal LLMs (MLLMs)
Stage 1 (Coarse): Uses an MLLM to generate a global content embedding, aligned to the static ID space via contrastive learning against mature items
Stage 2 (Fine): Extracts multi-layer hidden states from the MLLM and refines them via an adaptor trained end-to-end with the frozen CTR ranker, allowing the proxy to learn ranker-specific structural priors
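The Stage 1 alignment can be illustrated with a minimal InfoNCE-style sketch. The summary only says that contrastive learning against mature items is used; the exact loss form, the temperature value, and the function names below are assumptions for illustration, not the paper's implementation.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce_loss(content_proxies, id_embeddings, temperature=0.1):
    """Average InfoNCE loss; row i of both inputs belongs to the same item.

    content_proxies: projected MLLM embeddings, one per mature item.
    id_embeddings:   learned ID embeddings of the same items (the targets).
    temperature:     assumed value; the summary does not report one.
    """
    proxies = [l2_normalize(p) for p in content_proxies]
    targets = [l2_normalize(e) for e in id_embeddings]
    total = 0.0
    for i, proxy in enumerate(proxies):
        logits = [dot(proxy, t) / temperature for t in targets]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        # Negative log-softmax of the matching (positive) pair.
        total += log_denominator - logits[i]
    return total / len(proxies)

# Toy check: proxies that point at their own ID embedding give a lower loss.
ids = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss([[0.9, 0.1], [0.1, 0.9]], ids)
swapped_loss = info_nce_loss([[0.1, 0.9], [0.9, 0.1]], ids)
```

In this toy case the other items in the batch act as negatives, so the loss is small only when each proxy is closest to the ID embedding of its own item.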
Architecture
The IDProxy framework illustrating the two-stage training process: MLLM-based coarse proxy generation and CTR-aware fine-grained alignment.
Breakthrough Assessment
7/10
Offers a practical, production-proven method for aligning MLLM representations with legacy ID-based systems without requiring a full model redesign. High industrial value, though the core concept of 'content-to-ID' mapping is established.
Technical Details
Problem Definition
Setting: Cold-start CTR prediction where new items lack interaction history but possess multimodal content (text/image)
Inputs: User ID u, Item ID i (new), Context features x_ui, Item content (image + text)
Outputs: Predicted click probability y_hat, via generated proxy embedding p_i substituting for ID embedding e_i
Ranking: Proxy Embeddings + User Features → CTR Model → Click Probability
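The problem definition can be written compactly as follows. The symbols f_θ (the CTR ranker), g_ψ (the proxy generator), and c_i (the item's image and text content) are notation introduced here for illustration; only u, i, x_ui, p_i, e_i, and y_hat come from the summary.

```latex
p_i = g_{\psi}(c_i), \qquad
\hat{y}_{ui} = f_{\theta}\left(u,\; p_i,\; x_{ui}\right)
\quad \text{with } p_i \text{ used in place of } e_i \text{ for a new item } i.
```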
System Modules
MLLM Content Encoder (Proxy Generation)
Extract rich semantic features from item text and images
Model or implementation: InternVL
Coarse Projector (Proxy Generation)
Map global MLLM representation to the ID embedding space
Model or implementation: MLP (phi)
Fine Adaptor (Proxy Generation)
Refine representations using multi-layer hidden states to capture fine-grained signals
Model or implementation: Lightweight MLP (phi_tilde)
Residual Gating (Proxy Generation)
Adaptively fuse coarse and fine proxies
Model or implementation: Gating weights W_c, W_g
CTR Ranker
Predict click probability using proxies and interaction features
Model or implementation: Production CTR model (architecture unspecified, includes feature interaction/attention)
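The residual gating module can be pictured with a small sketch. The summary names gating weights W_c and W_g but does not give the fusion formula, so the form below (an element-wise sigmoid gate applied to a projected fine-grained correction, added to the coarse proxy) and the hypothetical matrices `w_gate` and `w_proj` are assumptions, not the paper's definition.

```python
import math

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_residual_fusion(coarse, fine, w_gate, w_proj):
    """Assumed fusion: coarse + gate * (w_proj @ fine).

    gate = sigmoid(w_gate @ concat(coarse, fine)), element-wise in (0, 1).
    w_gate and w_proj are hypothetical stand-ins for the learned weights.
    """
    gate = [sigmoid(z) for z in matvec(w_gate, coarse + fine)]
    correction = matvec(w_proj, fine)
    return [c + g * r for c, g, r in zip(coarse, gate, correction)]

coarse = [0.5, -0.2]
fine = [0.1, 0.3]
w_proj = [[1.0, 0.0], [0.0, 1.0]]        # identity projection for the demo
w_gate_zero = [[0.0] * 4, [0.0] * 4]     # zero weights give a gate of 0.5
fused = gated_residual_fusion(coarse, fine, w_gate_zero, w_proj)
```

With zero gate weights every gate value is 0.5, so `fused` is approximately `[0.55, -0.05]`: the coarse proxy plus half of the fine correction. A residual form like this keeps the coarse proxy as the default and lets the gate decide how much fine-grained signal to add.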
Novel Architectural Elements
Injection of MLLM-derived proxy embeddings directly into the feature interaction and target attention modules of the CTR ranker during end-to-end training
Layer-wise clustering and pooling of MLLM hidden states (not just the final layer) to feed a lightweight adaptor
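One way to read the layer-wise clustering and pooling step is sketched below. The summary does not specify how layers are clustered, so the contiguous equal-size grouping, the mean pooling, and the name `pool_layer_groups` are assumptions chosen for simplicity.

```python
def pool_layer_groups(layer_states, num_groups):
    """Mean-pool per-layer vectors within contiguous groups, then concatenate.

    layer_states: list of per-layer vectors (one per MLLM layer), each
                  assumed to be already pooled over tokens.
    num_groups:   number of contiguous layer groups (assumed hyperparameter).
    """
    num_layers = len(layer_states)
    if not 1 <= num_groups <= num_layers:
        raise ValueError("num_groups must be between 1 and the number of layers")
    pooled = []
    for g in range(num_groups):
        start = g * num_layers // num_groups
        end = (g + 1) * num_layers // num_groups
        group = layer_states[start:end]
        # Element-wise mean across the layers in this group.
        pooled.extend(sum(dim) / len(group) for dim in zip(*group))
    return pooled

# Four layers with 2-dimensional states, pooled into two groups.
states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
adaptor_input = pool_layer_groups(states, num_groups=2)
```

Here `adaptor_input` is `[2.0, 3.0, 6.0, 7.0]`: the mean of layers 1-2 followed by the mean of layers 3-4. The concatenated vector is what a lightweight adaptor would consume in place of the final-layer output alone.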
Modeling
Base Model: InternVL (MLLM)
Training Method: Two-stage training: (1) Contrastive alignment of MLLM to IDs, (2) End-to-end classification training with CTR ranker
Items filtered by frequency threshold tau to ensure high-quality ID embedding targets
ID embeddings L2 normalized before alignment
Key Hyperparameters:
learning_rate: 1e-4
batch_size: 512
optimizer: AdamW
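The target-preparation steps listed under Training Method (frequency filtering with threshold tau, then L2 normalization) can be sketched as follows. Using raw interaction counts as the frequency measure, keeping items with a count of at least tau, and the name `prepare_alignment_targets` are assumptions; the summary gives neither the value of tau nor the exact criterion.

```python
import math

def prepare_alignment_targets(id_embeddings, interaction_counts, tau):
    """Keep items with at least `tau` interactions and L2-normalize their ID embeddings.

    id_embeddings:      dict of item_id -> embedding (list of floats)
    interaction_counts: dict of item_id -> interaction count (assumed measure)
    tau:                frequency threshold (value not given in the summary)
    """
    targets = {}
    for item_id, emb in id_embeddings.items():
        if interaction_counts.get(item_id, 0) < tau:
            continue  # too few interactions: embedding is likely under-trained
        norm = math.sqrt(sum(v * v for v in emb))
        if norm == 0.0:
            continue  # a zero vector cannot be normalized
        targets[item_id] = [v / norm for v in emb]
    return targets

# Hypothetical items: only "note_a" passes the threshold.
embeddings = {"note_a": [3.0, 4.0], "note_b": [1.0, 1.0]}
counts = {"note_a": 1200, "note_b": 3}
targets = prepare_alignment_targets(embeddings, counts, tau=100)
```

In this example only `note_a` survives, with its embedding rescaled to unit length (approximately `[0.6, 0.8]`), so the contrastive objective compares directions rather than magnitudes.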
Comparison to Prior Work
vs. CB2CF/CLCRec: IDProxy uses MLLM hidden states (not just final output) and trains end-to-end with the ranker, rather than just mapping to a static ID space
vs. MOON: IDProxy explicitly reuses the ranker's feature interaction modules during alignment, whereas MOON relies on auxiliary objectives
vs. NoteLLM [not cited in paper]: NoteLLM generates hashtags/categories; IDProxy generates dense vectors strictly aligned with a specific collaborative ID space
Limitations
Relies on the existence of a mature ID embedding space as a target; cannot function without a pre-trained ID-based CTR model
Two-stage training process is more complex than simple feature extraction
MLLM inference latency is high, so proxies must be pre-computed, stored, and updated asynchronously
Reproducibility
Code not provided. Implementation details for the 'Base' industrial CTR system are omitted for confidentiality. InternVL is an open-source model.
Experiments & Results
Evaluation Setup
Large-scale offline experiments and online A/B testing on Xiaohongshu (RedNote) platform
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
The system has been successfully deployed in production, serving hundreds of millions of users daily.
Quantitative results are not available in this summary; the paper claims that the method outperforms the production baseline (Base) in both offline and online tests.
Prerequisite Knowledge
Prerequisites
Understanding of Click-Through Rate (CTR) prediction models
Familiarity with embedding-based recommendation (collaborative filtering)
Basics of Multimodal Large Language Models (MLLMs)
Key Terms
CTR: Click-Through Rate, the probability that a user will click on a recommended item
Cold-Start: The scenario where a system must recommend new items that have little or no prior interaction data
ID Embedding: A dense vector representation of a specific item ID, learned from historical user interactions
MLLM: Multimodal Large Language Model, an AI model capable of processing and understanding both text and image inputs
Proxy Embedding: A synthetic embedding generated from content features designed to mimic the properties of a learned ID embedding
InternVL: A specific open-source Multimodal Large Language Model used as the backbone in this paper
Contrastive Learning: A training method that pulls representations of related pairs (e.g., content and ID of the same item) together while pushing unrelated pairs apart