
DMGIN: How Multimodal LLMs Enhance Large Recommendation Models for Lifelong User Post-click Behaviors

Zhuoxing Wei, Qingchen Xie, Qi Liu
arXiv (2025)
Recommendation MM Memory P13N

📝 Paper Summary

Click-Through Rate (CTR) Prediction, Long-sequence User Behavior Modeling
DMGIN uses Multimodal LLMs to cluster shops into interest groups based on visual and textual similarity, compressing lifelong user behavior sequences for efficient and accurate CTR prediction.
Core Problem
Modeling lifelong user behavior sequences for CTR prediction is computationally expensive because of their length, while existing multimodal integration methods suffer from architectural mismatches and add further computational burden.
Why it matters:
  • Long post-click behavior sequences contain critical user interest data but create severe performance bottlenecks in training and inference
  • Two-stage retrieval methods often lose context by retrieving incomplete subsequences or removing duplicates that contain valuable temporal patterns
  • Directly embedding multimodal features into Large Recommendation Models (LRM) is computationally prohibitive for billions of daily interactions
Concrete Example: A user might repeatedly visit the same food delivery shop, viewing different dishes and descriptions over months. Standard models either treat every click as a separate token in a long sequence (too expensive) or deduplicate items (losing frequency and timing data). DMGIN groups these repeated shop interactions into a single 'interest cluster' while retaining intra-group statistics.
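The grouping described in this example can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's implementation: the `GroupStats` fields, the `compress_clicks` helper, the shop and group IDs, and the choice to order groups by recency are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    """Hypothetical per-group summary kept in place of the raw clicks."""
    group_id: str
    click_count: int = 0
    first_ts: int = 0
    last_ts: int = 0
    total_dwell_s: float = 0.0


def compress_clicks(clicks, shop_to_group):
    """Collapse (timestamp, shop_id, dwell_seconds) clicks into one entry per group.

    Frequency and timing survive as statistics instead of being dropped,
    which is what plain deduplication would do.
    """
    groups = {}
    for ts, shop_id, dwell in clicks:
        gid = shop_to_group[shop_id]
        stats = groups.get(gid)
        if stats is None:
            stats = GroupStats(group_id=gid, first_ts=ts, last_ts=ts)
            groups[gid] = stats
        stats.click_count += 1
        stats.first_ts = min(stats.first_ts, ts)
        stats.last_ts = max(stats.last_ts, ts)
        stats.total_dwell_s += dwell
    # Assumption: order the compressed sequence by most recent activity.
    return sorted(groups.values(), key=lambda s: s.last_ts)


# Toy data: three visits to the same shop plus one visit elsewhere.
clicks = [
    (100, "shop_a", 30.0),
    (200, "shop_b", 12.0),
    (300, "shop_a", 45.0),
    (400, "shop_a", 20.0),
]
shop_to_group = {"shop_a": "g_noodles", "shop_b": "g_coffee"}

compressed = compress_clicks(clicks, shop_to_group)
for entry in compressed:
    print(entry)
```

Four raw clicks shrink to two group entries, and the repeated visits to `shop_a` remain visible through `click_count`, `first_ts`, `last_ts`, and `total_dwell_s`.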
Key Novelty
Deep Multimodal Group Interest Network (DMGIN)
  • Uses an offline MLLM to learn cross-modal shop representations (aligning text/images), then clusters all shops into semantic groups (e.g., 'fast food chains') to compress user sequences
  • Replaces raw long sequences of items with shorter sequences of interest groups, supplemented by intra-group statistics (frequency, time spent) to retain granularity without the computational cost
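As a rough illustration of the offline clustering step, the sketch below assigns shops to groups from precomputed embedding vectors. The summary does not say which clustering algorithm DMGIN uses, so plain k-means is an assumption here, as are the two-dimensional toy embeddings (standing in for MLLM outputs), the shop IDs, and the `group_N` naming.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(embedding, centroids):
    """Index of the centroid closest to the given embedding."""
    distances = [squared_distance(embedding, c) for c in centroids]
    return distances.index(min(distances))


def kmeans(embeddings, init_centroids, iters=10):
    """Plain k-means with caller-supplied initial centroids (deterministic)."""
    centroids = [list(c) for c in init_centroids]
    labels = []
    for _ in range(iters):
        # Assignment step: each shop joins its nearest centroid.
        labels = [nearest_centroid(e, centroids) for e in embeddings]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [e for e, lab in zip(embeddings, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy stand-ins for MLLM shop embeddings (hypothetical values).
shop_ids = ["burger_1", "burger_2", "tea_1", "tea_2"]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

labels, centroids = kmeans(embeddings, init_centroids=[embeddings[0], embeddings[2]])
shop_to_group = {shop: f"group_{lab}" for shop, lab in zip(shop_ids, labels)}
print(shop_to_group)
```

Because the mapping is computed offline, the online model would only need a lookup from shop ID to group ID, which is consistent with the latency argument made for the approach.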
Architecture
Figure 2: The overall architecture of DMGIN, illustrating the pipeline from raw behavior sequences to final interest representation.
Evaluation Highlights
  • +4.7% improvement in Click-Through Rate (CTR) in an online A/B test within a large-scale LBS advertising system
  • +2.3% increase in Revenue per Mille (RPM) in the same online A/B test
  • +4.5% CTR improvement cited in the paper's contributions summary; this likely refers to the same A/B test, since the text varies between 4.5% and 4.7%
Breakthrough Assessment
7/10
Strong industrial application showing how MLLMs can be used for offline structural optimization (clustering) rather than just online inference, solving the latency bottleneck of multimodal LRM.