DMGIN: How Multimodal LLMs Enhance Large Recommendation Models for Lifelong User Post-click Behaviors

📝 Paper Summary

Click-Through Rate (CTR) Prediction
Long-sequence User Behavior Modeling

DMGIN uses Multimodal LLMs to cluster shops into interest groups based on visual and textual similarity, compressing lifelong user behavior sequences for efficient and accurate CTR prediction.

Core Problem

Modeling lifelong user behavior sequences for CTR prediction is computationally expensive because of their length, while existing multimodal integration methods suffer from architectural mismatches and add further computational burden.

Why it matters:

Long post-click behavior sequences contain critical user interest data but create severe performance bottlenecks in training and inference
Two-stage retrieval methods often lose context by retrieving incomplete subsequences or removing duplicates that contain valuable temporal patterns
Directly embedding multimodal features into Large Recommendation Models (LRM) is computationally prohibitive for billions of daily interactions

Concrete Example: A user might repeatedly visit the same food delivery shop, viewing different dishes and descriptions over months. Standard models either treat every click as a separate token in a long sequence (too expensive) or deduplicate items (losing frequency and timing data). DMGIN groups these repeated shop interactions into a single 'interest cluster' while retaining intra-group statistics.

Key Novelty

Deep Multimodal Group Interest Network (DMGIN)

Uses an offline MLLM to learn cross-modal shop representations (aligning text/images), then clusters all shops into semantic groups (e.g., 'fast food chains') to compress user sequences
Replaces raw long sequences of items with shorter sequences of interest groups, supplemented by intra-group statistics (frequency, time spent) to retain granularity without the computational cost
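
The compression step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the event layout `(shop_id, timestamp, dwell_seconds)`, the `GroupStats` fields, and the `shop_to_group` lookup are all hypothetical names, and the paper may keep different or additional statistics per group.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    group_id: int
    click_count: int = 0              # how often the user clicked shops in this group
    total_dwell_seconds: float = 0.0  # accumulated time spent within the group
    last_timestamp: int = 0           # most recent interaction, used for ordering


def compress_sequence(behaviors, shop_to_group):
    """Collapse (shop_id, timestamp, dwell_seconds) events into per-group stats.

    `shop_to_group` stands in for the offline cluster assignment (assumed name).
    Returns one GroupStats per group, ordered from oldest to most recent.
    """
    groups = {}
    for shop_id, timestamp, dwell in behaviors:
        gid = shop_to_group[shop_id]
        stats = groups.setdefault(gid, GroupStats(group_id=gid))
        stats.click_count += 1
        stats.total_dwell_seconds += dwell
        stats.last_timestamp = max(stats.last_timestamp, timestamp)
    return sorted(groups.values(), key=lambda s: s.last_timestamp)


# Toy data: two burger shops share a group, one sushi shop has its own.
shop_to_group = {"burger_a": 0, "burger_b": 0, "sushi_c": 1}
behaviors = [
    ("burger_a", 100, 30.0),
    ("sushi_c", 200, 45.0),
    ("burger_a", 300, 20.0),
    ("burger_b", 400, 10.0),
]
compressed = compress_sequence(behaviors, shop_to_group)
```

Four raw events become two group tokens, and the repeat visits survive as a click count and total dwell time rather than being dropped by deduplication.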

Architecture

The overall architecture of DMGIN, illustrating the pipeline from raw behavior sequences to final interest representation.

Evaluation Highlights

+4.7% improvement in Click-Through Rate (CTR) in an online A/B test within a large-scale LBS advertising system
+2.3% increase in Revenue per Mille (RPM) in the same online A/B test
The contributions summary instead reports a +4.5% CTR improvement, likely for the same A/B test; the paper's text varies between 4.5% and 4.7%

Breakthrough Assessment

7/10

Strong industrial application showing how MLLMs can be used for offline structural optimization (clustering) rather than just online inference, solving the latency bottleneck of multimodal LRM.

⚙️ Technical Details

Problem Definition

Setting: Click-Through Rate (CTR) prediction in an industrial recommendation setting with multimodal content and lifelong user behavior sequences

Inputs: User profile, target item, and lifelong post-click behavior sequence (shops, timestamps, behaviors)

Outputs: Probability of user clicking the target item

Pipeline Flow

Offline: Cross-Modal Representation Learning (Align Text/Image)
Offline: Interest-Driven Entity Clustering (Group Shops)
Online: Sequence Compression (Map User History to Groups)
Online: Intra-Group Interest Enhancement (Statistics & Transformers)
Online: Temporal Group Evolution Transformer (Inter-Group Modeling)
Online: Target Attention & Prediction

System Modules

Cross-Modal Representation Learning Module (CMRLM) (Offline Processing)

Generate semantically aligned multimodal embeddings for shops

Model or implementation: CLIP-like dual-tower model

Interest-Driven Entity Clustering Module (IDECM) (Offline Processing)

Cluster shops into interest groups to compress sequences

Model or implementation: K-means clustering
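
As a rough illustration of the offline clustering step, the sketch below runs plain Lloyd's K-means over toy shop embeddings. The two-dimensional vectors, the shop names, and the first-k initialisation are assumptions made for a deterministic example; the paper does not report its K-means configuration, and a production system would more likely use a library implementation with k-means++ seeding.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns (centroids, assignments).

    Centroids start at the first k points so the example is deterministic.
    """
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every embedding.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Toy multimodal embeddings (hypothetical values).
shop_embeddings = {
    "burger_a": [0.9, 0.1],
    "sushi_c": [0.1, 0.9],
    "burger_b": [0.8, 0.2],
    "sushi_d": [0.2, 0.8],
}
centroids, assignments = kmeans(list(shop_embeddings.values()), k=2)
shop_to_group = dict(zip(shop_embeddings, assignments))
```

The resulting `shop_to_group` table is the kind of lookup that can be computed once offline and then reused at serving time without running the multimodal model online.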

Intra-Group Interest Enhancement Module (IGIEM) (Online Inference)

Capture fine-grained user interest within a specific group

Model or implementation: Statistical pooling + Intra-group Transformer (MHSA)

Temporal Group Evolution Transformer (TGET) (Online Inference)

Model the evolution of user interests across different groups over time

Model or implementation: Hierarchical Sequential Transduction Units (HSTU)

Target Attention (Online Inference)

Identify candidate-specific interest signals

Model or implementation: Attention Mechanism
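
The summary only states that an attention mechanism is used here. The sketch below assumes scaled dot-product attention with the candidate item as the query and the group representations as keys and values; the function name and toy vectors are illustrative, not taken from the paper.

```python
import math


def target_attention(candidate, group_vectors):
    """Scaled dot-product attention with the candidate item as the query.

    Returns (weights, pooled), where `pooled` is the attention-weighted
    sum of the group vectors.
    """
    dim = len(candidate)
    scores = [
        sum(q * k for q, k in zip(candidate, g)) / math.sqrt(dim)
        for g in group_vectors
    ]
    # Numerically stable softmax over the groups.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * g[i] for w, g in zip(weights, group_vectors))
        for i in range(dim)
    ]
    return weights, pooled


# The candidate resembles the first group, so that group gets more weight.
candidate = [1.0, 0.0]
group_vectors = [[1.0, 0.0], [0.0, 1.0]]
weights, pooled = target_attention(candidate, group_vectors)
```

Because the sequence has already been compressed to groups, this attention runs over group tokens rather than over every raw click.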

Novel Architectural Elements

Hierarchical processing pipeline: Raw Behaviors → Intra-Group Statistics/Transformer → Inter-Group Transformer
Decoupled multimodal grouping: Using MLLM embeddings for offline clustering (grouping) rather than direct online inference input

Modeling

Base Model: Deep Multimodal Group Interest Network (DMGIN)

Training Method: Pre-training (CLIP-like) followed by Clustering and Supervised CTR training

Objective Functions:

Purpose: Align text and image representations.

Formally: Contrastive loss that maximizes the similarity of matched text-image pairs relative to mismatched pairs.

Purpose: CTR Prediction.

Formally: Binary Cross Entropy loss (standard for CTR tasks; implied rather than stated in the paper).
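
For concreteness, the two objectives can be sketched as below. The contrastive term is written as a symmetric InfoNCE loss in the style of CLIP, which is an assumption: the summary only says "CLIP-like". The temperature of 0.07 and the toy embeddings are likewise illustrative, and embeddings are assumed to be L2-normalised.

```python
import math


def log_softmax_row(row):
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_style_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch; index i in each list is a matched pair."""
    n = len(text_emb)
    logits = [
        [sum(t * v for t, v in zip(text_emb[i], image_emb[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Text-to-image: the matching image for text i sits on the diagonal.
    t2i = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    # Image-to-text: the same computation over the transposed logits.
    transposed = [list(col) for col in zip(*logits)]
    i2t = -sum(log_softmax_row(transposed[i])[i] for i in range(n)) / n
    return (t2i + i2t) / 2


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean BCE between click labels (0 or 1) and predicted click probabilities."""
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(labels, probs)
    ) / len(labels)


texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_style_loss(texts, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = clip_style_loss(texts, [[0.0, 1.0], [1.0, 0.0]])
bce = binary_cross_entropy([1, 0], [0.9, 0.1])
```

Correctly paired embeddings yield a much lower contrastive loss than shuffled ones, which is the signal that pulls a shop's text and image representations together.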

Training Data:

Multimodal pairs: shop name/images, food name/descriptions, keywords/scene images

Compute: Not reported in the paper

Comparison to Prior Work

vs. SIM/TWIN: DMGIN groups by multimodal semantic similarity rather than just ID/Category, and models intra-group dynamics explicitly
vs. QARM/BBQRec: DMGIN focuses on utilizing MLLM for *structure reorganization* (grouping/clustering) of long sequences, rather than just feature alignment
vs. Standard LRM [not cited in paper]: Standard LRMs typically use ID-based embeddings; DMGIN introduces multimodal-derived groupings to compress sequence length while keeping semantic richness

Limitations

Grouping inevitably disrupts original sequence structure, potentially causing information loss (mitigated by intra-group modules but not eliminated)
Relies on offline pre-computation of clusters; might lag in capturing very new or rapidly changing shop characteristics without re-clustering
Requires significant offline processing for MLLM inference on all entities (shops)

Reproducibility

No replication artifacts are mentioned in the paper, and no code URL is provided. The industrial and public datasets are described only in general terms, with no access details.

📊 Experiments & Results

Evaluation Setup

Deployed in a large-scale Location-Based Services (LBS) advertising system

Benchmarks:

Industrial LBS Dataset (CTR Prediction, Online A/B Test) [New]
Public Datasets (CTR Prediction)

Metrics:

Click-Through Rate (CTR)
Revenue per Mille (RPM)
Statistical methodology: Not explicitly reported in the paper
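
The arithmetic behind these metrics is simple enough to sketch. The numbers below are made up purely for illustration, since the paper does not report baseline values, and the example assumes the reported deltas are relative lifts, as is common for A/B test reports.

```python
def ctr(clicks, impressions):
    """Click-through rate: clicks divided by impressions."""
    return clicks / impressions


def rpm(revenue, impressions):
    """Revenue per mille: revenue per 1,000 impressions."""
    return revenue * 1000 / impressions


def relative_lift(treatment, control):
    """Relative change of the treatment arm over the control arm."""
    return (treatment - control) / control


# Hypothetical A/B arms, NOT figures from the paper.
control_ctr = ctr(clicks=50, impressions=1000)
treatment_ctr = ctr(clicks=55, impressions=1000)
ctr_lift = relative_lift(treatment_ctr, control_ctr)
example_rpm = rpm(revenue=12.0, impressions=1000)
```

Under this reading, a reported "+4.7% CTR" would mean the treatment arm's CTR is 1.047 times the control arm's, not that CTR rose by 4.7 percentage points.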

Key Results

Online A/B testing results in a live industrial LBS advertising system.

Benchmark	Metric	Baseline	This Paper	Δ
Industrial LBS System	CTR	Not reported in the paper	Not reported in the paper	+4.7%
Industrial LBS System	Revenue per Mille (RPM)	Not reported in the paper	Not reported in the paper	+2.3%

Main Takeaways

DMGIN successfully reduces user sequence length from tens of thousands to hundreds via multimodal grouping.
The method improves both accuracy (CTR) and revenue (RPM) in a real-world industrial setting.
Intra-group strategies (statistics + transformer) are effective in mitigating information loss caused by the clustering process.

📚 Prerequisite Knowledge

Prerequisites

Basics of Click-Through Rate (CTR) prediction
Transformer architectures (attention mechanisms)
Multimodal learning (CLIP-style alignment)
K-means clustering

Key Terms

LRM: Large Recommendation Models—industrial-scale systems designed to handle massive user-item interaction data

MLLM: Multimodal Large Language Models—AI models capable of processing and generating information from multiple modalities like text and images

CTR: Click-Through Rate—the ratio of users who click on a specific link to the total number of users who view a page, advertisement, or email

MHSA: Multi-Head Self Attention—a mechanism in Transformers that allows the model to jointly attend to information from different representation subspaces

LBS: Location-Based Services—services offered through a mobile device that take into account the device's geographical location

HSTU: Hierarchical Sequential Transduction Units—a specific transformer-based component used for modeling long sequences efficiently in recommendation systems

RPM: Revenue per Mille (revenue generated per 1,000 impressions; closely related to eCPM in advertising contexts)

CLIP: Contrastive Language-Image Pre-training—a model trained to predict which caption goes with which image, learning aligned multimodal representations