When Large Vision Language Models Meet Multimodal Sequential Recommendation: An Empirical Study

📝 Paper Summary

Multimodal Sequential Recommendation (MSR), Large Vision Language Models (LVLMs), Generative Recommendation

MSRBench evaluates three commercial Large Vision Language Models across five integration strategies for sequential recommendation, finding that using LVLMs as rerankers is the most effective but computationally expensive approach.

Core Problem

Traditional multimodal sequential recommenders use shallow alignment that may overlook intricate correlations between modalities, while the best strategies for integrating powerful Large Vision Language Models (LVLMs) into these systems remain unstudied.

Why it matters:

Current multimodal systems struggle to fully leverage complex visual-textual relationships crucial for web-based content
There is no systematic benchmark comparing different roles LVLMs can play (e.g., direct recommender vs. enhancer vs. reranker)
Industry adoption requires understanding the trade-offs between recommendation accuracy gains and the high computational cost of LVLMs

Concrete Example: In a SASRec system, a user who buys dolls receives a generic 'finger puppet' recommendation because the model misses the specific visual style preference. In contrast, an LVLM acting as a reranker can analyze the visual consistency of the user's history and correctly recommend a 'collectible doll' instead.

Key Novelty

MSRBench: A Systematic Evaluation of LVLM Roles in Recommendation

Defines five distinct strategies for LVLM integration: direct recommender (S1), item enhancer via captioning (S2), reranker (S3), and two hybrid combinations (S4, S5)
Constructs 'Amazon Review Plus', an augmented dataset where every item image is captioned by three state-of-the-art LVLMs to enable text-rich modeling
Conducts the first comprehensive empirical study comparing commercial LVLMs (GPT-4V, GPT-4o, Claude-3) against specialized multimodal baselines

Architecture

Five different strategies (S1-S5) for integrating LVLMs into the recommendation pipeline

Evaluation Highlights

GPT-4o as a reranker (Strategy 3) achieves 38.85% H@1 on the Beauty dataset, significantly outperforming the best baseline FREEDOM (33.00%)
Using LVLMs as rerankers (S3) consistently outperforms using them as direct recommenders (S1), with S3 achieving ~1.66x the H@1 of S1 for GPT-4o on Beauty (38.85% vs 23.37%)
LVLM inference is computationally expensive: S3 (Reranker) with GPT-4V requires ~42.49 seconds per user, compared to 0.0025 seconds for traditional baselines

Breakthrough Assessment

7/10

Provides a crucial, systematic empirical foundation for LVLM usage in recommendation. While methodologically standard (benchmarking existing models), the insights on strategy trade-offs and efficiency are highly valuable for the field.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Sequential Recommendation (Next Item Prediction)

Inputs: Sequence of user's historical interactions, where each item has both a title (text) and an image (vision)

Outputs: A ranked list of candidate items predicted to be the user's next interaction

Pipeline Flow

Input: User Interaction History (Images + Titles)
Strategy Selection (S1-S5)
LVLM Processing (e.g., GPT-4o)
Output: Ranked Item List

System Modules

LVLM as Direct Recommender (S1)

Directly predict next item from concatenated history images and titles

Model or implementation: GPT-4V / GPT-4o / Claude-3-Opus

LVLM as Item Enhancer (S2)

Generate descriptive captions for item images to enrich textual metadata

Model or implementation: GPT-4V / GPT-4o / Claude-3-Opus

LVLM as Reranker (S3)

Re-order a candidate list provided by a base recommender using multimodal context

Model or implementation: GPT-4V / GPT-4o / Claude-3-Opus
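
The reranker role (S3) can be illustrated with a minimal, text-only sketch. It assumes the base recommender has already produced a candidate list, and it takes the LVLM as an injected callable `ask_lvlm` (a hypothetical stand-in for a real API client; item images are omitted). The prompt wording and the comma-separated reply format are assumptions for illustration, not the prompts used in the paper.

```python
import re
from typing import Callable, List, Sequence


def build_rerank_prompt(history_titles: Sequence[str], candidates: Sequence[str]) -> str:
    """Format the user's history and a numbered candidate list as a prompt."""
    history = "\n".join(f"- {t}" for t in history_titles)
    options = "\n".join(f"{i}. {t}" for i, t in enumerate(candidates, start=1))
    return (
        "The user interacted with these items, in order:\n"
        f"{history}\n\n"
        "Rank the candidates below from most to least likely next item.\n"
        "Answer with candidate numbers separated by commas.\n"
        f"{options}"
    )


def rerank(
    history_titles: Sequence[str],
    candidates: Sequence[str],
    ask_lvlm: Callable[[str], str],  # hypothetical: prompt in, raw reply out
) -> List[str]:
    """Re-order `candidates` according to the model's reply."""
    reply = ask_lvlm(build_rerank_prompt(history_titles, candidates))
    order: List[int] = []
    for token in re.findall(r"\d+", reply):
        idx = int(token) - 1
        # Ignore out-of-range numbers and repeated picks.
        if 0 <= idx < len(candidates) and idx not in order:
            order.append(idx)
    # Candidates the model skipped keep their base-recommender order at the end.
    order += [i for i in range(len(candidates)) if i not in order]
    return [candidates[i] for i in order]
```

Passing the model as a callable keeps the sketch independent of any vendor SDK and makes it testable with a stub, for example `rerank(history, candidates, lambda prompt: "2, 1")`.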

Novel Architectural Elements

Five distinct integration architectures (S1-S5) defining different functional roles for LVLMs within the recommendation pipeline
Concatenated image input mode (Mode 1) for S1, where history items are stitched into a single visual prompt

Modeling

Base Model: GPT-4o, GPT-4 Vision, Claude-3-Opus

Training Method: Zero-shot prompting for LVLMs; Standard training for baselines (SASRec, MMGCN, etc.)

Compute: Inference time: S3 (Reranker) takes ~24.49s/user (GPT-4o) vs 0.0025s/user (SASRec). Training time: Baselines take ~5-400s per epoch on GPU.
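
To put the latency gap in perspective, the per-user figures above can be turned into a slowdown factor and a sequential serving time. This is plain arithmetic on the reported numbers; the workload size `N_USERS` is a hypothetical value chosen for illustration, and requests are assumed to be served one after another.

```python
def slowdown(lvlm_seconds: float, baseline_seconds: float) -> float:
    """How many times slower the LVLM pipeline is per user."""
    return lvlm_seconds / baseline_seconds


def sequential_hours(n_users: int, seconds_per_user: float) -> float:
    """Wall-clock hours needed to serve n_users one after another."""
    return n_users * seconds_per_user / 3600.0


GPT4O_S3_SECONDS = 24.49  # per user, as reported above
SASREC_SECONDS = 0.0025   # per user, as reported above
N_USERS = 10_000          # hypothetical workload, not from the paper

print(f"slowdown: {slowdown(GPT4O_S3_SECONDS, SASREC_SECONDS):,.0f}x")
print(f"LVLM reranking: {sequential_hours(N_USERS, GPT4O_S3_SECONDS):.1f} hours")
print(f"SASRec: {sequential_hours(N_USERS, SASREC_SECONDS) * 3600:.0f} seconds")
```

The ratio works out to roughly 9,800x, close to four orders of magnitude.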

Comparison to Prior Work

vs. MoRec: Uses LVLMs for reasoning/reranking rather than just feature extraction
vs. Rec-GPT4V: Systematically benchmarks 5 distinct strategies (including enhancer/reranker combinations) rather than just direct recommendation
vs. LlamaRec: Incorporates visual modality (LVLM) rather than text-only LLM reranking
vs. UniMP [not cited in paper]: Focuses on commercial closed-source LVLMs evaluation rather than training unified multimodal personalization models

Limitations

Computational inefficiency makes real-time deployment of reranking strategies (S3, S5) currently impractical (high latency)
Study is limited to closed-source commercial models; open-source LVLMs (Qwen-VL, GLM-4V) were excluded due to poor instruction following
Evaluation uses a small candidate subset (1 positive + 29 negatives) rather than full-ranking due to cost and context window limits
Did not explore fine-tuning of LVLMs, only zero-shot prompting strategies
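
The reduced-candidate protocol noted above (1 positive + 29 negatives) can be sketched as follows. The summary does not describe how negatives are drawn, so uniform sampling from items the user has not interacted with is an assumption here, as are the function and parameter names.

```python
import random
from typing import Hashable, List, Sequence, Set


def sample_candidates(
    positive: Hashable,
    catalog: Sequence[Hashable],
    seen: Set[Hashable],
    n_negatives: int = 29,
    seed: int = 0,
) -> List[Hashable]:
    """Return a shuffled list holding the positive item plus sampled negatives."""
    rng = random.Random(seed)
    # Assumption: negatives exclude the positive and anything in the user's history.
    pool = [item for item in catalog if item != positive and item not in seen]
    if len(pool) < n_negatives:
        raise ValueError("catalog too small for the requested number of negatives")
    candidates = rng.sample(pool, n_negatives) + [positive]
    # Shuffle so the positive's position carries no information.
    rng.shuffle(candidates)
    return candidates
```

With 30 candidates per user, each evaluation prompt stays small enough for an LVLM context window, at the cost of not measuring full-catalog ranking quality.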

📊 Experiments & Results

Evaluation Setup

Sequential recommendation (next-item prediction) on e-commerce datasets

Benchmarks:

Amazon Review Plus (Beauty) (Sequential Recommendation) [New]
Amazon Review Plus (Sports) (Sequential Recommendation) [New]
Amazon Review Plus (Toys) (Sequential Recommendation) [New]
Amazon Review Plus (Clothing) (Sequential Recommendation) [New]

Metrics:

Hit Ratio @ k (H@1, H@5)
NDCG @ k (N@5)
Statistical methodology: Paired t-test with p<0.01 for significance testing
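
For next-item prediction there is exactly one ground-truth item per user, which makes both metrics simple to compute. The sketch below assumes that single-target setting (so the ideal DCG is 1) and averages per-user scores; the function names are illustrative.

```python
import math
from typing import Hashable, List, Sequence


def hit_at_k(ranked: Sequence[Hashable], target: Hashable, k: int) -> float:
    """1.0 if the ground-truth item is in the top k, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[Hashable], target: Hashable, k: int) -> float:
    """NDCG with a single relevant item: 1 / log2(rank + 1), or 0 if outside the top k."""
    for position, item in enumerate(ranked[:k]):  # position is 0-based
        if item == target:
            return 1.0 / math.log2(position + 2)
    return 0.0


def mean_metric(metric, rankings: List[Sequence[Hashable]], targets: List[Hashable], k: int) -> float:
    """Average a per-user metric over all users."""
    scores = [metric(r, t, k) for r, t in zip(rankings, targets)]
    return sum(scores) / len(scores)
```

As a reference point, with a 30-item candidate list a uniformly random ranking has an expected H@1 of 1/30 (about 3.3%) and an expected H@5 of 5/30 (about 16.7%).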

Key Results

Comparative performance of different LVLM integration strategies (S1-S5) against state-of-the-art baselines (H@1 in %, Δ in percentage points).
Benchmark	Metric	Baseline	This Paper	Δ
Beauty	H@1	33.00	38.85	+5.85
Sports	H@1	33.75	33.00	-0.75
Toys	H@1	33.25	40.50	+7.25
Beauty	H@1	23.37	38.85	+15.48
Beauty	H@1	31.71	38.85	+7.14

Experiment Figures

Impact of input modalities (Title only, Image only, Title+Image) on performance (S1 strategy)

Case study comparing SASRec vs. GPT-4o (S3/S5) recommendations

Main Takeaways

Reranking (S3) is the strongest strategy, outperforming direct recommendation (S1) and item enhancement (S2) in almost all categories and models.
Combining strategies (S4/S5) yields mixed results; adding item enhancement to reranking (S5) helps in some categories (Toys) but hurts in others (Beauty) compared to pure reranking (S3).
GPT-4o is superior to GPT-4V and Claude-3-Opus for recommendation tasks, particularly in handling multimodal inputs for reranking.
Image-only inputs for direct recommendation perform very poorly (close to random); text titles are essential for accurate prediction.

📚 Prerequisite Knowledge

Prerequisites

Sequential Recommendation (predicting next item based on history)
Multimodal Learning (processing text and images together)
Large Vision Language Models (LVLMs) and Prompt Engineering

Key Terms

MSR: Multimodal Sequential Recommendation—systems that use both visual and textual data to predict user preferences

LVLM: Large Vision Language Model—models like GPT-4V capable of processing and reasoning about both images and text

SASRec: Self-Attentive Sequential Recommendation—a standard baseline model using attention mechanisms to capture sequential patterns

H@k: Hit Ratio at k—the percentage of times the ground-truth item appears in the top-k recommended items

N@k: Normalized Discounted Cumulative Gain at k—a ranking metric that accounts for the position of the correct item in the list

Reranker: A strategy where a model re-orders a candidate list generated by another retrieval system, rather than searching the entire catalog itself

Item Enhancer: A strategy using LVLMs to generate rich textual descriptions (captions) from item images to augment metadata

Hallucination: When a generative model recommends items that do not exist or are not in the valid candidate list
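
One simple way to detect hallucinated recommendations is to check each returned title against the valid candidate list. The sketch below matches on case- and whitespace-normalized titles; that matching rule and the helper names are assumptions for illustration, since the summary does not say how the paper handles invalid outputs.

```python
from typing import List, Sequence, Tuple


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences still match."""
    return " ".join(title.lower().split())


def split_valid(
    recommended: Sequence[str], candidates: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Separate model outputs into valid candidates and hallucinated titles."""
    lookup = {normalize(c): c for c in candidates}
    valid: List[str] = []
    hallucinated: List[str] = []
    for title in recommended:
        key = normalize(title)
        if key not in lookup:
            hallucinated.append(title)      # not in the candidate list at all
        elif lookup[key] not in valid:
            valid.append(lookup[key])       # keep the canonical title, drop repeats
    return valid, hallucinated
```

The share of outputs landing in `hallucinated` gives a rough hallucination rate, and the `valid` list can be scored with the usual ranking metrics.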