Bridging Collaborative Filtering and Large Language Models with Dynamic Alignment, Multimodal Fusion and Evidence-grounded Explanations

Bo Ma, LuYao Liu, Simon Lau, Chandler Yuan, XueY Cui, and Rosie Zhang
Department of Software & Microelectronics, Peking University; Civil, Commercial and Economic Law School, China University of Political Science and Law
arXiv (2025)
Recommendation MM P13N

📝 Paper Summary

LLM-based Recommendation, Multimodal Recommendation, Explainable AI
DynMM-Explain-LLMRec enhances frozen LLM recommenders by injecting real-time user preferences via lightweight adapters, fusing multimodal signals, and enforcing evidence-grounded explanations without retraining the base model.
Core Problem
Existing alignment-based recommenders rely on static training snapshots that miss evolving user preferences, fail to leverage non-textual item modalities (audio/video), and act as black boxes without verifiable explanations.
Why it matters:
  • Static models degrade quickly in real-world streaming scenarios where user interests drift rapidly (temporal drift)
  • Modern platforms (TikTok, YouTube) rely heavily on visual and audio cues, which text-only LLMs ignore, limiting recommendation quality
  • Users increasingly require transparent justifications to trust algorithmic decisions, but current LLM explanations often hallucinate reasoning not backed by data
Concrete Example: A user recently started watching 'Space Survival' movies, but the static model only knows their old preference for 'RomComs'. In addition, the model recommends a movie based solely on its title and text description, missing that its visual style matches the user's taste, and it gives a generic explanation ('It is popular') rather than citing the specific attributes the movie shares with the user's history.
Key Novelty
DynMM-Explain-LLMRec
  • Augments a frozen 'base' recommender with a tiny, trainable online adapter that updates user/item representations in real-time based on a sliding window of recent interactions
  • Fuses collaborative signals with visual (CLIP) and audio (Wav2Vec2) features in a shared latent space, with a gating mechanism that handles missing modalities
  • Injects explicit 'evidence tokens' (top-k collaborative neighbors and key attributes) into the LLM prompt to ground explanations in actual data
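
The three ideas above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the class `OnlineAdapter`, the functions `gated_fusion` and `build_prompt`, the moving-average update rule, the fixed (non-learned) gate logits, and the `[EVIDENCE]` prompt layout are all hypothetical choices made for illustration. The sketch only shows the general shape of each mechanism: a small residual on top of a frozen embedding, a softmax gate renormalized over whichever modalities are present, and a prompt that carries top-k neighbors and attributes as explicit evidence.

```python
from collections import deque
import math


class OnlineAdapter:
    """Hypothetical residual adapter: a small delta added to a frozen base embedding."""

    def __init__(self, dim, window=5, lr=0.5):
        self.delta = [0.0] * dim
        self.recent = deque(maxlen=window)  # sliding window of recent item embeddings
        self.lr = lr

    def update(self, base_user, item_emb):
        """Nudge delta so that base + delta moves toward the mean of the window."""
        self.recent.append(item_emb)
        n = len(self.recent)
        target = [sum(e[i] for e in self.recent) / n for i in range(len(base_user))]
        self.delta = [
            d + self.lr * ((t - b) - d)
            for d, t, b in zip(self.delta, target, base_user)
        ]

    def apply(self, base_user):
        """Return the adapted embedding; the base vector itself is never modified."""
        return [b + d for b, d in zip(base_user, self.delta)]


def gated_fusion(collab, modalities, gate_logits):
    """Fuse a collaborative vector with optional modality vectors (None = missing).

    The softmax gate is computed only over the inputs that are present, so a
    missing modality contributes nothing and the remaining weights sum to 1.
    """
    present = {name: vec for name, vec in modalities.items() if vec is not None}
    vecs = {"collab": collab, **present}
    names = list(vecs)
    logits = [gate_logits[name] for name in names]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = {name: e / total for name, e in zip(names, exps)}
    return [sum(weights[n] * vecs[n][i] for n in names) for i in range(len(collab))]


def build_prompt(user_id, candidate, neighbors, attributes, k=3):
    """Build a prompt whose evidence line lists the top-k neighbors and key attributes."""
    top = sorted(neighbors, key=lambda pair: pair[1], reverse=True)[:k]
    evidence = "; ".join(f"{title} (sim={score:.2f})" for title, score in top)
    return (
        f"User {user_id}, candidate item: {candidate}\n"
        f"[EVIDENCE] neighbors: {evidence} | attributes: {', '.join(attributes)}\n"
        "Explain the recommendation using only the evidence above."
    )


# Toy usage with made-up vectors and titles (not data from the paper).
adapter = OnlineAdapter(dim=2, window=3)
base_user = [0.0, 0.0]
adapter.update(base_user, [1.0, 1.0])
user_vec = adapter.apply(base_user)
item_vec = gated_fusion(
    [1.0, 0.0],
    {"visual": [0.0, 1.0], "audio": None},  # audio is missing for this item
    {"collab": 0.0, "visual": 0.0, "audio": 0.0},
)
print(build_prompt("u1", "Movie X", [("Movie A", 0.2), ("Movie B", 0.9)], ["sci-fi"], k=1))
```

In this sketch only `delta` changes at serving time, which is the property that lets the base recommender stay frozen. A trained system would presumably learn the gate logits and the adapter update rather than fix them by hand.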
Evaluation Highlights
  • Achieves a 2.4% cumulative accuracy improvement (Hit@10) over static aligner baselines on Amazon Movies&TV, with dynamic adapters contributing +1.2%
  • Remains efficient, adding only 8.6% inference latency over the frozen base model while providing real-time adaptation
  • Demonstrates robust multimodal fusion: performance drops only 0.8% when visual features are missing, compared to larger drops in baselines that cannot handle missing modalities
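
For reference, Hit@K is commonly computed as the fraction of test users whose held-out item appears among the top K recommendations. The sketch below assumes one held-out item per user (a leave-one-out style protocol, which this summary does not confirm). The function name `hit_at_k` and the toy item IDs are hypothetical.

```python
def hit_at_k(ranked_lists, held_out, k=10):
    """Fraction of users whose held-out item is in their top-k ranked list."""
    if not held_out:
        raise ValueError("held_out must contain at least one user")
    hits = sum(
        1 for ranked, target in zip(ranked_lists, held_out) if target in ranked[:k]
    )
    return hits / len(held_out)


# Toy usage: two users, each with a ranked list and one held-out item.
ranked = [["a", "b", "c"], ["d", "e", "f"]]
targets = ["b", "f"]
score = hit_at_k(ranked, targets, k=2)
```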
Breakthrough Assessment
7/10
Strong engineering framework addressing three critical gaps (dynamic, multimodal, explainable) simultaneously. While the components (adapters, fusion) are known, the unified application to frozen LLM recommenders is practical and effective.