LaViC: Adapting Large Vision-Language Models to Visually-Aware Conversational Recommendation

📝 Paper Summary

Visually-aware conversational recommendation
Vision-Language Models (VLMs)

LaViC compresses high-dimensional product images into minimal visual tokens via self-distillation, enabling Large Vision-Language Models to process multiple item candidates in conversational recommendation without exceeding context limits.

Core Problem

Processing multiple product images in conversational recommendation causes token explosion, exceeding VLM context windows and increasing computational cost.

Why it matters:

Standard VLMs (e.g., LLaVA) use thousands of tokens per image, making it impossible to analyze multiple retrieval candidates simultaneously
Text-only systems miss crucial visual details (style, color, design) essential for domains like fashion and home decor
Naive end-to-end fine-tuning of massive VLMs on limited recommendation data often leads to overfitting

Concrete Example: A user requests a 'hoodie-like military-style jacket with chest pockets.' Text descriptions alone might match multiple items, but verifying the exact pocket arrangement or silhouette requires visual inspection. Feeding 10 candidate images (each ~2,885 tokens) into a VLM exceeds typical 4k context limits.
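
The arithmetic behind this example can be checked with a short sketch. The per-image token count and candidate count come from the summary; reading "4k" as 4,096 tokens is an assumption, and text tokens (dialogue, titles) are ignored.

```python
# Figures from the summary: ~2,885 tokens per image, 5 after compression, 10 candidates.
TOKENS_PER_IMAGE_FULL = 2885
TOKENS_PER_IMAGE_COMPRESSED = 5
NUM_CANDIDATES = 10
CONTEXT_LIMIT = 4096  # assumption: "4k" read as 4,096 tokens


def visual_token_cost(tokens_per_image: int, num_images: int) -> int:
    """Total visual tokens for a candidate set (text tokens are not counted)."""
    return tokens_per_image * num_images


full = visual_token_cost(TOKENS_PER_IMAGE_FULL, NUM_CANDIDATES)
compressed = visual_token_cost(TOKENS_PER_IMAGE_COMPRESSED, NUM_CANDIDATES)

print(full, full > CONTEXT_LIMIT)              # 28850 True
print(compressed, compressed > CONTEXT_LIMIT)  # 50 False
```

Under these assumptions the uncompressed candidate set needs roughly seven times the context window for images alone, while the compressed set uses 50 visual tokens.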

Key Novelty

Two-stage Visual Compression and Recommendation Framework

Visual Knowledge Self-Distillation: Compresses thousands of image tokens into just 5 [CLS] embeddings per item by training the vision projector to reproduce detailed captions from these few tokens alone
Recommendation Prompt Tuning: Fine-tunes the LLM to take these compressed visual tokens and text context to select the correct item ID from a candidate list, avoiding hallucination

Architecture

The recommendation inference workflow of LaViC.

Evaluation Highlights

Significantly outperforms text-only baselines (e.g., +24.4 accuracy points vs LLaMA-2-7B on the Fashion subset)
Surpasses standard VLM baselines (e.g., LLaVA-v1.5) by effectively handling multiple images within context limits
Achieves competitive or superior accuracy compared to proprietary models like GPT-4o, despite being much smaller and open-source

Breakthrough Assessment

7/10

Effective solution to the multi-image token context bottleneck in VLMs. While the architecture relies on existing components (LLaVA), the self-distillation strategy for compression is practical and the new dataset fills a gap in visual conversational recommendation.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn conversational recommendation where the system ranks a candidate set of items based on dialogue history and item visual/textual features

Inputs: Dialogue history C, candidate item set I (each with title and image)

Outputs: The ID of a single item selected from the candidate set, which should match the ground-truth item
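
One way to picture the inputs and output is as a small data structure plus a prompt builder. This is a minimal sketch, not the paper's prompt format: the class names, the `<visual>` placeholder, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    item_id: str     # product identifier (format is hypothetical)
    title: str
    image_path: str  # product image location; not loaded in this sketch


@dataclass
class Sample:
    dialogue: List[str]          # dialogue history C, one string per turn
    candidates: List[Candidate]  # candidate item set I
    target_id: str               # ID of the ground-truth item


# Hypothetical placeholder that a model would replace with compressed visual tokens.
VISUAL_SLOT = "<visual>"


def build_prompt(sample: Sample) -> str:
    """Lay out the dialogue followed by one line per candidate (assumed layout)."""
    lines = ["Conversation:"]
    lines += [f"- {turn}" for turn in sample.dialogue]
    lines.append("Candidates:")
    for c in sample.candidates:
        lines.append(f"ID: {c.item_id} | Title: {c.title} | Image: {VISUAL_SLOT}")
    lines.append("Answer with the ID of the best matching item.")
    return "\n".join(lines)


sample = Sample(
    dialogue=["Looking for a hoodie-like military-style jacket with chest pockets."],
    candidates=[
        Candidate("A1", "Hooded field jacket", "a1.jpg"),
        Candidate("B2", "Denim trucker jacket", "b2.jpg"),
    ],
    target_id="A1",
)
prompt = build_prompt(sample)
print(prompt)
```

Keeping the item ID next to each candidate's visual slot is what lets the model answer with an ID that can be checked against the candidate list.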

Pipeline Flow

Visual Self-Distillation (Compress Image Tokens)
Recommendation Prompt Tuning (Fine-tune LLM on Compressed Tokens)

System Modules

Vision Encoder & Projector

Compress raw image patches into 5 [CLS] embeddings

Model or implementation: LLaVA-v1.6 vision tower (CLIP/SigLIP based) with LoRA

Large Language Model

Process dialogue and candidate items to predict target item ID

Model or implementation: Vicuna/Mistral (LLaVA backbone) with LoRA

Novel Architectural Elements

Replacement of standard full-patch visual tokens (2885 tokens) with sparse [CLS]-only representations (5 tokens) for multi-image inputs
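
The replacement described above amounts to keeping one [CLS] vector per sub-image and dropping the patch tokens. The sketch below uses toy nested lists in place of real encoder output. It assumes 576 patches plus one [CLS] token per sub-image, with [CLS] at position 0; that layout is an assumption, chosen because 5 x 577 reproduces the 2,885 figure.

```python
from typing import List

Vector = List[float]

PATCHES_PER_SUB_IMAGE = 576                       # assumption: 24 x 24 patch grid
TOKENS_PER_SUB_IMAGE = PATCHES_PER_SUB_IMAGE + 1  # plus one [CLS] token
SUB_IMAGES_PER_ITEM = 5                           # from the summary


def keep_cls_only(sub_image_tokens: List[List[Vector]], cls_index: int = 0) -> List[Vector]:
    """Keep one [CLS] vector per sub-image and drop all patch tokens.

    Assumes the [CLS] token sits at position `cls_index` of each sequence.
    """
    return [tokens[cls_index] for tokens in sub_image_tokens]


# Toy encoder output: 5 sub-images x 577 tokens x 4-dim vectors.
# Each vector stores (sub-image index, token index, 0, 0) so the selection is visible.
DIM = 4
item = [
    [[float(s), float(t)] + [0.0] * (DIM - 2) for t in range(TOKENS_PER_SUB_IMAGE)]
    for s in range(SUB_IMAGES_PER_ITEM)
]

compressed = keep_cls_only(item)
print(SUB_IMAGES_PER_ITEM * TOKENS_PER_SUB_IMAGE, "->", len(compressed))  # 2885 -> 5
```

Selecting the vectors is the trivial part; per the summary, the self-distillation stage is what trains those five embeddings to carry enough information to reproduce a detailed caption.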

Modeling

Base Model: LLaVA-v1.6 (uses Vicuna or Mistral as LLM backbone)

Training Method: Two-stage training: (1) Visual Distillation via LoRA, (2) Recommendation Tuning via LoRA

Objective Functions:

Purpose: Ensure compressed visual tokens can reproduce detailed image captions.

Formally: Autoregressive language modeling loss L_distill on generated description D_i given compressed tokens cls_{i,r}
Purpose: Train LLM to select correct item ID given context and compressed visuals.

Formally: Autoregressive loss L_rec on ground-truth ID id_{i*} given context and candidates X_{ij}
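
Written out, the two objectives could take the following form. This is a sketch reconstructed from the descriptions above, not copied from the paper: the token indices d_{i,t} and y_t, the parameter symbols θ and φ, and the fixed count of five compressed tokens in the conditioning are assumptions.

```latex
% Stage 1: reproduce the description D_i = (d_{i,1}, ..., d_{i,|D_i|})
% from the compressed tokens alone.
\mathcal{L}_{\text{distill}}
  = -\sum_{t=1}^{|D_i|}
    \log p_{\theta}\!\left(d_{i,t} \,\middle|\, \mathrm{cls}_{i,1},\dots,\mathrm{cls}_{i,5},\; d_{i,<t}\right)

% Stage 2: generate the tokens y_1, ..., y_T of the ground-truth ID id_{i^*}
% given the dialogue history C and the candidate representations X_{ij}.
\mathcal{L}_{\text{rec}}
  = -\sum_{t=1}^{T}
    \log p_{\phi}\!\left(y_{t} \,\middle|\, C,\; \{X_{ij}\}_{j},\; y_{<t}\right)
```

Both are standard next-token negative log-likelihoods; they differ in what is being generated (a caption versus an item ID) and in which parameters are updated.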

Adaptation: LoRA (Low-Rank Adaptation) applied to vision tower/projector in stage 1, and to LLM in stage 2

Trainable Parameters: Vision module parameters (Stage 1), LLM parameters (Stage 2) via LoRA
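
For readers unfamiliar with LoRA, the update applied to a frozen weight matrix can be sketched in a few lines of plain Python. This illustrates the general LoRA formulation, h = W0 x + (alpha / r) B A x, and is not the paper's training code; the matrix sizes, rank, and alpha value are toy choices.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(w0: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float) -> Vector:
    """h = W0 x + (alpha / r) * B (A x); W0 stays frozen, only A and B are trained."""
    r = len(a)  # rank = number of rows of A
    base = matvec(w0, x)
    update = matvec(b, matvec(a, x))
    return [h + (alpha / r) * u for h, u in zip(base, update)]


w0 = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen weight, shape (2, 3)
a = [[0.5, 0.5, 0.5]]                    # trainable, shape (r=1, 3)
b_init = [[0.0], [0.0]]                  # trainable, shape (2, r); zero at initialization
b_trained = [[1.0], [-1.0]]              # toy values standing in for a trained B
x = [1.0, 2.0, 3.0]

print(lora_forward(w0, a, b_init, x, alpha=2.0))     # [7.0, 2.0]
print(lora_forward(w0, a, b_trained, x, alpha=2.0))  # [13.0, -4.0]
```

With B initialized to zero the adapted layer starts out identical to the frozen one, which is why LoRA fine-tuning begins from the pre-trained model's behavior.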

Training Data:

Reddit-Amazon dataset: ~19K conversations aligned with Amazon products across Beauty, Fashion, Home categories

Key Hyperparameters:

sub_images_per_item: 5
compressed_tokens_per_item: 5
original_tokens_per_item: 2885
candidate_set_size: 10
distillation_epochs: Typically 1-2 (converges quickly)

Compute: Not reported in the paper

Comparison to Prior Work

vs. LLaMA-2: LaViC incorporates visual modality which is crucial for aesthetic domains
vs. LLaVA-v1.5: LaViC compresses images to 5 tokens allowing 10+ items in context; standard LLaVA would overflow or require massive compute
vs. GPT-4o: LaViC is open-source, fine-tunable, and computationally cheaper while achieving comparable accuracy
vs. VIP5 [not cited in paper]: VIP5 also uses recommendation via generation but typically processes single images or concatenates features differently without specific token compression for multi-item context
vs. P5 [not cited in paper]: P5 is a text-only recommendation foundation model; LaViC explicitly integrates vision

Limitations

Depends on candidate retrieval quality; if the correct item isn't in the top-10 candidates, the model cannot select it
Compression to 5 tokens might lose fine-grained visual details compared to full patch embeddings
Evaluation focuses on selection accuracy, not the fluency or helpfulness of generated conversational responses

Reproducibility

Code: https://github.com/jeon185/LaViC

Code and Reddit-Amazon dataset are publicly available at https://github.com/jeon185/LaViC. The paper specifies the base model (LLaVA-v1.6) and the two-stage training logic clearly.

📊 Experiments & Results

Evaluation Setup

Ranking/Selection task: Select the correct ground-truth item from a candidate set of 10 items based on dialogue history.

Benchmarks:

Reddit-Amazon (Fashion) (Visually-aware conversational recommendation) [New]
Reddit-Amazon (Beauty) (Visually-aware conversational recommendation) [New]
Reddit-Amazon (Home) (Visually-aware conversational recommendation) [New]

Metrics:

Accuracy (Selection Accuracy)
Statistical methodology: Not explicitly reported in the paper
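
Selection accuracy can be computed as below. This is a minimal sketch; treating a predicted ID that is not in the candidate set as incorrect is an assumption about how hallucinated IDs are scored, and the function and variable names are illustrative.

```python
from typing import Sequence


def selection_accuracy(
    predicted_ids: Sequence[str],
    target_ids: Sequence[str],
    candidate_sets: Sequence[Sequence[str]],
) -> float:
    """Fraction of conversations where the predicted ID equals the ground truth.

    A prediction outside its candidate set (a hallucinated ID) counts as wrong.
    """
    if not (len(predicted_ids) == len(target_ids) == len(candidate_sets)):
        raise ValueError("inputs must have the same length")
    if not target_ids:
        raise ValueError("no examples to score")
    correct = sum(
        1
        for pred, target, cands in zip(predicted_ids, target_ids, candidate_sets)
        if pred in cands and pred == target
    )
    return correct / len(target_ids)


candidates = [["A1", "A2"], ["B2", "B3"], ["C1", "C2"], ["D4", "D5"]]
targets = ["A1", "B3", "C1", "D4"]
predictions = ["A1", "B2", "X9", "D4"]  # "X9" is not a candidate

print(selection_accuracy(predictions, targets, candidates))  # 0.5
```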

Key Results

LaViC significantly outperforms text-only baselines on the Reddit-Amazon dataset, demonstrating the value of visual information.
Benchmark	Metric	Baseline	This Paper	Δ
Reddit-Amazon (Fashion)	Accuracy	24.4	48.8	+24.4
Reddit-Amazon (Home)	Accuracy	30.1	58.4	+28.3

LaViC outperforms proprietary models like GPT-3.5 and performs competitively with GPT-4o.
Benchmark	Metric	Baseline	This Paper	Δ
Reddit-Amazon (Fashion)	Accuracy	34.5	48.8	+14.3
Reddit-Amazon (Fashion)	Accuracy	47.2	48.8	+1.6

Experiment Figures

Perplexity (PPL) of generated image descriptions during the visual self-distillation training phase.

Main Takeaways

Visual information is critical: Text-only baselines (LLaMA-2, GPT-3.5) consistently underperform compared to visually-aware LaViC across all categories.
Compression works: Compressing each image from ~2,885 tokens to 5 tokens preserves enough information to outperform full-context models that struggle with context limits.
Domain robustness: LaViC shows consistent improvements across Fashion, Beauty, and Home categories, verifying the method's applicability to various visual domains.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs) architecture (e.g., LLaVA)
Knowledge Distillation concepts
Prompt Tuning / LoRA

Key Terms

Visual tokens: Vector representations of image patches processed by a vision encoder

[CLS] embedding: A special token embedding commonly used in transformers to represent the aggregate meaning of a sequence (here, a sub-image)

Self-distillation: A process where a model teaches itself (or a modified version of itself) to perform a task, here compressing its own detailed visual representation into fewer tokens

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and trains small rank decomposition matrices

Token explosion: The rapid increase in sequence length when processing multiple images, often exceeding the maximum context window of language models

LLaVA: Large Language-and-Vision Assistant—an open-source VLM that connects a vision encoder (CLIP) with an LLM