Evaluation Setup
Ranking/Selection task: Select the ground-truth item from a set of 10 candidate items, given the dialogue history.
Benchmarks:
- Reddit-Amazon (Fashion) (Visually-aware conversational recommendation) [New]
- Reddit-Amazon (Beauty) (Visually-aware conversational recommendation) [New]
- Reddit-Amazon (Home) (Visually-aware conversational recommendation) [New]
Metrics:
- Accuracy (Selection Accuracy)
- Statistical methodology: Not explicitly reported in the paper
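The selection-accuracy metric described above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function name `selection_accuracy`, the representation of each prediction as a candidate index, and the choice to report a percentage are assumptions made for the example.

```python
from typing import Sequence

NUM_CANDIDATES = 10  # candidate set size stated in the task description


def selection_accuracy(predicted: Sequence[int], ground_truth: Sequence[int]) -> float:
    """Percentage of dialogues where the selected candidate is the ground-truth item.

    Both arguments hold one candidate index (0..NUM_CANDIDATES-1) per dialogue.
    This index-based encoding is a hypothetical choice for illustration.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not predicted:
        raise ValueError("at least one dialogue is required")
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return 100.0 * correct / len(predicted)


# Expected accuracy of uniform random guessing over 10 candidates.
chance_level = 100.0 / NUM_CANDIDATES

# Toy example: 2 of 4 selections match the ground truth.
example = selection_accuracy([3, 0, 7, 7], [3, 1, 7, 2])
print(example)       # 50.0
print(chance_level)  # 10.0
```

With 10 candidates, uniform random selection yields 10% accuracy, which gives a floor against which the reported numbers can be read (assuming they are percentages).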
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| LaViC significantly outperforms text-only baselines on the Reddit-Amazon dataset, demonstrating the value of visual information. |
| Reddit-Amazon (Fashion) | Accuracy | 24.4 | 48.8 | +24.4 |
| Reddit-Amazon (Home) | Accuracy | 30.1 | 58.4 | +28.3 |
| LaViC outperforms proprietary models like GPT-3.5 and performs competitively with GPT-4o. |
| Reddit-Amazon (Fashion) | Accuracy | 34.5 | 48.8 | +14.3 |
| Reddit-Amazon (Fashion) | Accuracy | 47.2 | 48.8 | +1.6 |
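The Δ column is the absolute difference between the "This Paper" and "Baseline" accuracies. The short check below recomputes it from the reported values; the variable names and the rounding to one decimal place are choices made for this sketch.

```python
# (benchmark, baseline accuracy, LaViC accuracy), in the order the rows appear.
rows = [
    ("Reddit-Amazon (Fashion)", 24.4, 48.8),
    ("Reddit-Amazon (Home)", 30.1, 58.4),
    ("Reddit-Amazon (Fashion)", 34.5, 48.8),
    ("Reddit-Amazon (Fashion)", 47.2, 48.8),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute improvement in accuracy, rounded to one decimal place."""
    return round(ours - baseline, 1)


deltas = [delta(baseline, ours) for _, baseline, ours in rows]
print(deltas)  # [24.4, 28.3, 14.3, 1.6]
```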
Main Takeaways
- Visual information is critical: Text-only baselines (LLaMA-2, GPT-3.5) consistently underperform the visually aware LaViC across all categories.
- Compression works: Compressing images from ~2800 tokens to 5 tokens preserves enough information to outperform full-context models that struggle with context limits.
- Domain robustness: LaViC shows consistent improvements across Fashion, Beauty, and Home categories, verifying the method's applicability to various visual domains.
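To make the compression takeaway concrete, the arithmetic below estimates the image-token budget for one 10-candidate selection task before and after compression. It is a rough sketch under stated assumptions: the token counts (~2800 and 5) are read as per-image figures, and one image per candidate is assumed; neither detail is confirmed by the summary above.

```python
NUM_CANDIDATES = 10                   # candidate set size from the evaluation setup
TOKENS_PER_IMAGE_UNCOMPRESSED = 2800  # approximate figure from the takeaways
TOKENS_PER_IMAGE_COMPRESSED = 5       # figure from the takeaways
IMAGES_PER_CANDIDATE = 1              # assumption for illustration only


def image_token_budget(tokens_per_image: int,
                       num_candidates: int = NUM_CANDIDATES,
                       images_per_candidate: int = IMAGES_PER_CANDIDATE) -> int:
    """Total image tokens needed to show every candidate in one prompt."""
    return tokens_per_image * num_candidates * images_per_candidate


uncompressed_total = image_token_budget(TOKENS_PER_IMAGE_UNCOMPRESSED)
compressed_total = image_token_budget(TOKENS_PER_IMAGE_COMPRESSED)
compression_ratio = TOKENS_PER_IMAGE_UNCOMPRESSED / TOKENS_PER_IMAGE_COMPRESSED

print(uncompressed_total)  # 28000
print(compressed_total)    # 50
print(compression_ratio)   # 560.0
```

Under these assumptions, the image tokens for a full candidate set drop from roughly 28,000 to 50, which illustrates why uncompressed inputs can run into context limits while compressed ones do not.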