Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation

📝 Paper Summary

LLM as Recommender System Benchmarking

RecBench systematically evaluates 17 LLMs against conventional recommenders across varying item representations and tasks, finding LLMs superior in accuracy but significantly less efficient.

Core Problem

Existing benchmarks for LLM-based recommenders are fragmented, often evaluating single scenarios with limited item representations (mostly text/ID) and lacking comprehensive comparisons with diverse traditional models and datasets.

Why it matters:

Industrial applications require balancing accuracy and efficiency; while LLMs show promise, their practical trade-offs against optimized conventional models remain unclear without a standardized comparison.
Current studies often overlook diverse item representations like semantic identifiers, leading to an incomplete understanding of how best to align LLMs with recommendation tasks.

Concrete Example: Previous benchmarks like LLMRec or PromptRec typically test only text-based or ID-based inputs on a single dataset like Amazon Beauty. They fail to reveal how an LLM performs when items are represented as hierarchical semantic codes (semantic identifiers) versus traditional IDs across diverse domains like News or Fashion.

Key Novelty

RecBench: A Multi-Dimensional Evaluation Framework

Systematically compares four distinct item representation forms (unique identifier, text, semantic embedding, semantic identifier) to determine optimal alignment between items and LLMs.
Evaluates both pair-wise (CTR) and list-wise (SeqRec) tasks across 5 diverse domains (Fashion, News, Video, Books, Music) using 17 different LLMs and diverse conventional baselines.
Introduces a 'Conditional Beam Search' (CBS) for semantic identifiers to ensure generated token sequences always map to valid items in the candidate set.

Architecture

The RecBench framework illustrating the Item Representation module, the two main Recommendation Tasks (Pair-wise CTR and List-wise SeqRec), and the Evaluation protocols.

Evaluation Highlights

LLM-based recommenders achieve up to +5% improvement in AUC for CTR prediction compared to conventional deep learning models.
In Sequential Recommendation, LLMs outperform baselines by up to +170% in NDCG@10.
Conventional models enhanced with LLM embeddings (LLM-for-RS) achieve 95% of the performance of standalone LLMs while maintaining significantly higher inference speed.

Breakthrough Assessment

8/10

Provides the most comprehensive and standardized benchmark to date for LLM-based recommendation, covering diverse representations and tasks, though primarily an evaluation paper rather than a new architectural proposal.

⚙️ Technical Details

Problem Definition

Setting: Recommender System Evaluation (CTR prediction and Sequential Recommendation)

Inputs: User behavior history sequences and candidate items, represented via IDs, text, embeddings, or semantic identifiers.

Outputs: Predicted click probability (for CTR) or next-item rank list (for SeqRec).

Pipeline Flow

Data Preprocessing (Sequence construction, Negative sampling)
Item Representation Generation (ID, Text, Embedding, Semantic ID)
Model Training/Fine-tuning (Traditional DLRMs vs LLMs via LoRA)
Inference & Evaluation (Score prediction or Constrained Generation)

System Modules

Item Representation Generator

Converts raw item data into model-digestible formats.

Model or implementation: Various (SentenceBERT + RQ-VAE for Semantic IDs; Llama-7B for Embeddings)

Recommender Model

Predicts user preference scores or next items.

Model or implementation: Diverse (DLRMs like DeepFM/SASRec vs LLMs like Llama-3-8B/P5)

Conditional Beam Search Decoder

Ensures generated semantic IDs map to valid items during list-wise inference.

Model or implementation: Algorithmic Constraint

Novel Architectural Elements

Systematic integration of Semantic Identifiers with Conditional Beam Search (CBS) to constrain LLM generation to valid item trees during sequential recommendation.
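
The constraint described above can be sketched as a beam search restricted to a prefix tree of valid semantic IDs. This is a minimal illustration, not the paper's implementation: `build_trie`, `constrained_beam_search`, and the `score_fn` callback (standing in for the LLM's next-token log-probabilities) are hypothetical names, and all items are assumed to have semantic IDs of the same length.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def build_trie(item_codes: Dict[str, Sequence[int]]) -> dict:
    """Build a prefix tree over the semantic IDs of all valid items."""
    root: dict = {}
    for item, codes in item_codes.items():
        node = root
        for tok in codes:
            node = node.setdefault(tok, {})
        node["item"] = item  # leaf marker; a str key cannot clash with int tokens
    return root


def constrained_beam_search(
    trie: dict,
    score_fn: Callable[[Tuple[int, ...]], Dict[int, float]],
    depth: int,
    beam_width: int,
) -> List[Tuple[str, float]]:
    """Return up to beam_width (item, log-prob) pairs, best first.

    score_fn(prefix) is an assumed stand-in for the LLM: it maps a token
    prefix to log-probabilities of the next token.
    """
    beams = [((), 0.0, trie)]  # (prefix, cumulative log-prob, trie node)
    for _ in range(depth):
        candidates = []
        for prefix, logp, node in beams:
            next_logps = score_fn(prefix)
            # Only expand tokens that continue some valid item's identifier.
            for tok, child in node.items():
                if tok == "item":
                    continue
                tok_logp = next_logps.get(tok, -math.inf)
                candidates.append((prefix + (tok,), logp + tok_logp, child))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return [(node["item"], logp) for _, logp, node in beams if "item" in node]
```

Because expansion only follows edges that exist in the trie, a high-probability but invalid token sequence can never enter the beam, so every returned sequence maps to a real item.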

Modeling

Base Model: Evaluates 17 LLMs including Llama-3-8B, Qwen-1.5B, BERT-base, OPT-350M, RecGPT

Training Method: Supervised Fine-Tuning with LoRA (Low-Rank Adaptation)

Objective Functions:

Purpose: Optimize binary classification for click prediction.

Formally: Binary Cross-Entropy Loss (Eq 6 in paper).

Purpose: Optimize next-token prediction for sequential recommendation.

Formally: Categorical Cross-Entropy Loss (Eq 7 in paper).
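
For reference, the textbook forms of these two losses are given below. The notation is an assumption and may differ from Eq 6 and Eq 7 in the paper: $y_i \in \{0, 1\}$ is the click label, $\hat{y}_i$ the predicted click probability, $x_{i,t}$ the $t$-th target token of sample $i$, and $p_\theta$ the model's next-token distribution.

```latex
\mathcal{L}_{\text{BCE}}
  = -\frac{1}{N} \sum_{i=1}^{N}
    \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]

\mathcal{L}_{\text{CE}}
  = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T_i}
    \log p_\theta\!\left( x_{i,t} \mid x_{i,<t} \right)
```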

Adaptation: LoRA (rank=32, alpha=128 for Pair-wise; rank=128, alpha=128 for List-wise)

Trainable Parameters: LoRA parameters only
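
To make the two LoRA settings concrete, the sketch below computes the scaling factor and the trainable-parameter count for a single adapted weight matrix. It assumes the conventional LoRA scaling of alpha / rank and a hypothetical 4096 x 4096 projection; the summary does not state which layers are adapted, and `LoRAConfig` is an illustrative name, not a library class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoRAConfig:
    rank: int
    alpha: int

    @property
    def scale(self) -> float:
        # Conventional LoRA scaling applied to the low-rank update B @ A.
        return self.alpha / self.rank

    def trainable_params(self, d_in: int, d_out: int) -> int:
        # A has shape (rank, d_in) and B has shape (d_out, rank).
        return self.rank * (d_in + d_out)


PAIR_WISE = LoRAConfig(rank=32, alpha=128)   # CTR setting from the summary
LIST_WISE = LoRAConfig(rank=128, alpha=128)  # SeqRec setting from the summary

D = 4096  # hypothetical hidden size, for illustration only
for name, cfg in [("pair-wise", PAIR_WISE), ("list-wise", LIST_WISE)]:
    added = cfg.trainable_params(D, D)
    share = added / (D * D)
    print(f"{name}: scale={cfg.scale}, trainable={added} ({share:.2%} of the frozen matrix)")
```

Under these assumptions the list-wise setting trains four times as many adapter parameters per matrix as the pair-wise setting, while applying a smaller scaling factor to the update.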

Training Data:

5 Datasets: H&M (Fashion), MIND (News), MicroLens (Video), Goodreads (Books), Amazon CDs (Music).
Uniform preprocessing brings the datasets to approximately similar sizes; user sequences are truncated to a maximum length of 20.

Key Hyperparameters:

learning_rate_llm: 1e-4
learning_rate_dlrm: 1e-3
batch_size_dlrm: 5000
batch_size_llm_large: 16 (for 7B models)
batch_size_llm_small: 64 (for <7B models)
codebook_layers: 4
codebook_size: 256
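
The codebook settings determine the shape of each semantic identifier: with 4 layers of 256 codes, every item becomes a 4-token sequence drawn from up to 256^4 (about 4.3 billion) combinations. The sketch below shows only the residual quantization step on an already-computed embedding, using toy hand-written codebooks; learning the codebooks (the RQ-VAE training itself) is omitted, and the function names are illustrative.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], vec: Vector) -> int:
    """Index of the codeword closest to vec in squared Euclidean distance."""
    def sq_dist(code: Vector) -> float:
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(vec: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize vec layer by layer; each layer encodes what the previous ones missed."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


# Toy example: 2 layers of 2 codes each (the summary's setting is 4 layers of 256).
toy_codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (0.2, -0.2)],
]
print(residual_quantize((1.2, 0.8), toy_codebooks))
```

Items with similar embeddings tend to share leading codes, which is what gives the identifiers their hierarchical structure.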

Compute: Single Nvidia A100 GPU

Comparison to Prior Work

vs. LLMRec: RecBench includes semantic identifiers and spans 5 diverse domains instead of just one.
vs. OpenP5: RecBench incorporates semantic embeddings and semantic IDs, not just unique IDs.
vs. TallRec [not cited in paper]: RecBench evaluates a wider range of base models (17 vs usually 1-2) and includes semantic ID analysis.

Limitations

Inference efficiency of LLMs is extremely low compared to DLRMs, hindering real-time deployment.
Requires converting all items to text or semantic IDs, which may be costly for massive catalogs.
Evaluation is limited to public datasets which may not fully reflect industrial scale data distribution.

Reproducibility

Code: https://recbench.github.io

Code is publicly available (https://recbench.github.io). Data is available via the Kaggle link provided in the abstract. The code includes implementations of all baselines and the LLM fine-tuning scripts.

📊 Experiments & Results

Evaluation Setup

Unified evaluation on 5 datasets across 2 tasks (CTR and SeqRec).

Benchmarks:

H&M (Fashion Recommendation)
MIND (News Recommendation)
MicroLens (Video Recommendation)
Goodreads (Book Recommendation)
Amazon CDs (Music Recommendation)

Metrics:

GAUC (Group AUC)
NDCG@10
Inference Latency (ms)
Statistical methodology: Results averaged over 5 runs; significance tested at p < 0.05.
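
For next-item prediction, NDCG@10 reduces to a simple form when each test case has exactly one relevant item, because the ideal DCG is then 1. The sketch below assumes that single-target setting, which the summary does not state explicitly; `ndcg_at_k` is an illustrative helper.

```python
import math
from typing import Hashable, Sequence


def ndcg_at_k(ranked_items: Sequence[Hashable], target: Hashable, k: int = 10) -> float:
    """NDCG@k with a single relevant item: 1 / log2(rank + 1), or 0 if outside the top k."""
    for pos, item in enumerate(ranked_items[:k]):
        if item == target:
            return 1.0 / math.log2(pos + 2)  # pos is 0-based, so rank = pos + 1
    return 0.0
```

A target ranked first scores 1.0, and the score decays logarithmically with rank until it drops to 0 beyond position k.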

Experiment Figures

Performance (GAUC/NDCG) vs Efficiency (Latency) trade-off scatter plots for different model groups.

Main Takeaways

LLMs generally outperform conventional recommenders in accuracy, with semantic identifiers often yielding the best results in sequential tasks.
Efficiency is the major bottleneck: LLM inference is orders of magnitude slower than DLRMs.
LLM-for-RS (using LLM embeddings in DLRMs) offers a strong trade-off, achieving ~95% of LLM-as-RS performance with much lower latency.
Zero-shot performance of LLMs is significantly lower than fine-tuned performance, highlighting the necessity of adaptation.

📚 Prerequisite Knowledge

Prerequisites

Basics of Recommender Systems (CTR, Sequential Recommendation)
Large Language Models (Fine-tuning, LoRA)
Item Representation learning

Key Terms

CTR: Click-Through Rate, the ratio of users who click on an item to the total number of users who view it; treated here as a binary classification task (click or no click).

SeqRec: Sequential Recommendation—predicting the next item a user will interact with based on their historical sequence of interactions.

Semantic Identifier: A representation where items are assigned unique, structured token sequences (often hierarchical) derived from their content semantics, rather than arbitrary integer IDs.

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained model weights and injects trainable rank decomposition matrices.

Conditional Beam Search: A constrained decoding strategy used during inference to ensure that the sequence of tokens generated by the LLM corresponds to a valid item identifier in the pre-defined tree.

RQ-VAE: Residual Quantized Variational AutoEncoder—a method used to discretize continuous embeddings into a sequence of discrete codes (tokens) to create semantic identifiers.

LLM-for-RS: Using LLMs as feature extractors or auxiliary modules to enhance traditional recommendation models.

LLM-as-RS: Using LLMs directly as the recommender system, typically taking natural language prompts as input and generating recommendations.

GAUC: Group AUC—a variation of Area Under the Curve that calculates AUC for each user individually and then averages them, often weighted by impression count.
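
The GAUC definition above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the benchmark's evaluation code: per-user AUC is computed by pairwise comparison, users are weighted by impression count, and users whose impressions are all clicks or all non-clicks are skipped because their AUC is undefined.

```python
from collections import defaultdict
from typing import Hashable, List, Optional, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> Optional[float]:
    """Pairwise AUC; returns None when only one class is present."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return None
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gauc(user_ids: Sequence[Hashable], labels: Sequence[int], scores: Sequence[float]) -> float:
    """Impression-weighted average of per-user AUC."""
    groups = defaultdict(lambda: ([], []))
    for user, label, score in zip(user_ids, labels, scores):
        groups[user][0].append(label)
        groups[user][1].append(score)
    weighted_sum, total_weight = 0.0, 0
    for user_labels, user_scores in groups.values():
        user_auc = auc(user_labels, user_scores)
        if user_auc is None:
            continue  # skip users with a single class
        weighted_sum += user_auc * len(user_labels)
        total_weight += len(user_labels)
    return weighted_sum / total_weight if total_weight else float("nan")
```

Unlike global AUC, this measures ranking quality within each user's own impressions, so a model cannot score well merely by separating active users from inactive ones.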