Are LLM-based Recommenders Already the Best? Simple Scaled Cross-entropy Unleashes the Potential of Traditional Sequential Recommenders

📝 Paper Summary

Sequential Recommendation Loss Functions

Traditional sequential recommenders trained with Cross-Entropy loss outperform LLM-based recommenders, and a new Scaled Cross-Entropy loss allows them to scale efficiently while maintaining this superiority.

Core Problem

Prior comparisons between LLM-based recommenders and traditional models are unfair because traditional models are typically trained with suboptimal pointwise/pairwise losses (BCE/BPR) while LLMs use Cross-Entropy (CE).

Why it matters:

Leads to over-confidence in LLM ranking capabilities and massive computational waste
Underestimates the potential of efficient traditional architectures like SASRec when properly optimized
Existing sampling methods for large item spaces (like sampled softmax) often degrade performance due to poor tightness

Concrete Example: When trained with BCE, SASRec underperforms LlamaRec on the Beauty dataset. However, simply switching SASRec's loss to full softmax Cross-Entropy allows it to surpass LlamaRec, revealing that the performance gap was due to the loss function, not model architecture.

Key Novelty

Scaled Cross-Entropy (SCE) for Sequential Recommendation

Demonstrates theoretically that an ideal recommendation loss requires both 'tightness' (good proxy for ranking metrics) and 'coverage' (sufficient negative samples)
Proposes Scaled Cross-Entropy (SCE) which scales up the sampled normalization term to approximate the tightness of full softmax while maintaining efficiency
Re-benchmarks traditional models (SASRec, FMLP-Rec) with CE/SCE, showing empirically that they outperform fine-tuned LLMs

Architecture

A conceptual illustration distinguishing different loss functions based on Tightness and Coverage properties.

Evaluation Highlights

Traditional SASRec with Cross-Entropy outperforms fine-tuned LlamaRec (7B) by ~23% on Beauty dataset (NDCG@5: 0.0886 vs 0.0718)
Proposed SCE loss matches full Cross-Entropy performance with only 500 negative samples, while standard Sampled Softmax degrades significantly
FMLP-Rec with Cross-Entropy achieves state-of-the-art results on Yelp, surpassing P5 and LlamaRec
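
The "~23%" figure in the first highlight is a relative gain, which can be checked from the two NDCG@5 values quoted there. A minimal sketch follows; the helper name `relative_gain` is introduced here for illustration only.

```python
def relative_gain(baseline: float, ours: float) -> float:
    """Relative improvement of `ours` over `baseline`, as a fraction."""
    return (ours - baseline) / baseline

# NDCG@5 on Beauty, as quoted above: LlamaRec (7B) vs. SASRec trained with CE.
llamarec_ndcg5 = 0.0718
sasrec_ce_ndcg5 = 0.0886

absolute = sasrec_ce_ndcg5 - llamarec_ndcg5
gain = relative_gain(llamarec_ndcg5, sasrec_ce_ndcg5)

print(f"absolute gain: {absolute:.4f}")  # 0.0168
print(f"relative gain: {gain:.1%}")      # 23.4%
```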

Breakthrough Assessment

8/10

Strongly challenges the prevailing narrative that LLMs are superior for sequential recommendation by exposing a fundamental flaw in baseline comparisons. Offers a simple, effective fix (SCE) that restores the viability of traditional models.

⚙️ Technical Details

Problem Definition

Setting: Sequential recommendation, i.e., predicting the next item v_{t+1} given a sequence of historical interactions q = [v_1, ..., v_t].

Inputs: User interaction sequence q

Outputs: Ranked list of items from the candidate set I

Pipeline Flow

Input Sequence Processing
Encoder (Transformer/CNN/RNN/MLP)
Scoring Function (Inner Product)
Loss Computation (CE or SCE)
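
The four pipeline steps can be sketched end to end in NumPy. This is an illustration under stated assumptions, not the paper's implementation: the catalogue size (1000), the example sequence, and the mean-pooling `encode` function are hypothetical stand-ins; a real run would use SASRec or one of the other encoders listed in this summary.

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, hidden = 1000, 64  # hypothetical catalogue size; 64 matches hidden_units
item_emb = rng.normal(scale=0.1, size=(num_items, hidden))

def encode(seq: np.ndarray) -> np.ndarray:
    """Stand-in encoder: mean of item embeddings (NOT SASRec's Transformer)."""
    return item_emb[seq].mean(axis=0)

def score_all(user_vec: np.ndarray) -> np.ndarray:
    """Inner-product score between the user vector and every item embedding."""
    return item_emb @ user_vec

def cross_entropy(scores: np.ndarray, target: int) -> float:
    """Full-softmax CE: logsumexp(scores) - scores[target]."""
    m = scores.max()
    logsumexp = m + np.log(np.exp(scores - m).sum())
    return float(logsumexp - scores[target])

history = np.array([3, 17, 256, 42])  # q = [v_1, ..., v_t] (made-up item ids)
target = 99                           # v_{t+1} (made-up item id)

scores = score_all(encode(history))   # steps 1-3: process, encode, score
loss = cross_entropy(scores, target)  # step 4: loss computation
ranked = np.argsort(-scores)          # output: items ranked by descending score
```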

System Modules

Encoder

Encodes user interaction history into a latent vector

Model or implementation: SASRec (Transformer), FMLP-Rec (Filter-enhanced MLP), GRU4Rec (RNN), or Caser (CNN)

Scoring Function

Computes relevance scores between user embedding and all item embeddings

Model or implementation: Inner Product

Novel Architectural Elements

Integration of Scaled Cross-Entropy (SCE) loss mechanism into traditional recommender training pipelines to approximate full softmax dynamics efficiently

Modeling

Base Model: SASRec (Transformer-based sequential recommender)

Training Method: Supervised learning on interaction sequences

Objective Functions:

Purpose: Approximate full softmax Cross-Entropy efficiently.

Formally: SCE uses a sampled normalization term scaled by |I|/m, where |I| is total items and m is sample size.
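
A minimal NumPy sketch of this idea, assuming the loss takes the form -log( exp(s_pos) / (exp(s_pos) + (|I|/m) * sum of exp(s_neg) over m sampled negatives) ). The exact form in the paper may differ; uniform sampling without replacement, the random scores, and all function names are assumptions made for illustration. Sampled softmax (SSM) is included for contrast and is the same expression with the scale set to 1.

```python
import numpy as np

def logsumexp(x: np.ndarray) -> float:
    """Numerically stable log(sum(exp(x)))."""
    m = x.max()
    return float(m + np.log(np.exp(x - m).sum()))

def full_ce(scores: np.ndarray, target: int) -> float:
    """Cross-Entropy with the normalization term over all |I| items."""
    return logsumexp(scores) - float(scores[target])

def sampled_losses(scores, target, m, rng):
    """Return (SSM, SCE) computed from the same m uniformly sampled negatives."""
    num_items = scores.shape[0]
    candidates = np.delete(np.arange(num_items), target)   # exclude the positive
    negs = rng.choice(candidates, size=m, replace=False)   # assumption: no replacement
    pos = float(scores[target])
    neg_lse = logsumexp(scores[negs])
    ssm = np.logaddexp(pos, neg_lse) - pos                 # unscaled normalization
    scale = num_items / m                                  # |I| / m
    sce = np.logaddexp(pos, np.log(scale) + neg_lse) - pos # scaled normalization
    return float(ssm), float(sce)

rng = np.random.default_rng(0)
scores = rng.normal(size=10_000)  # synthetic scores, not real model outputs
target, m = 7, 500

ce = full_ce(scores, target)
ssm, sce = sampled_losses(scores, target, m, rng)
print(f"CE={ce:.3f}  SSM={ssm:.3f}  SCE={sce:.3f}")
```

On synthetic scores like these, the scaled term should land much closer to full CE than the unscaled one, since the unscaled sum underestimates the full normalization term by roughly a factor of |I|/m.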

Key Hyperparameters:

batch_size: 256
learning_rate: 1e-3
max_sequence_len: 50
hidden_units: 64
num_blocks: 2
num_heads: 2
dropout_rate: 0.5
l2_regularization: 0
sample_size_m: 500 (for SCE)
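
For reuse in a training script, the hyperparameters above could be gathered into one immutable config object. The sketch below copies the values as listed here; the class name `TrainConfig` and the divisibility check are illustrative additions, not part of the paper's code.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as listed in this summary (class name is hypothetical)."""
    batch_size: int = 256
    learning_rate: float = 1e-3
    max_sequence_len: int = 50
    hidden_units: int = 64
    num_blocks: int = 2
    num_heads: int = 2
    dropout_rate: float = 0.5
    l2_regularization: float = 0.0
    sample_size_m: int = 500  # number of sampled negatives for SCE

    def __post_init__(self) -> None:
        # Multi-head attention splits hidden_units evenly across heads.
        if self.hidden_units % self.num_heads != 0:
            raise ValueError("hidden_units must be divisible by num_heads")

cfg = TrainConfig()
print(asdict(cfg))
```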

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. LLM-based (P5, LlamaRec): Traditional models with CE/SCE are smaller, faster, and achieve higher accuracy
vs. Sampled Softmax: SCE includes a scaling factor to correct the normalization term, improving tightness to ranking metrics [not cited in paper]

Limitations

SCE's approximation quality depends on how uniform the item distribution is, because the theoretical justification assumes uniform sampling
Analysis is limited to ID-based sequential recommendation; does not cover multi-modal or text-rich scenarios where LLMs might have an edge
Experiments limited to two datasets (Beauty, Yelp)

Reproducibility

Code: https://github.com/MTandHJ/CE-SCE-LLMRec

Code is publicly available. Datasets (Beauty, Yelp) are standard public benchmarks. Implementation details for the baselines and the proposed method are provided in the Appendix.

📊 Experiments & Results

Evaluation Setup

Next-item prediction on sequential interaction data

Benchmarks:

Beauty (Sequential Recommendation)
Yelp (Sequential Recommendation)

Metrics:

NDCG@5
NDCG@10
MRR@5
MRR@10
Statistical methodology: Reported average over 5 independent runs
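
For next-item prediction there is a single relevant item per sequence, so both metrics reduce to simple functions of that item's rank. The sketch below assumes this single-target setting and breaks ties optimistically (only strictly higher scores count against the target); the paper's exact evaluation protocol is not described in this summary, and the scores are made up.

```python
import math

def rank_of_target(scores: list[float], target: int) -> int:
    """1-based rank of the target item; only strictly higher scores outrank it."""
    return 1 + sum(s > scores[target] for s in scores)

def ndcg_at_k(rank: int, k: int) -> float:
    """With one relevant item the ideal DCG is 1, so NDCG@k = 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0

def mrr_at_k(rank: int, k: int) -> float:
    """Reciprocal rank of the target, truncated to zero beyond position k."""
    return 1.0 / rank if rank <= k else 0.0

scores = [0.1, 0.9, 0.4, 0.7, 0.2]  # made-up scores for five items
target = 3                          # item 3 has the second-highest score

rank = rank_of_target(scores, target)
print(rank, ndcg_at_k(rank, 5), mrr_at_k(rank, 5))
```

Reported numbers are then averages of these per-sequence values over all test sequences.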

Key Results

Comparison showing traditional models with standard losses (BCE/BPR) underperform LLMs, but surpass them when using Cross-Entropy (CE).
Benchmark	Metric	Baseline	This Paper	Δ
Beauty	NDCG@5	0.0718	0.0886	+0.0168
Yelp	NDCG@5	0.0543	0.0745	+0.0202

Evaluation of the proposed Scaled Cross-Entropy (SCE) loss against full CE and Sampled Softmax (SSM).
Benchmark	Metric	Baseline	This Paper	Δ
Beauty	NDCG@10	0.0634	0.1035	+0.0401

Experiment Figures

Performance (NDCG@10) of SASRec on Beauty dataset as a function of the truncation parameter eta in a truncated CE loss.

Main Takeaways

The perceived superiority of LLM-based recommenders is largely due to unfair loss function comparisons (CE vs BCE/BPR).
Traditional models (SASRec, FMLP-Rec) possess sufficient capacity for sequential recommendation and outperform LLMs when trained with CE.
Tightness (approximation of ranking metrics) and Coverage (exposure to negatives) are critical properties for recommendation losses.
Scaled Cross-Entropy (SCE) is a highly effective alternative to full softmax, maintaining performance with significantly reduced computational cost.

📚 Prerequisite Knowledge

Prerequisites

Understanding of sequential recommendation models (SASRec, BERT4Rec)
Familiarity with loss functions: Binary Cross Entropy (BCE), Bayesian Personalized Ranking (BPR), and Cross-Entropy (CE)
Basic knowledge of Large Language Model fine-tuning for recommendation

Key Terms

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items

MRR: Mean Reciprocal Rank—the average of the reciprocal ranks of the first relevant item

BCE: Binary Cross Entropy—a pointwise loss function often used with negative sampling (1 positive vs 1 negative)

BPR: Bayesian Personalized Ranking—a pairwise loss function that optimizes the relative order of positive and negative items

CE: Cross-Entropy—a loss function typically requiring a full softmax over all items, common in LLM training

tightness: How closely a loss function bounds, and therefore serves as a proxy for, discrete ranking metrics like NDCG

coverage: The extent to which the loss function exposes the model to a wide range of negative items during training

SCE: Scaled Cross-Entropy—the proposed loss function that scales the sampled normalization term to improve tightness without full softmax
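
The three baseline losses defined above (BCE, BPR, CE) can be written in a few lines of NumPy for a single training example. This is a sketch using the standard textbook forms with one sampled negative for BCE and BPR; the function names and example scores are illustrative.

```python
import numpy as np

def bce_loss(pos: float, neg: float) -> float:
    """Pointwise: -log(sigmoid(pos)) - log(1 - sigmoid(neg))."""
    # softplus(x) = log(1 + exp(x)) = logaddexp(0, x)
    return float(np.logaddexp(0.0, -pos) + np.logaddexp(0.0, neg))

def bpr_loss(pos: float, neg: float) -> float:
    """Pairwise: -log(sigmoid(pos - neg))."""
    return float(np.logaddexp(0.0, -(pos - neg)))

def ce_loss(scores: np.ndarray, target: int) -> float:
    """Softmax over all items: logsumexp(scores) - scores[target]."""
    m = scores.max()
    return float(m + np.log(np.exp(scores - m).sum()) - scores[target])

scores = np.array([2.0, 0.5, -1.0, 0.0])  # made-up scores; item 0 is the positive
target, neg = 0, 1

print("BCE:", bce_loss(scores[target], scores[neg]))
print("BPR:", bpr_loss(scores[target], scores[neg]))
print("CE: ", ce_loss(scores, target))
```

BCE and BPR each see one negative per example here, while CE compares the positive against every item at once, which is the coverage difference the summary refers to.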