Understanding the Role of Cross-Entropy Loss in Fairly Evaluating Large Language Model-based Recommendation

📝 Paper Summary

Recommendation Systems, Sequential Recommendation, Loss Function Analysis

The paper demonstrates that the perceived superiority of LLM-based recommenders stems from unfair loss function comparisons and proposes Scaled Cross-Entropy (SCE) to enable conventional models to achieve comparable or superior performance efficiently.

Core Problem

LLM-based recommenders are often compared to conventional models (like SASRec) that are trained with inferior losses (BCE/BPR) rather than the Cross-Entropy loss used by LLMs, creating an illusion of LLM superiority.

Why it matters:

Researchers may be overestimating LLM capabilities and underestimating conventional methods because of unfair benchmarking
Full Cross-Entropy is computationally intractable for large item sets in production, necessitating efficient but accurate approximations
Existing approximations like Noise Contrastive Estimation (NCE) suffer from slow convergence and weak bounds in early training

Concrete Example: In current benchmarks, SASRec trained with BCE/BPR underperforms LlamaRec. However, the paper shows that SASRec trained with full Cross-Entropy actually outperforms LlamaRec, indicating that the gap comes from the loss function rather than the model architecture.

Key Novelty

Scaled Cross-Entropy (SCE) for Recommendation

Theoretically proves that minimizing Cross-Entropy (CE) maximizes a lower bound of ranking metrics (NDCG and RR), explaining why CE is superior to BCE/BPR for ranking
Identifies that standard Noise Contrastive Estimation (NCE) provides a weak bound during early training, leading to slow convergence
Proposes SCE: a sampled softmax loss where the negative term is scaled up by a weight factor to maintain a tight bound on the ranking metric even with few samples
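The link between cross-entropy and ranking metrics can be illustrated with a short derivation. This is a standard argument under simplifying assumptions (a single target item v_+, untruncated RR and NDCG, r denoting the 1-based rank of v_+, and s_v the score of item v); the paper's exact statement and proof may differ. It uses the facts that an indicator 1[x >= 0] is at most e^x and that log_2(1 + r) <= r for r >= 1.

```latex
% Assumed notation: r = rank of the target v_+, s_v = score of item v, I = item set.
\begin{align*}
r &= 1 + \sum_{v \ne v_+} \mathbb{1}[s_v \ge s_{v_+}]
   \;\le\; 1 + \sum_{v \ne v_+} e^{s_v - s_{v_+}}
   \;=\; \frac{\sum_{v \in I} e^{s_v}}{e^{s_{v_+}}}, \\
\log r &\le -\log \frac{e^{s_{v_+}}}{\sum_{v \in I} e^{s_v}} \;=\; L_{CE}, \\
\mathrm{RR} &= \frac{1}{r} \;\ge\; e^{-L_{CE}}, \qquad
\mathrm{NDCG} = \frac{1}{\log_2(1 + r)} \;\ge\; \frac{1}{r} \;\ge\; e^{-L_{CE}}.
\end{align*}
```

Under these assumptions, driving L_CE down pushes up a lower bound on both metrics, which is the sense in which CE is aligned with ranking quality.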

Architecture

Comparison of ranking performance between SASRec trained with different losses (CE, BCE, BPR) and LLM-based methods

Evaluation Highlights

SASRec trained with Cross-Entropy outperforms LLM-based methods by a large margin (Figure 1 qualitative result)
Scaled Cross-Entropy (SCE) with only 100 negative samples and scaling factor 100 achieves comparable performance to standard sampled softmax with 500 samples on the Beauty dataset
Standard NCE requires ~150 epochs to converge on Beauty, while NEG (Negative Sampling) converges in ~70 epochs, highlighting NCE's training difficulties

Breakthrough Assessment

7/10

Provides a critical correction to the evaluation methodology of LLMs in RecSys. While the architectural contribution is a loss modification, the impact on fair benchmarking is significant.

⚙️ Technical Details

Problem Definition

Setting: Sequential Recommendation (Next-item prediction)

Inputs: Sequence of historical item interactions q = [v_1, v_2, ..., v_t]

Outputs: The next item v_{t+1} from the item set I

Pipeline Flow

Input Sequence Processing (Embeddings)
Sequential Modeling (SASRec/Transformer)
Score Calculation (Dot Product)
Loss Computation (Scaled Cross-Entropy)
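The four steps above can be sketched end to end in plain Python. This is a toy illustration, not the authors' code: the mean-pooling `encode` function is a hypothetical stand-in for SASRec's self-attention encoder, the embeddings are random, and uniform negative sampling is an assumption.

```python
import math
import random

def encode(sequence, item_emb):
    """Stand-in encoder: mean of item embeddings (NOT SASRec's self-attention)."""
    dim = len(item_emb[0])
    return [sum(item_emb[v][d] for v in sequence) / len(sequence) for d in range(dim)]

def score(user_vec, item_vec):
    """Dot-product score between the sequence representation and an item."""
    return sum(u * i for u, i in zip(user_vec, item_vec))

def sce_loss(s_pos, s_negs, alpha):
    """-log(e^s_pos / (e^s_pos + alpha * sum(e^s_neg))), shifted by the max for stability."""
    m = max([s_pos] + s_negs)
    denom = math.exp(s_pos - m) + alpha * sum(math.exp(s - m) for s in s_negs)
    return math.log(denom) - (s_pos - m)

random.seed(0)
num_items, dim, K, alpha = 50, 8, 10, 100.0   # toy sizes, not the paper's settings
item_emb = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_items)]

history, target = [3, 17, 42], 7              # q = [v_1, ..., v_t] and v_{t+1}
user_vec = encode(history, item_emb)          # steps 1-2: embeddings + sequence model
candidates = [v for v in range(num_items) if v != target]
negatives = random.sample(candidates, K)      # uniform sampling (assumption)
s_pos = score(user_vec, item_emb[target])     # step 3: dot-product scores
s_negs = [score(user_vec, item_emb[v]) for v in negatives]
loss = sce_loss(s_pos, s_negs, alpha)         # step 4: scaled cross-entropy
```

In a real training loop the loss would be backpropagated through the encoder and the item embeddings; that part is omitted here.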

System Modules

Sequential Encoder

Encodes user interaction history into a latent vector

Model or implementation: SASRec

Loss Function

Calculates error gradients to update the encoder

Model or implementation: Scaled Cross-Entropy (SCE)

Novel Architectural Elements

Scaled Cross-Entropy (SCE) Loss: Modifies sampled softmax by multiplying the sampled normalizing term by a weight alpha, which compensates for the denominator being too small when only a few negatives are sampled and improves the ranking bounds

Modeling

Base Model: SASRec (used as the primary conventional baseline to demonstrate the effect of loss functions)

Training Method: Supervised learning on interaction sequences

Objective Functions:

Purpose: Approximate full cross-entropy efficiently while maintaining tight ranking bounds.

Formally: L_SCE = -log( exp(s_pos) / ( exp(s_pos) + alpha * sum_{j=1..K} exp(s_neg_j) ) )
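The formula above translates fairly directly into code. The following is a minimal per-example sketch working in log space for numerical stability; the function names are illustrative, and whether the paper applies any further sampling correction is not covered here.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sce_loss(s_pos, s_negs, alpha=100.0):
    """Scaled cross-entropy for one example, following the formula above."""
    # alpha * sum(exp(s_neg)) == exp(log(alpha) + logsumexp(s_negs))
    neg_term = math.log(alpha) + logsumexp(s_negs)
    return logsumexp([s_pos, neg_term]) - s_pos

def full_ce_loss(s_pos, s_all_negs):
    """Full cross-entropy is the special case alpha = 1 over every non-target item."""
    return sce_loss(s_pos, s_all_negs, alpha=1.0)

s_pos, s_negs = 2.0, [0.5, -1.0, 0.1]
scaled = sce_loss(s_pos, s_negs, alpha=100.0)
unscaled = sce_loss(s_pos, s_negs, alpha=1.0)  # plain sampled softmax on the same negatives
```

Because alpha multiplies only the negative term, a larger alpha always yields a larger loss for the same scores, so the model is pushed harder to separate the positive from the sampled negatives.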

Training Data:

Beauty (Amazon)
Sports (Amazon)
Toys (Amazon)
MovieLens-1M

Key Hyperparameters:

alpha: 100 (scaling weight for SCE)
K: 100 or 500 (number of negative samples)
c: 10 (constant estimate for NCE variants)

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. LlamaRec/E4SRec: Demonstrates that simple SASRec with proper loss (SCE/CE) outperforms these complex LLM methods
vs. Sampled Softmax: SCE adds a scaling factor alpha to correct the magnitude of the denominator, providing better gradients for ranking

Limitations

The theoretical analysis assumes the existence of an ideal scoring function class
Scaling factor alpha introduces a high variance problem if the number of negative samples K is too small (e.g., < 10)
The comparison focuses heavily on next-item recommendation, so the conclusions may not carry over to other recommendation tasks
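The variance limitation listed above can be illustrated with a small simulation. This is a sketch on synthetic Gaussian scores, not the paper's data or experiment: it resamples K negatives many times and measures how much the SCE loss fluctuates across draws. The score distribution, pool size, and trial count are all assumptions.

```python
import math
import random
import statistics

def sce_loss(s_pos, s_negs, alpha):
    """-log(e^s_pos / (e^s_pos + alpha * sum(e^s_neg))), shifted by the max for stability."""
    m = max([s_pos] + s_negs)
    denom = math.exp(s_pos - m) + alpha * sum(math.exp(s - m) for s in s_negs)
    return math.log(denom) - (s_pos - m)

def loss_std(sample_sets, s_pos, alpha):
    """Standard deviation of the loss across different draws of negatives."""
    return statistics.stdev([sce_loss(s_pos, negs, alpha) for negs in sample_sets])

rng = random.Random(0)
neg_scores = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # synthetic negative scores

# 500 independent draws of K negatives for a small and a larger K.
sets_k5 = [rng.sample(neg_scores, 5) for _ in range(500)]
sets_k100 = [rng.sample(neg_scores, 100) for _ in range(500)]

std_k5 = loss_std(sets_k5, s_pos=2.0, alpha=100.0)
std_k100 = loss_std(sets_k100, s_pos=2.0, alpha=100.0)
std_k5_unscaled = loss_std(sets_k5, s_pos=2.0, alpha=1.0)  # same draws, no scaling
```

In this toy setting the loss is noisier with K=5 than with K=100, and on the same K=5 draws it is noisier with alpha=100 than with alpha=1, since scaling gives the noisy sampled term more weight relative to the fixed positive term.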

Reproducibility

Code: https://github.com/MTandHJ/CE-SCE-LLMRec

The paper appears to provide its theoretical proofs in appendices and specifies the hyperparameters for the proposed loss (alpha=100).

📊 Experiments & Results

Evaluation Setup

Next-item prediction on sequential interaction data

Benchmarks:

Beauty (Sequential Recommendation)
Sports (Sequential Recommendation)
Toys (Sequential Recommendation)
MovieLens-1M (Sequential Recommendation)

Metrics:

NDCG@10
Reciprocal Rank (RR)
Statistical methodology: Results summarized based on 5 independent runs
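For next-item prediction with a single ground-truth item, both metrics reduce to simple functions of the target's rank. The sketch below assumes that single-target setting and counts ties against the target (a pessimistic choice); the paper's exact evaluation protocol, such as any truncation applied to RR, is not specified here.

```python
import math

def rank_of_target(scores, target):
    """1-based rank of the target item; ties are counted against the target."""
    s_t = scores[target]
    return 1 + sum(1 for i, s in enumerate(scores) if i != target and s >= s_t)

def reciprocal_rank(rank):
    """RR = 1 / rank of the (single) correct item."""
    return 1.0 / rank

def ndcg_at_k(rank, k=10):
    """NDCG@k with one relevant item: ideal DCG is 1, so NDCG = 1 / log2(1 + rank)."""
    return 1.0 / math.log2(1 + rank) if rank <= k else 0.0

scores = [0.1, 0.9, 0.4, 0.7]          # toy scores for 4 items
rank = rank_of_target(scores, target=3)  # item 3 is beaten only by item 1 -> rank 2
rr = reciprocal_rank(rank)               # 0.5
ndcg = ndcg_at_k(rank, k=10)             # 1 / log2(3), about 0.631
```

Reported numbers would then be averages of these per-example values over all test sequences and, per the methodology above, over the 5 independent runs.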

Key Results

Convergence analysis demonstrates that NCE suffers from training difficulties compared to Negative Sampling (NEG) and the proposed methods.

Benchmark	Metric	NCE	NEG	Δ
Beauty	Epochs to Convergence (approx.)	150	70	-80

Efficiency analysis of the proposed Scaled Cross-Entropy (SCE) shows it achieves comparable performance with fewer samples.

Benchmark	Metric	Sampled Softmax	SCE (alpha=100)	Δ
Beauty	Negative Samples Required	500	100	-400

Experiment Figures

Impact of dynamic truncation (eta) on NDCG@10 performance for SASRec

Training curves (NDCG@10 vs Epochs) for NCE and NEG with different numbers of negative samples

Main Takeaways

Minimizing cross-entropy is theoretically equivalent to maximizing a lower bound of NDCG and RR, explaining its superiority over BCE/BPR
Standard NCE with constant c=1 suffers from 'training difficulties' where sampling more negatives delays convergence due to weak bounds in early training
Proposed Scaled Cross-Entropy (SCE) effectively approximates full softmax by scaling up the sampled normalizing term (e.g., alpha=100), allowing high performance with few negative samples
Conventional models (SASRec) trained with CE/SCE are competitive with or superior to LLM-based recommenders, suggesting previous claims of LLM superiority were due to unfair loss function comparisons

📚 Prerequisite Knowledge

Prerequisites

Basics of Recommender Systems (Collaborative Filtering)
Loss functions (Cross-Entropy, BCE, BPR)
Sequential models (Transformers/SASRec)

Key Terms

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that gives more credit to correct items appearing at the top of the list

RR: Reciprocal Rank—a ranking metric that is the inverse of the rank of the first correct item (1/rank)

BCE: Binary Cross-Entropy—a loss function treating recommendation as independent binary classification tasks (relevant vs. not relevant)

BPR: Bayesian Personalized Ranking—a pairwise loss function that optimizes the relative order of positive and negative items

NCE: Noise Contrastive Estimation—a method to approximate the normalizing term in softmax by discriminating observed data from noise

SCE: Scaled Cross-Entropy—the paper's proposed loss function which scales up the sampled denominator term to approximate full softmax better

SASRec: Self-Attentive Sequential Recommendation—a standard Transformer-based baseline model for sequential recommendation
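The BCE, BPR, and cross-entropy terms above can be contrasted with per-example formulas. The sketch below uses the textbook definitions on raw scores (one positive and one negative for BCE and BPR); the exact variants used in the paper's baselines may differ in sampling and reduction details.

```python
import math

def softplus(x):
    """log(1 + e^x), computed stably."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bce_loss(s_pos, s_neg):
    """-log sigmoid(s_pos) - log(1 - sigmoid(s_neg)): two independent binary decisions."""
    return softplus(-s_pos) + softplus(s_neg)

def bpr_loss(s_pos, s_neg):
    """-log sigmoid(s_pos - s_neg): only the pairwise score difference matters."""
    return softplus(s_neg - s_pos)

def ce_loss(s_pos, s_others):
    """-log softmax of the positive against all other items."""
    all_scores = [s_pos] + s_others
    m = max(all_scores)
    return m + math.log(sum(math.exp(s - m) for s in all_scores)) - s_pos

pairwise = bpr_loss(1.3, 0.4)
single_negative_ce = ce_loss(1.3, [0.4])   # coincides with BPR for one negative
listwise_ce = ce_loss(1.3, [0.4, 0.9, -0.2])
```

With a single negative, cross-entropy coincides with BPR; the difference emerges when the positive has to compete against many items in one normalizing term.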