GLoSS: Generative Language Models with Semantic Search for Sequential Recommendation

📝 Paper Summary

Sequential Recommendation, Generative Recommendation

GLoSS combines efficient quantized LLMs for query generation with dense semantic retrieval to recommend items based on textual meaning rather than just ID matching.

Core Problem

Traditional ID-based recommenders struggle with cold-start items and lack generalization, while prior generative approaches often rely on outdated models or lexical matching (BM25) that misses semantic context.

Why it matters:

ID-based methods require retraining for new items and cannot leverage rich metadata descriptions
Existing generative methods like GPT4Rec use lexical matching (BM25), failing to capture the semantic intent of generated queries
Full fine-tuning of LLMs for recommendation is computationally expensive and resource-intensive

Concrete Example: A user buys a 'Cosplay Hair Wig'. A lexical system might only find items with exact word overlaps like 'Hair Wig'. GLoSS generates a descriptive query for the next likely purchase (e.g., 'Spiral Curly Cosplay Wig') and uses semantic search to find conceptually similar items even if the exact words differ.

Key Novelty

Generative Low-rank language model with Semantic Search (GLoSS)

Replaces the lexical matching (BM25) used in prior works like GPT4Rec with dense semantic search, allowing retrieval of items based on meaning rather than just keyword overlap
Uses modern LLaMA-3 models fine-tuned via 4-bit QLoRA, enabling high-quality query generation on a single 24GB GPU, in contrast to prior generative approaches built on older backbones like GPT-2 or T5

Architecture

The GLoSS inference pipeline: from user history serialization to LLM query generation and dense item retrieval.

Evaluation Highlights

+52.8% Recall@5 improvement on Amazon Toys compared to the best performing ID-based baseline (TIGER)
+33.3% Recall@5 improvement on Amazon Beauty compared to ID-based baselines
+29.5% Recall@5 improvement on Amazon Sports compared to E4SRec (LLM-based baseline)

Breakthrough Assessment

8/10

Double-digit relative Recall@5 gains over strong baselines, achieved by modernizing the generative retrieval pipeline with LLaMA-3 and dense search. Particularly effective for cold-start users on Toys and Sports.

⚙️ Technical Details

Problem Definition

Setting: Sequential recommendation, i.e., predicting the next item in a sequence from historical user interactions and item attributes.

Inputs: Chronologically ordered sequence of item descriptions (e.g., titles) {i_1, ..., i_{n-1}}

Outputs: The n-th item likely to be interacted with, retrieved from the catalog

Pipeline Flow

History Serialization (Items to Text)
Query Generation (Fine-tuned LLM)
Semantic Retrieval (Dense Encoder)

System Modules

Generator

Generate a natural language description of the next likely item based on user history

Model or implementation: LLaMA-3 (1B, 3B, or 8B) quantized to 4-bit

Retriever

Encode the generated query and retrieve the most semantically similar items from the catalog

Model or implementation: e5-small-v2 (Dense Encoder)
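
The three pipeline stages and the two modules above can be sketched end to end as follows. This is a minimal illustration, not the authors' implementation: the prompt template, the function names, and the hand-written 3-dimensional vectors are all hypothetical stand-ins. In the real system the generator would be the fine-tuned LLaMA-3 model and the embeddings would come from e5-small-v2.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def serialize_history(titles: Sequence[str]) -> str:
    """Turn chronologically ordered item titles into one prompt string.

    The template is hypothetical; the paper's exact format is not shown here.
    """
    lines = [f"{pos}. {title}" for pos, title in enumerate(titles, start=1)]
    return "Previously purchased items:\n" + "\n".join(lines) + "\nNext item:"


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(
    history: Sequence[str],
    catalog: Sequence[str],
    generate_query: Callable[[str], str],
    embed: Callable[[str], Vector],
    k: int = 5,
) -> list[str]:
    """Serialize history, generate a text query, retrieve the top-k catalog items."""
    query = generate_query(serialize_history(history))
    query_vec = embed(query)
    ranked = sorted(
        catalog,
        key=lambda title: cosine(query_vec, embed(title)),
        reverse=True,
    )
    return ranked[:k]


# Made-up embeddings, chosen so that wig-like items point in a similar direction.
TOY_VECTORS = {
    "Spiral Curly Cosplay Wig": [0.95, 0.15, 0.0],
    "Long Wavy Costume Hairpiece": [0.9, 0.2, 0.0],
    "Wig Cap and Styling Comb": [0.7, 0.6, 0.1],
    "Yoga Mat": [0.0, 0.1, 0.9],
}


def toy_generate_query(prompt: str) -> str:
    """Stand-in for the fine-tuned LLM: always returns one fixed query."""
    return "Spiral Curly Cosplay Wig"


def toy_embed(text: str) -> Vector:
    """Stand-in for the dense encoder: looks up a hand-written vector."""
    return TOY_VECTORS[text]


catalog = ["Yoga Mat", "Wig Cap and Styling Comb", "Long Wavy Costume Hairpiece"]
top = recommend(["Cosplay Hair Wig"], catalog, toy_generate_query, toy_embed, k=2)
print(top)  # ['Long Wavy Costume Hairpiece', 'Wig Cap and Styling Comb']
```

Note that the top result shares no words with the generated query. With real embeddings, this is the behavior that dense retrieval is meant to provide and that keyword matching cannot.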

Novel Architectural Elements

Integration of dense semantic retrieval (e5-small-v2) directly on LLM-generated queries, replacing the standard BM25 lexical matching used in similar generative pipelines (GPT4Rec)

Modeling

Base Model: LLaMA-3 (1B, 3B, 8B variants)

Training Method: Supervised Fine-Tuning (SFT) for next-token prediction on serialized item sequences

Adaptation: QLoRA (4-bit quantization, rank=16, alpha=16)

Trainable Parameters: 11M (1B model), 24M (3B model), 42M (8B model)
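
The trainable-parameter counts can be sanity-checked with a short calculation. A LoRA adapter on a weight matrix of shape d_in x d_out adds rank * (d_in + d_out) parameters. The sketch below rests on two assumptions that the summary does not state: that adapters are attached to all seven linear projections in each transformer block, and that the model dimensions match the publicly released Llama 3.x configurations for these sizes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlamaShape:
    hidden: int   # model (residual stream) width
    kv_dim: int   # output width of the key/value projections (grouped-query attention)
    mlp: int      # feed-forward intermediate width
    layers: int   # number of transformer blocks


# Assumed dimensions, taken from the public Llama 3.x model configs.
SHAPES = {
    "1B": LlamaShape(hidden=2048, kv_dim=512, mlp=8192, layers=16),
    "3B": LlamaShape(hidden=3072, kv_dim=1024, mlp=8192, layers=28),
    "8B": LlamaShape(hidden=4096, kv_dim=1024, mlp=14336, layers=32),
}


def lora_trainable_params(shape: LlamaShape, rank: int = 16) -> int:
    """Count LoRA parameters, assuming adapters on q, k, v, o, gate, up, down."""
    h, kv, m = shape.hidden, shape.kv_dim, shape.mlp
    projections = [
        (h, h),   # q_proj
        (h, kv),  # k_proj
        (h, kv),  # v_proj
        (h, h),   # o_proj
        (h, m),   # gate_proj
        (h, m),   # up_proj
        (m, h),   # down_proj
    ]
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in projections)
    return per_layer * shape.layers


for name, shape in SHAPES.items():
    print(f"{name}: {lora_trainable_params(shape) / 1e6:.1f}M trainable parameters")
# 1B: 11.3M, 3B: 24.3M, 8B: 41.9M

# With rank=16 and alpha=16, the LoRA update is scaled by alpha / rank = 1.0.
lora_scale = 16 / 16
```

Under these assumptions the totals agree with the reported 11M, 24M, and 42M.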

Training Data:

Amazon Beauty, Toys, and Sports datasets
Leave-last-out split strategy
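
A leave-last-out split can be sketched as below. The function name is hypothetical. Holding out the penultimate item for validation is a common convention in sequential recommendation, but whether the paper does so is an assumption here.

```python
from typing import Sequence


def leave_last_out(sequence: Sequence[str]) -> tuple[list[str], str, str]:
    """Split one user's chronologically ordered interactions.

    Returns (training_history, validation_target, test_target).
    Assumption: the last item is the test target and the second-to-last
    item is the validation target.
    """
    if len(sequence) < 3:
        raise ValueError("need at least 3 interactions to split")
    return list(sequence[:-2]), sequence[-2], sequence[-1]


history, val_item, test_item = leave_last_out(["wig", "wig cap", "comb", "hair spray"])
print(history, val_item, test_item)  # ['wig', 'wig cap'] comb hair spray
```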

Key Hyperparameters:

learning_rate: 1e-4
batch_size: 16 (effective)
epochs: 10
context_length: 1024
beam_size: 5
max_new_tokens: 50
weight_decay: 0.01
warmup_steps: 300
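
The beam_size and max_new_tokens settings control beam search decoding at inference time. The toy sketch below shows why a beam wider than 1 can matter. The probability table is invented for illustration and stands in for the LLM's next-token distribution. The example uses a beam of 2 for brevity, whereas the listed setting is 5.

```python
import math

# Invented next-token probabilities, keyed by the sequence generated so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"z": 0.9, "w": 0.1},
}


def beam_search(model, beam_size: int, max_new_tokens: int):
    """Return the surviving beams as (token_tuple, log_probability) pairs, best first."""
    beams = [((), 0.0)]
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            next_probs = model.get(seq)
            if next_probs is None:
                # No continuation is defined: carry the finished sequence forward.
                candidates.append((seq, score))
                continue
            for token, prob in next_probs.items():
                candidates.append((seq + (token,), score + math.log(prob)))
        # Keep only the beam_size highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


greedy = beam_search(TOY_MODEL, beam_size=1, max_new_tokens=2)[0]
wider = beam_search(TOY_MODEL, beam_size=2, max_new_tokens=2)[0]
print(greedy[0], round(math.exp(greedy[1]), 2))  # ('A', 'x') 0.3
print(wider[0], round(math.exp(wider[1]), 2))    # ('B', 'z') 0.36
```

Greedy decoding commits to "A" because it is the most likely first token, and so misses the more probable full sequence that starts with "B".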

Compute: Single RTX A5000 GPU (24GB memory)

Comparison to Prior Work

vs. GPT4Rec: Uses dense semantic search instead of BM25 lexical search; uses LLaMA-3/QLoRA instead of GPT-2/Full FT
vs. LlamaRec: Retrieves items directly from the LLM-generated query, without the separate reranking stage that LlamaRec uses
vs. TIGER: Generates natural language queries for retrieval rather than learning and predicting discrete semantic IDs

Limitations

Inference latency is higher than ID-based models due to LLM generation
NDCG scores are sometimes lower than reranking-based methods (LlamaRec) due to lack of a specialized reranking stage
Requires indexing the full catalog with a dense encoder

Reproducibility

Code: https://github.com/krishnacharya/GLoSS

Publicly available. Includes code and model checkpoints. Uses the Unsloth library for training and retriv for indexing.

📊 Experiments & Results

Evaluation Setup

Leave-last-out sequential recommendation on Amazon datasets

Benchmarks:

Amazon Beauty (Sequential Recommendation)
Amazon Toys (Sequential Recommendation)
Amazon Sports (Sequential Recommendation)

Metrics:

Recall@5
NDCG@5
Statistical methodology: Not explicitly reported in the paper
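
With leave-last-out evaluation there is exactly one held-out item per user, so both metrics take a simple form. The sketch below assumes that single-target setting. The function names are illustrative, and the reported numbers would be averages of these per-user values.

```python
import math
from typing import Sequence


def recall_at_k(ranked: Sequence[str], target: str, k: int = 5) -> float:
    """1.0 if the held-out item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[str], target: str, k: int = 5) -> float:
    """NDCG@k with one relevant item: 1 / log2(rank + 1), or 0.0 if it is missed.

    The ideal DCG is 1.0 (target at rank 1), so no further normalization is needed.
    """
    top = list(ranked[:k])
    if target not in top:
        return 0.0
    rank = top.index(target) + 1  # 1-based position
    return 1.0 / math.log2(rank + 1)


ranked = ["item_a", "item_b", "item_c"]
print(recall_at_k(ranked, "item_b"))          # 1.0
print(round(ndcg_at_k(ranked, "item_b"), 4))  # 0.6309
```

Recall@5 only asks whether the target made the list, while NDCG@5 also rewards placing it near the top. A method can therefore lead on Recall@5 while trailing a reranking-based method on NDCG@5.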

Key Results

Comparison against ID-based baselines shows GLoSS-8B consistently outperforming the best prior methods (ActionPiece, TIGER) across all datasets.
Benchmark	Metric	Baseline	This Paper	Δ
Amazon Beauty	Recall@5	0.0511	0.0681	+0.0170
Amazon Toys	Recall@5	0.0521	0.0796	+0.0275
Amazon Sports	Recall@5	0.0316	0.0364	+0.0048

Comparison against LLM-based baselines shows GLoSS achieving superior Recall@5, though it is only competitive on NDCG@5.
Benchmark	Metric	Baseline	This Paper	Δ
Amazon Beauty	Recall@5	0.0653	0.0681	+0.0028
Amazon Toys	Recall@5	0.0648	0.0796	+0.0148

Ablation study demonstrating the superiority of dense retrieval (e5) over lexical retrieval (BM25) within the GLoSS framework.
Benchmark	Metric	Baseline	This Paper	Δ
Amazon Toys	NDCG@5	0.0472	0.0529	+0.0057
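
The relative gains quoted under Evaluation Highlights follow from the absolute Recall@5 values in the ID-based comparison. The short calculation below makes the link explicit, using only numbers from the table.

```python
# (baseline Recall@5, GLoSS Recall@5) from the ID-based comparison rows.
ID_BASED_ROWS = {
    "Amazon Beauty": (0.0511, 0.0681),
    "Amazon Toys": (0.0521, 0.0796),
    "Amazon Sports": (0.0316, 0.0364),
}


def relative_gain_pct(baseline: float, ours: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (ours - baseline) / baseline


for name, (baseline, ours) in ID_BASED_ROWS.items():
    gain = relative_gain_pct(baseline, ours)
    print(f"{name}: +{ours - baseline:.4f} absolute, +{gain:.1f}% relative")
# Amazon Beauty: +0.0170 absolute, +33.3% relative
# Amazon Toys: +0.0275 absolute, +52.8% relative
# Amazon Sports: +0.0048 absolute, +15.2% relative
```

The Beauty and Toys figures match the highlights. The +29.5% highlight for Sports is measured against E4SRec, a different baseline whose score is not listed in this table, so it cannot be reproduced from these rows.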

Experiment Figures

Bar charts comparing Recall@5 of the simple Last-Item Search (LIS) baseline against SASRec and TIGER.

Main Takeaways

GLoSS achieves state-of-the-art performance, outperforming both ID-based (TIGER, SASRec) and LLM-based (GPT4Rec, P5) baselines in Recall@5 across all datasets.
Dense retrieval (Semantic Search) provides substantial gains over BM25, particularly in NDCG@5 (ranking quality), confirming the value of semantic matching.
Performance scales with LLM size; 8B models generally outperform 1B and 3B variants.
Robust across user segments: performs exceptionally well for cold-start users (short history) on Toys and Sports, while benefiting from longer histories on the Beauty dataset.

📚 Prerequisite Knowledge

Prerequisites

Sequential Recommendation
Large Language Models (LLMs)
Dense Retrieval / Semantic Search
Parameter-Efficient Fine-Tuning (LoRA/QLoRA)

Key Terms

Dense retrieval: A search method using vector embeddings to find relevant items based on semantic meaning rather than exact keyword matches

BM25: A classic lexical retrieval algorithm that ranks documents based on keyword occurrence and frequency (sparse retrieval)

QLoRA: Quantized Low-Rank Adaptation, a technique that fine-tunes large models efficiently by keeping the base weights frozen in 4-bit quantized form and training only small low-rank adapter matrices

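Recall@k: The fraction of a user's relevant items that appear in the top-k recommendations; with a single held-out item per user, as in leave-last-out evaluation, it is the share of users whose target item appears in the top k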

NDCG@k: Normalized Discounted Cumulative Gain—a ranking metric that credits the model more for placing relevant items higher in the top-k list

Cold-start users: Users with very little interaction history, making it difficult for systems to learn their preferences

ID-based methods: Recommender systems that learn embeddings for specific item IDs, often failing to generalize to new items without retraining

Semantic search: Retrieval based on meaning and context (using embeddings) rather than just keyword matching

Beam search decoding: A text generation strategy that explores multiple likely output sequences (beams) simultaneously to find the best overall sequence

e5-small-v2: A specific dense embedding model used to convert text into vector representations for retrieval