HLLM: Enhancing Sequential Recommendations via Hierarchical Large Language Models for Item and User Modeling

📝 Paper Summary

Sequential Recommendation, LLM for Recommendation

HLLM decouples recommendation into an Item LLM that compresses text into embeddings and a User LLM that models interests over these embeddings, achieving scalability and efficiency.

Core Problem

Directly inputting user history text into LLMs creates excessively long sequences, causing quadratic complexity growth and inefficiency, while traditional ID-based models struggle with cold starts and shallow modeling capabilities.

Why it matters:

LLMs' self-attention complexity scales quadratically with sequence length, making long user history text input computationally prohibitive
Existing LLM-based recommenders often fail to significantly outperform traditional methods, questioning the value of pre-trained weights
Scalability of billion-parameter models in recommendation remains under-explored compared to other domains

Concrete Example: Recommending a single item using a standard text-in/text-out LLM requires generating multiple tokens and multiple forward passes, which is slow. Additionally, representing a long history of user behaviors as raw text results in a context length far exceeding that of ID-based sequences, slowing down training and inference.
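
To make the sequence-length argument concrete, here is a back-of-the-envelope sketch. The history length and tokens-per-item figures are illustrative assumptions, not numbers from the paper, and the cost model is deliberately crude: it only counts query-key pairs, since self-attention cost grows with the square of sequence length.

```python
def attention_pairs(seq_len: int) -> int:
    """Query-key pairs in full self-attention; cost is proportional to this."""
    return seq_len * seq_len


# Hypothetical workload, chosen only for illustration.
history_items = 50      # items in one user's history
tokens_per_item = 100   # text tokens describing each item

# Flat design: the whole history is fed as one long text sequence.
text_len = history_items * tokens_per_item
flat_cost = attention_pairs(text_len)

# Hierarchical design: each item is encoded on its own, then the user
# model attends over one embedding per item.
item_side_cost = history_items * attention_pairs(tokens_per_item)
user_side_cost = attention_pairs(history_items)
hier_cost = item_side_cost + user_side_cost

print(text_len, flat_cost)    # 5000 25000000
print(hier_cost)              # 502500
print(flat_cost / hier_cost)  # roughly 49.75
```

Under these assumed numbers the flat text input needs about 50 times more attention pairs. The gap would widen further if item embeddings were computed once and reused across users, which this sketch does not model.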

Key Novelty

Hierarchical Large Language Model (HLLM)

Decouples the task into two tiers: an Item LLM that acts as a feature extractor converting item text to embeddings, and a User LLM that processes these embeddings to predict user interests.
Uses a special token [ITEM] to compress detailed item descriptions into concise vectors, reducing the user sequence length to match efficient ID-based models while retaining semantic richness.

Architecture

The hierarchical architecture of HLLM. It shows the Item LLM processing item text to generate an embedding (via [ITEM] token) and the User LLM processing a sequence of these item embeddings to predict the next item embedding.

Evaluation Highlights

Significantly outperforms traditional ID-based models (SASRec) and LLM-based baselines on PixelRec and Amazon Reviews datasets.
Scaling the User LLM from 1B to 7B parameters yields consistent performance gains, validating scaling laws in recommendation.
Achieves state-of-the-art results with high training efficiency, surpassing ID-based models with only a small fraction of training data.

Breakthrough Assessment

8/10

Successfully validates the scaling law for billion-parameter models in recommendation and offers a practical, efficient architecture that bridges the gap between ID-based efficiency and LLM semantic capabilities.

⚙️ Technical Details

Problem Definition

Setting: Sequential recommendation predicting the next item given a chronological sequence of user interactions.

Inputs: User u's history sequence U = {I_1, I_2, ..., I_n} containing item text information.

Outputs: Predicted embedding for the next item I_{n+1}.

Pipeline Flow

Item LLM: Item Text → [ITEM] Token Embedding
User LLM: Sequence of Item Embeddings → Next Item Embedding Prediction
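
The two-stage flow can be sketched at the level of data shapes. This is a toy illustration only: `item_llm` and `user_llm` are hypothetical stand-in functions that replace both transformers with simple arithmetic, and the catalog, history, and embedding width are made up. What the sketch preserves is the interface: text goes in and one vector per item comes out, then a sequence of vectors goes in and one predicted vector comes out.

```python
import math
from typing import Dict, List

Vector = List[float]
DIM = 8  # toy embedding width (assumption)


def item_llm(item_text: str) -> Vector:
    """Stand-in for the Item LLM: item text in, one embedding out.

    The real model would read the hidden state at the appended [ITEM] token.
    Here every token, including [ITEM], is mapped to a deterministic toy
    vector and the vectors are averaged.
    """
    tokens = item_text.lower().split() + ["[ITEM]"]
    vec = [0.0] * DIM
    for tok in tokens:
        seed = sum(ord(ch) for ch in tok)
        for d in range(DIM):
            vec[d] += math.sin(seed * (d + 1))
    return [v / len(tokens) for v in vec]


def user_llm(history: List[Vector]) -> Vector:
    """Stand-in for the User LLM: item embeddings in, next-item embedding out.

    A recency-weighted average replaces the transformer layers.
    """
    weights = [i + 1 for i in range(len(history))]
    total = float(sum(weights))
    return [
        sum(w * h[d] for w, h in zip(weights, history)) / total
        for d in range(DIM)
    ]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


# Hypothetical catalog and user history.
catalog: Dict[str, str] = {
    "tent": "lightweight two person camping tent",
    "stove": "portable gas camping stove",
    "novel": "paperback mystery novel",
}
history_texts = ["waterproof hiking boots", "insulated sleeping bag"]

item_embs = {name: item_llm(text) for name, text in catalog.items()}
history_embs = [item_llm(text) for text in history_texts]
predicted = user_llm(history_embs)

# Rank candidates by similarity to the predicted next-item embedding.
ranked = sorted(catalog, key=lambda n: dot(predicted, item_embs[n]), reverse=True)
print(len(history_embs), len(predicted), ranked)
```

Because the stand-ins carry no learned semantics, the resulting ranking is meaningless. The point is that the user-side sequence has one position per item rather than one per text token.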

System Modules

Item LLM

Extracts semantic features from item text descriptions

Model or implementation: Pre-trained LLM (e.g., derived from open-source weights)

User LLM

Models user interest evolution over time

Model or implementation: Pre-trained LLM (discarding word embeddings, retaining transformer layers)

Novel Architectural Elements

Hierarchical decoupling of Item and User modeling using two separate LLMs.
User LLM operates purely in embedding space (embedding-in, embedding-out) while leveraging pre-trained transformer weights.

Modeling

Base Model: Open-source pre-trained LLMs. The summarized snippet does not name the exact base checkpoints, and the 7B parameter count alone does not identify one.

Training Method: Supervised fine-tuning (SFT) with recommendation objectives

Objective Functions:

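Purpose: Generative recommendation (Next Item Prediction).
Formally: InfoNCE loss L_{NIP}, which maximizes the similarity s(E'_{j,i}, E_{j,i}) between the predicted embedding E'_{j,i} and the ground-truth item embedding E_{j,i} relative to N negative samples.

Purpose: Discriminative recommendation (Click Prediction).
Formally: Binary cross-entropy loss L_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} [ y_i \log \sigma(x_i) + (1 - y_i) \log(1 - \sigma(x_i)) ], where x_i is the predicted logit, y_i \in \{0, 1\} is the label, \sigma is the sigmoid, and N here counts training samples.

Purpose: Auxiliary task for discriminative models.
Formally: L = L_{BCE} + \lambda L_{NIP}, where L_{NIP} is the next-item-prediction loss and \lambda weights the auxiliary term.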
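
A minimal pure-Python sketch of the three objectives may help. Several choices here are assumptions rather than details from the paper: the similarity s is taken to be cosine similarity divided by a fixed temperature, the losses are computed for a single prediction instead of a batch, and the function names and the value of lambda are hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den  # assumes neither vector is all zeros


def info_nce(pred, positive, negatives, temperature: float = 1.0) -> float:
    """-log softmax of the positive's similarity over positive plus negatives."""
    logits = [cosine(pred, positive) / temperature]
    logits += [cosine(pred, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_denominator - logits[0]


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bce_with_logits(logits: Sequence[float], labels: Sequence[float]) -> float:
    """Mean binary cross-entropy, using -log(sigmoid(x)) = softplus(-x)."""
    total = 0.0
    for x, y in zip(logits, labels):
        total += y * softplus(-x) + (1.0 - y) * softplus(x)
    return total / len(logits)


def combined_loss(bce: float, nip: float, lam: float) -> float:
    """L = L_BCE + lambda * L_NIP."""
    return bce + lam * nip


# Toy inputs: the prediction matches the positive and opposes one negative.
nip = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
bce = bce_with_logits([0.0, 2.0], [1.0, 0.0])
total = combined_loss(bce, nip, lam=0.5)  # lambda value is arbitrary here
print(nip, bce, total)
```

The InfoNCE term shrinks as the predicted embedding moves toward the ground-truth item and away from the negatives, which is the behavior the generative objective relies on.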

Adaptation: Full fine-tuning (implied by 'further fine-tuning leads to significant performance boosts' and discussion of pre-trained weights)

Trainable Parameters: Up to 7B each for the Item LLM and the User LLM

Training Data:

PixelRec (large-scale)
Amazon Reviews (large-scale)

Key Hyperparameters:

model_size: Up to 7B parameters

Compute: Not reported in the paper

Comparison to Prior Work

vs. SASRec: HLLM uses text-derived embeddings via LLMs rather than learned ID embeddings, enabling better cold-start handling and semantic understanding.
vs. Direct Text-Input LLMs (e.g., TallRec): HLLM compresses items to embeddings first, avoiding quadratic complexity of long text sequences.
vs. LLaRA: HLLM uses a hierarchical two-LLM structure rather than hybrid prompting, and User LLM operates purely on embeddings.

Limitations

The paper snippet does not list limitations explicitly; one likely issue is the high memory footprint of deploying two 7B-parameter models.
Requires fine-tuning; pre-trained weights alone are insufficient without task-specific adaptation.
Inference cost is higher than lightweight ID-based models despite being more efficient than full text-based LLMs.

Reproducibility

Code: https://github.com/bytedance/HLLM

Code is available at https://github.com/bytedance/HLLM. The datasets (PixelRec, Amazon Reviews) are publicly available benchmarks. The specific base LLM checkpoints are not named in the text snippet, and the 7B size alone does not identify them.

📊 Experiments & Results

Evaluation Setup

Sequential recommendation on large-scale datasets.

Benchmarks:

PixelRec (Sequential Recommendation)
Amazon Reviews (Sequential Recommendation)

Metrics:

Not explicitly listed in snippet (likely Recall@K, NDCG@K based on standard practices)
Statistical methodology: Not explicitly reported in the paper
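
Since the snippet only suggests Recall@K and NDCG@K as likely metrics, the following is a sketch of how those two are commonly computed for next-item prediction with a single ground-truth item per user. It is not taken from the paper's evaluation code, and the scores below are invented.

```python
import math
from typing import Sequence


def rank_of_target(scores: Sequence[float], target_index: int) -> int:
    """1-based rank of the target when items are sorted by descending score.

    Ties are resolved in the target's favor, which is an optimistic choice.
    """
    target_score = scores[target_index]
    return 1 + sum(1 for s in scores if s > target_score)


def recall_at_k(rank: int, k: int) -> float:
    """1 if the single ground-truth item appears in the top k, else 0."""
    return 1.0 if rank <= k else 0.0


def ndcg_at_k(rank: int, k: int) -> float:
    """With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


# Hypothetical scores for four candidate items; item 3 is the true next item.
scores = [0.1, 0.9, 0.4, 0.7]
rank = rank_of_target(scores, target_index=3)  # rank is 2

print(rank, recall_at_k(rank, 1), recall_at_k(rank, 2), ndcg_at_k(rank, 10))
```

Both metrics are then averaged over all evaluated users. NDCG additionally rewards placing the true item nearer the top of the list.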

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
The provided text snippet contains no result tables with numerical values. The paper claims state-of-the-art results and scalability, but the snippet gives no exact metrics (e.g., NDCG, Recall) or values.

Experiment Figures

Strategies for discriminative recommendation using User LLM variants: Early Fusion (concatenating target item) and Late Fusion (separate user/item processing).
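
The difference between the two fusion strategies can be sketched with a toy sequence encoder. Everything below is an illustrative assumption: `encode_sequence` is a hypothetical stand-in for the User LLM, the early-fusion prediction head is a fixed weight vector, and late fusion is scored with a plain dot product, which may not match the heads used in the paper.

```python
from typing import List

Vector = List[float]


def encode_sequence(seq: List[Vector]) -> Vector:
    """Stand-in for the User LLM: returns one vector for the whole sequence.

    A recency-weighted average replaces the transformer layers.
    """
    dim = len(seq[0])
    weights = [i + 1 for i in range(len(seq))]
    total = float(sum(weights))
    return [sum(w * v[d] for w, v in zip(weights, seq)) / total for d in range(dim)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


# Hypothetical prediction-head weights for the early-fusion variant.
HEAD_WEIGHTS: Vector = [0.5, -0.25, 1.0]


def early_fusion_logit(history: List[Vector], target: Vector) -> float:
    """The target item joins the sequence, so it is encoded with the history."""
    fused = encode_sequence(history + [target])
    return dot(HEAD_WEIGHTS, fused)


def late_fusion_logit(user_vec: Vector, target: Vector) -> float:
    """User and item are encoded separately and combined only at the end."""
    return dot(user_vec, target)


history = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
candidates = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]

# Early fusion: one sequence pass per candidate.
early = [early_fusion_logit(history, c) for c in candidates]

# Late fusion: one pass for the user, reused for every candidate.
user_vec = encode_sequence(history)
late = [late_fusion_logit(user_vec, c) for c in candidates]
print(early, late)
```

In this sketch the late-fusion user vector does not depend on the candidate, so it is computed once and reused, whereas early fusion re-encodes the sequence for each candidate in exchange for letting the encoder see the target alongside the history.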

Main Takeaways

HLLM outperforms traditional ID-based models (like SASRec) significantly on large datasets.
Pre-trained weights from LLMs are valuable for recommendation but require fine-tuning to be effective.
The architecture scales effectively: increasing the User LLM from 1B to 7B parameters yields consistent performance improvements.
HLLM is more training-efficient than ID-based models, achieving better performance with less data.

📚 Prerequisite Knowledge

Prerequisites

Sequential Recommendation
Transformer architecture (Self-Attention)
Large Language Models (LLMs)

Key Terms

HLLM: Hierarchical Large Language Model—the proposed two-tier architecture separating item feature extraction from user interest modeling.

Item LLM: The first tier of HLLM that converts raw item text descriptions into dense vector embeddings using a special [ITEM] token.

User LLM: The second tier of HLLM that takes a sequence of item embeddings (from Item LLM) and predicts the embedding of the next item of interest.

InfoNCE loss: A contrastive loss function used to maximize the similarity between the predicted embedding and the ground-truth item embedding while minimizing similarity with negative samples.

Cold-start: A scenario where the system must recommend items or serve users with little to no prior interaction history.

Scaling law: The empirical observation that model performance improves predictably, approximately as a power law, as model size, dataset size, and compute increase.

SASRec: Self-Attentive Sequential Recommendation, a baseline model that applies causal (left-to-right) self-attention over a user's sequence of item IDs to predict the next item.

ID-based models: Traditional recommendation models that represent users and items as unique numerical IDs mapped to learned embeddings.

Generative recommendation: A paradigm where the model generates the representation (or ID/text) of the next item directly.

Discriminative recommendation: A paradigm where the model scores a specific user-item pair to predict the likelihood of interaction (e.g., click/no-click).