ONCE: Boosting Content-based Recommendation with Both Open- and Closed-source Large Language Models

📝 Paper Summary

Content-based Recommendation, LLM for Recommendation

ONCE synergizes open-source LLMs as fine-tunable content encoders and closed-source LLMs as data augmenters to achieve state-of-the-art performance in content-based recommendation.

Core Problem

Existing content encoders (CNNs, BERT) struggle with deep semantic understanding and lack external knowledge, leading to inaccurate similarity measurements between items with superficial textual overlaps.

Why it matters:

Traditional encoders relying on word overlap fail to distinguish distinct concepts (e.g., 'The Lion King' vs. 'The Lions of Al-Rassan'), hurting recommendation accuracy.
Small PLMs (Pretrained Language Models) like BERT lack the world knowledge and capacity to model complex user interests effectively.
Directly using closed-source LLMs as recommenders (via prompting) incurs high latency and often underperforms specialized models because it does not integrate collaborative signals.

Concrete Example: When encoding book titles, traditional models judge 'The Lion King' (Disney movie) and 'The Lions of Al-Rassan' (historical fantasy) as highly similar due to the word 'Lion'. LLaMA, possessing rich world knowledge, correctly encodes their distinct genres, placing 'The Lions of Al-Rassan' closer to 'The Summer Tree' (another fantasy novel) despite less lexical overlap.

Key Novelty

ONCE Framework (Open- and Closed-source LLMs)

DIRE (Discriminative Recommendation): Replaces traditional content encoders with open-source LLMs (LLaMA), freezing lower layers and fine-tuning top layers to create dense content embeddings.
GENRE (Generative Recommendation): Uses closed-source LLMs (GPT-3.5) to synthesize rich training data (summaries, inferred user profiles, synthetic history) to overcome data sparsity and enrich semantics.
Synergy: Data generated by GENRE accelerates the training convergence of DIRE and improves its final recommendation performance.

Architecture

Overview of the ONCE framework pipeline integrating DIRE and GENRE components.

Evaluation Highlights

Achieves up to +19.32% relative improvement over state-of-the-art BERT-based baselines on MIND news recommendation.
ONCE (LLaMA-13B) reaches the performance of a standard LLaMA-13B model's 8th epoch in just 6 epochs (25% fewer epochs) when using GPT-generated data.
Consistently outperforms baselines on Goodreads book recommendation, boosting nDCG@1 by over 5 points compared to BERT-12L.

Breakthrough Assessment

8/10

Significantly advances content-based recommendation by successfully integrating LLMs into the pipeline in two distinct, complementary roles (encoder vs. augmenter), yielding substantial empirical gains.

⚙️ Technical Details

Problem Definition

Setting: Content-based recommendation where the goal is to predict the probability of a user u clicking on a candidate item n, given browsing history h(u).

Inputs: User browsing history (sequence of clicked items), Candidate item features (Title, Abstract, Category).

Outputs: Click probability score y.

Pipeline Flow

Data Augmentation (GENRE): GPT-3.5 generates summaries, profiles, and synthetic history.
Content Encoding (DIRE): LLaMA processes augmented text -> Frozen Lower Layers -> Cached States -> Tuned Top Layers -> Embeddings.
Recommendation Head: Attention Fusion -> User/Item Representation -> Click Prediction.
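
The caching step in the Content Encoding stage can be illustrated with a small sketch. This is not the authors' code: `frozen_lower` and `top_scale` are hypothetical stand-ins for the frozen lower layers and the tuned top layers, and the arithmetic inside them is placeholder. The point is only that the frozen pass runs once per item while the trainable part is re-applied at every step.

```python
from typing import Callable, Dict, List

Vector = List[float]


def frozen_lower(text: str) -> Vector:
    """Hypothetical stand-in for the frozen lower layers (placeholder arithmetic)."""
    return [float(len(text)), float(sum(ord(c) for c in text) % 97)]


class CachedEncoder:
    """Runs the frozen part once per item and reuses the stored hidden state."""

    def __init__(self, lower: Callable[[str], Vector]) -> None:
        self.lower = lower
        self.cache: Dict[str, Vector] = {}
        self.lower_calls = 0  # counts how often the expensive frozen pass runs

    def hidden(self, item_id: str, text: str) -> Vector:
        if item_id not in self.cache:
            self.lower_calls += 1
            self.cache[item_id] = self.lower(text)
        return self.cache[item_id]

    def encode(self, item_id: str, text: str, top_scale: float) -> Vector:
        # `top_scale` stands in for the trainable top layers, whose weights
        # change every step while the cached hidden state stays fixed.
        return [top_scale * x for x in self.hidden(item_id, text)]


encoder = CachedEncoder(frozen_lower)
first = encoder.encode("n1", "Some news title", top_scale=1.0)
second = encoder.encode("n1", "Some news title", top_scale=0.5)
print(encoder.lower_calls)
```

Because the lower layers never change during fine-tuning, their outputs are identical across epochs, which is what makes storing them safe.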

System Modules

GENRE Data Augmenter

Enrich dataset with external knowledge before training

Model or implementation: GPT-3.5 (via API)
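
As a rough illustration of how GENRE-style prompts might be assembled, the sketch below builds a content-summarizer prompt and a user-profiler prompt as plain strings. The prompt wording and both function names are assumptions made for illustration, not the paper's actual prompts, and the call to the GPT-3.5 API is deliberately left out.

```python
from typing import List


def build_summary_prompt(title: str, abstract: str, category: str) -> str:
    """Assemble a content-summarizer prompt (hypothetical wording)."""
    return (
        "You are asked to summarize a news article for a recommender system.\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n"
        f"Category: {category}\n"
        "Reply with one concise, informative title."
    )


def build_profile_prompt(history_titles: List[str]) -> str:
    """Assemble a user-profiler prompt from browsed titles (hypothetical wording)."""
    lines = "\n".join(f"({i}) {t}" for i, t in enumerate(history_titles, start=1))
    return (
        "Below are the titles a user has browsed.\n"
        f"{lines}\n"
        "List the topics and regions this user is likely interested in."
    )


print(build_profile_prompt(["Local election results", "New stadium opens downtown"]))
```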

DIRE Content Encoder

Encode textual content into dense vector representations

Model or implementation: LLaMA-7B or LLaMA-13B

Attention Fusion Layer

Compress sequence of hidden states into a single content vector

Model or implementation: Linear projection + Additive Attention
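
A minimal sketch of this module, assuming the usual additive-attention form (score each token with a query vector applied to a tanh-transformed key, softmax the scores, take the weighted sum). All parameter names (`proj_w`, `att_w`, `query`, and so on) are hypothetical, and the exact layer shapes used in the paper are not given in this summary.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Compute weight @ x + bias for one vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def attention_fusion(
    hidden: List[Vector],
    proj_w: Matrix, proj_b: Vector,
    att_w: Matrix, att_b: Vector,
    query: Vector,
) -> Tuple[Vector, Vector]:
    # 1) Linear projection of each token's hidden state to a smaller dimension.
    projected = [linear(h, proj_w, proj_b) for h in hidden]
    # 2) Additive attention score per token: query . tanh(att_w @ p + att_b).
    scores = []
    for p in projected:
        keys = [math.tanh(v) for v in linear(p, att_w, att_b)]
        scores.append(sum(q * k for q, k in zip(query, keys)))
    # 3) Numerically stable softmax over the token scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # 4) Weighted sum of projected states gives a single content vector.
    dim = len(projected[0])
    pooled = [sum(w * p[j] for w, p in zip(weights, projected)) for j in range(dim)]
    return pooled, weights


hidden_states = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]  # 2 tokens, dimension 3
pooled, weights = attention_fusion(
    hidden_states,
    proj_w=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], proj_b=[0.0, 0.0],
    att_w=[[1.0, 0.0], [0.0, 1.0]], att_b=[0.0, 0.0],
    query=[1.0, -1.0],
)
print(pooled, weights)
```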

Recommender Backbone

Model user history and predict click probability

Model or implementation: Standard architectures (NRMS, NAML, Fastformer)
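
To make the backbone's role concrete, here is a deliberately simplified sketch: the user vector is the mean of the history item vectors and the click score is a dot product with the candidate vector. Mean pooling is an assumption used to keep the example short; the backbones named above use learned attention-based user encoders instead. The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average item vectors into one user vector (simplifying assumption)."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def click_score(history: List[Vector], candidate: Vector) -> float:
    """Dot product between the pooled user vector and the candidate vector."""
    user = mean_pool(history)
    return sum(u * c for u, c in zip(user, candidate))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


history = [[1.0, 0.0], [0.8, 0.2]]
score_similar = click_score(history, [1.0, 0.0])
score_different = click_score(history, [0.0, 1.0])
print(sigmoid(score_similar), sigmoid(score_different))
```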

Novel Architectural Elements

Replacement of standard word-embedding/CNN encoders with partial-frozen LLMs (DIRE).
Natural Concator: Using natural language templates to combine multi-field inputs instead of special separator tokens.
Injection of LLM-inferred user profiles (topics, regions) directly into the user representation via an MLP that produces an interest-aware user vector.
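
The Natural Concator idea can be contrasted with separator-token concatenation in a short sketch. The template wording below ("news article: title: ..., abstract: ...") and both function names are hypothetical; the summary does not give the exact template.

```python
from typing import Dict


def separator_concat(fields: Dict[str, str], sep: str = " [SEP] ") -> str:
    """Conventional approach: join field values with a special separator token."""
    return sep.join(fields.values())


def natural_concat(fields: Dict[str, str], item_type: str = "news article") -> str:
    """Natural-language template: name each field in plain words (hypothetical wording)."""
    parts = ", ".join(f"{name}: {value}" for name, value in fields.items())
    return f"{item_type}: {parts}"


item = {"title": "Team wins final", "category": "sports"}
print(separator_concat(item))
print(natural_concat(item))
```

The motivation for the template form is that an LLM pretrained on ordinary text has seen plain field labels far more often than a separator token introduced at fine-tuning time.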

Modeling

Base Model: LLaMA-7B and LLaMA-13B (Open-source); GPT-3.5 (Closed-source)

Training Method: Supervised Fine-tuning (Recommendation Task)

Objective Functions:

Purpose: Minimize prediction error for user clicks.

Formally: Cross-entropy loss on the click probability prediction.
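
One common way to instantiate this objective with negative sampling is a softmax over one clicked item and K sampled non-clicked items, with the loss being the negative log-probability of the clicked item. The sketch below assumes that form; the summary only states "cross-entropy", so the softmax formulation and the function name are assumptions. K = 4 matches the negative_sampling_ratio listed among the hyperparameters.

```python
import math
from typing import List


def sampled_softmax_loss(pos_score: float, neg_scores: List[float]) -> float:
    """Cross-entropy over [positive] + negatives: -log softmax(scores)[positive]."""
    scores = [pos_score] + list(neg_scores)
    top = max(scores)
    # log-sum-exp computed stably by subtracting the maximum score
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - pos_score


uninformed = sampled_softmax_loss(0.0, [0.0, 0.0, 0.0, 0.0])  # all scores equal
confident = sampled_softmax_loss(5.0, [0.0, 0.0, 0.0, 0.0])   # positive ranked first
print(uninformed, confident)
```

With equal scores the model assigns probability 1/5 to the clicked item, so the loss equals log 5; the loss falls toward zero as the clicked item's score pulls ahead of the negatives.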

Adaptation: LoRA (Low-Rank Adaptation) on top k Transformer layers (k=1 or 2 typically)

Trainable Parameters: Top k layers of LLaMA + Recommender Head parameters (Attention/MLP)
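
For readers unfamiliar with LoRA, the following sketch shows the standard low-rank update on a single linear layer: the frozen weight W is left untouched and a trainable product B·A, scaled by alpha / rank, is added to its output. The matrix sizes, `alpha`, and `rank` values here are illustrative assumptions, not the paper's settings.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def lora_forward(x: Vector, w: Matrix, a: Matrix, b: Matrix,
                 alpha: float, rank: int) -> Vector:
    """y = W x + (alpha / rank) * B (A x); only A and B would be trained."""
    base = matvec(w, x)               # frozen path, shape: out
    update = matvec(b, matvec(a, x))  # low-rank path: in -> rank -> out
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen weight (out=2, in=2)
A = [[0.5, -0.5]]              # trainable, shape rank x in (rank=1)
B_zero = [[0.0], [0.0]]        # usual initialization: B starts at zero
B_trained = [[1.0], [2.0]]     # hypothetical values after some training

at_init = lora_forward([1.0, 1.0], W, A, B_zero, alpha=2.0, rank=1)
after = lora_forward([1.0, 0.0], W, A, B_trained, alpha=2.0, rank=1)
print(at_init, after)
```

Initializing B to zero means the adapted layer starts out identical to the frozen one, so fine-tuning begins from the pretrained behavior.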

Key Hyperparameters:

learning_rate_mind: 1e-3 (Adam)
learning_rate_goodreads: 1e-4 (Adam)
learning_rate_llama_no_lora: 1e-5
batch_size: Not reported in the paper
embedding_dimension: 64 (non-LLM modules)
negative_sampling_ratio: 4

Compute: Single NVIDIA A100 (80GB) for LLaMA experiments; NVIDIA RTX 3090 for non-LLM baselines.

Comparison to Prior Work

vs. PLM-NR: ONCE uses billion-scale LLMs (LLaMA) instead of BERT, and incorporates GPT-generated data augmentation.
vs. NRMS/NAML: ONCE replaces the shallow word embedding/CNN encoder with a deep LLM encoder.
vs. Prompt-based Recommendation (e.g., ChatGPT used directly as a recommender): ONCE uses the LLM as a feature encoder/augmenter within a traditional ranking pipeline, avoiding the high latency and poor ranking performance of direct prompting (the paper discusses this as a general approach rather than citing a specific method).

Limitations

High computational cost compared to small PLMs (BERT) or CNNs, despite caching optimizations.
Reliance on closed-source APIs (GPT-3.5) for data augmentation incurs monetary cost.
Fine-tuning LLaMA-13B was noted to be more difficult than LLaMA-7B on the MIND dataset.

Reproducibility

Code: https://github.com/Jyonn/ONCE

Code and LLM-generated data are available at https://github.com/Jyonn/ONCE. The paper provides detailed hyperparameters for the base models and LLM tuning (learning rates, layer-freezing strategy).

📊 Experiments & Results

Evaluation Setup

Click-through rate prediction on standard datasets.

Benchmarks:

MIND (News Recommendation)
Goodreads (Book Recommendation)

Metrics:

AUC (Area Under Curve)
MRR (Mean Reciprocal Rank)
nDCG@5
nDCG@10
nDCG@1 (Goodreads only)
Statistical methodology: Results averaged over 5 independent runs; p-value < 0.01 reported for significance.
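
The ranking metrics listed above can be computed for a single impression (one user's candidate list) as in the sketch below. It assumes binary click labels; the MRR here uses the first-clicked-item convention, whereas some evaluation scripts average reciprocal ranks over all clicked items. Function names and the toy data are illustrative.

```python
import math
from typing import List


def ranked_labels(labels: List[int], scores: List[float]) -> List[int]:
    """Labels reordered by descending predicted score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def auc(labels: List[int], scores: List[float]) -> float:
    """Fraction of (clicked, non-clicked) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mrr(labels: List[int], scores: List[float]) -> float:
    """Reciprocal rank of the first clicked item."""
    for rank, label in enumerate(ranked_labels(labels, scores), start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0


def dcg_at_k(rels: List[int], k: int) -> float:
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(rels[:k], start=1))


def ndcg_at_k(labels: List[int], scores: List[float], k: int) -> float:
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked_labels(labels, scores), k) / ideal if ideal > 0 else 0.0


labels = [0, 1, 0, 0, 1]
scores = [0.5, 0.9, 0.3, 0.2, 0.4]
print(auc(labels, scores), mrr(labels, scores), ndcg_at_k(labels, scores, 5))
```

Dataset-level numbers such as those reported in the paper are obtained by averaging these per-impression values over all impressions.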

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on MIND dataset using NRMS backbone shows ONCE outperforms both original and BERT-based versions.
MIND	AUC	64.08	68.74	+4.66
MIND	nDCG@10	37.42	44.37	+6.95
Performance on Goodreads dataset using NRMS backbone demonstrates larger gains due to richer semantic modeling of book titles.
Goodreads	nDCG@1	71.80	77.89	+6.09
Ablation study on LLaMA tuning layers (DIRE) shows partial tuning outperforms frozen models.
MIND	AUC	68.10	68.50	+0.40

Experiment Figures

Training curves (AUC vs. Epoch) for LLaMA-13B and ONCE models on NRMS and Fastformer.

Main Takeaways

Open-source LLMs (DIRE) provide the most significant performance boost, suggesting embedding quality is the primary bottleneck in content recommendation.
Closed-source LLM augmentation (GENRE) offers complementary gains, particularly by enriching sparse data (content summarizer) and modeling user interests (user profiler).
Synergy: Using GENRE data to train DIRE models (ONCE) cuts the number of epochs needed to reach a given performance by ~25% compared to training DIRE on raw data.
LoRA helps performance on news data (MIND) but hinders it on book data (Goodreads), where full fine-tuning of top layers is preferred.
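
The ~25% figure in the synergy takeaway follows from the epoch counts given in the evaluation highlights (8 epochs versus 6). The short calculation below makes the arithmetic explicit; it assumes each epoch costs the same wall-clock time, which the summary does not state, and the function name is illustrative.

```python
from typing import Tuple


def epoch_savings(baseline_epochs: int, accelerated_epochs: int) -> Tuple[float, float]:
    """Return (fraction of epochs saved, speedup factor) under equal per-epoch cost."""
    reduction = (baseline_epochs - accelerated_epochs) / baseline_epochs
    speedup = baseline_epochs / accelerated_epochs
    return reduction, speedup


reduction, speedup = epoch_savings(8, 6)
print(f"{reduction:.0%} fewer epochs, {speedup:.2f}x speedup")
# prints "25% fewer epochs, 1.33x speedup"
```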

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (attention mechanisms)
Content-based filtering concepts
Pre-training vs. Fine-tuning paradigms

Key Terms

DIRE: Discriminative Recommendation—using open-source LLMs (like LLaMA) as encoders to generate embeddings for classification/ranking.

GENRE: Generative Recommendation—using closed-source LLMs (like GPT-3.5) to generate synthetic text/data to augment training sets.

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that injects trainable low-rank matrices into frozen model weights.

AUC: Area Under the ROC Curve—a metric measuring the ability of a classifier to distinguish between positive and negative classes.

nDCG: Normalized Discounted Cumulative Gain—a ranking metric that accounts for the position of relevant items in the recommendation list.

Warm User: A user with sufficient interaction history (defined here as >5 browsed items) to model preferences effectively.

Cold/New User: A user with very limited interaction history (<=5 items), making recommendation difficult.

Prompting: Providing natural language instructions to an LLM to guide its output generation.

Caching: Pre-computing and storing the output of frozen lower layers of a model to avoid redundant computation during fine-tuning.
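
The warm/cold user distinction defined in the key terms reduces to a threshold on history length. A minimal sketch, using the threshold of 5 browsed items stated above; the function name and the toy user IDs are hypothetical.

```python
from typing import Dict, List, Tuple


def split_users(histories: Dict[str, List[str]],
                threshold: int = 5) -> Tuple[List[str], List[str]]:
    """Warm users have more than `threshold` browsed items; the rest are cold/new."""
    warm = [user for user, items in histories.items() if len(items) > threshold]
    cold = [user for user, items in histories.items() if len(items) <= threshold]
    return warm, cold


histories = {
    "u1": ["n1", "n2", "n3", "n4", "n5", "n6"],  # 6 items: warm
    "u2": ["n1", "n2", "n3", "n4", "n5"],        # exactly 5 items: cold
    "u3": [],                                    # no history: cold
}
warm_users, cold_users = split_users(histories)
print(warm_users, cold_users)
```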