ONCE: Boosting Content-based recommendation with both open- and closed-source LLMs

📝 Paper Summary

LLM-based recommendation, Content-based recommendation

ONCE combines open-source LLMs (fine-tuned as content encoders) and closed-source LLMs (prompted as data augmenters) to significantly enhance content-based recommendation systems.

Core Problem

Existing content-based recommenders using CNNs or small PLMs (like BERT) fail to fully comprehend deep semantic content or capture knowledge about rare entities, limiting recommendation quality.

Why it matters:

Current encoders struggle with word-level ambiguity (e.g., confusing 'Lion King' with 'Lions of Al-Rassan') and lack the external knowledge needed for accurate semantic matching
Prompting closed-source LLMs directly often underperforms traditional methods because it lacks fine-grained signal, and it adds latency, while open-source models are rarely optimized for recommendation tasks

Concrete Example: Standard encoders incorrectly rate 'The Lion King' (Disney movie) and 'The Lions of Al-Rassan' (historical fantasy) as highly similar due to word overlap. In contrast, LLaMA correctly identifies that 'The Lions of Al-Rassan' is semantically closer to 'The Summer Tree' (same author, genre).

Key Novelty

Hybrid Open-Closed LLM Synergy (ONCE)

DIRE (Discriminative Recommendation): Replaces traditional content encoders with open-source LLMs (LLaMA), fine-tuning their upper layers to produce dense embeddings optimized for recommendation
GENRE (Generative Recommendation): Uses closed-source LLMs (GPT-3.5) to generate synthetic user profiles, summaries, and personalized content, enriching the training data for the discriminative model

Architecture

The ONCE framework integrating GENRE (Prompting) and DIRE (Fine-tuning)

Evaluation Highlights

ONCE achieves a +19.32% improvement in nDCG@5 over state-of-the-art baselines on the MIND news recommendation dataset
Fine-tuning LLaMA alone (DIRE) consistently yields >10% gains over BERT-based baselines across multiple metrics
Using generated data from GPT-3.5 accelerates LLaMA fine-tuning convergence by ~25-40% (reaching peak performance significantly earlier in training)

Breakthrough Assessment

8/10

Strong empirical results demonstrating how to effectively combine the complementary strengths of open weights (fine-tunability) and closed APIs (generation quality) for recommendation.

⚙️ Technical Details

Problem Definition

Setting: Content-based recommendation where the goal is to predict the probability of a user u clicking on a candidate item n

Inputs: User browsing history h(u) and candidate content item n (with text features like title, abstract, category)

Outputs: Click probability score y_hat
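The setting above can be illustrated with a minimal sketch. It assumes the user vector is a plain mean of history embeddings and that the score is a sigmoid over a dot product; both are simplifications chosen for illustration, since the recommenders named in this summary (NAML, NRMS, Fastformer, MINER) use learned attention-based user encoders. The function names are hypothetical.

```python
import math

def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def click_probability(history_embeddings, candidate_embedding):
    """Return y_hat = sigmoid(user_vector . candidate_vector)."""
    # Assumption: mean pooling stands in for a learned user encoder.
    user_vec = mean_pool(history_embeddings)
    logit = sum(u * c for u, c in zip(user_vec, candidate_embedding))
    return 1.0 / (1.0 + math.exp(-logit))

# Toy 2-dimensional embeddings for a history h(u) with two items.
history = [[1.0, 0.0], [0.0, 1.0]]
y_hat_match = click_probability(history, [2.0, 2.0])       # aligned with the user
y_hat_mismatch = click_probability(history, [-2.0, -2.0])  # opposed to the user
```

A candidate whose embedding points in the same direction as the pooled history gets a score above 0.5, and one pointing away gets a score below 0.5.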

Pipeline Flow

Data Augmentation (GENRE): GPT-3.5 generates summaries, user profiles, and synthetic content
Content Encoding (DIRE): LLaMA processes augmented text to produce embeddings
Recommendation (DIRE): User/History encoders process embeddings to predict clicks

System Modules

Data Augmenter

Generate enriched text features and synthetic history

Model or implementation: GPT-3.5
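The Data Augmenter's chained prompting, in which a generated user profile feeds the next generation step, might look roughly like the sketch below. The prompt wording, the `chain_generate` helper, and the `fake_llm` stub are all hypothetical; the stub only stands in for a closed-source API call so the example runs offline.

```python
from typing import Callable, Dict, List

def chain_generate(history_titles: List[str], llm: Callable[[str], str]) -> Dict[str, str]:
    """Two-step chain: infer a profile, then generate content conditioned on it."""
    # Step 1: ask for the user's interests given their browsing history.
    profile_prompt = (
        "Describe the topics and regions this user is interested in, "
        "given the news they browsed:\n" + "\n".join(history_titles)
    )
    profile = llm(profile_prompt)
    # Step 2: the step-1 output becomes part of the step-2 input.
    content_prompt = (
        "User interests: " + profile + "\n"
        "Write one news title this user would likely click."
    )
    synthetic_title = llm(content_prompt)
    return {"profile": profile, "synthetic_title": synthetic_title}

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a closed-source LLM API call."""
    if prompt.startswith("User interests:"):
        return "Local team wins championship"
    return "sports; United States"

result = chain_generate(["Playoff preview", "Trade deadline recap"], fake_llm)
```

Passing the LLM as a callable keeps the chain logic separate from any particular API client, which also makes it easy to test with a stub.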

Content Encoder

Encode text into dense vector representations

Model or implementation: LLaMA-7B / LLaMA-13B
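Before encoding, multi-field items are joined into one string with natural-language templates (the Natural Concator described under Key Terms). A minimal sketch follows; the field order, the `<field>` markers beyond the `news article: <title>` prefix, and the `natural_concat` name are assumptions for illustration.

```python
def natural_concat(item: dict, item_type: str = "news article") -> str:
    """Join an item's text fields into one template-style string."""
    parts = [f"{item_type}:"]
    # Assumed field order; missing or empty fields are skipped.
    for field in ("title", "abstract", "category"):
        value = item.get(field)
        if value:
            parts.append(f"<{field}> {value}")
    return " ".join(parts)

text = natural_concat({"title": "Storm hits coast", "category": "weather"})
# text == "news article: <title> Storm hits coast <category> weather"
```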

Recommendation Head

Model user interests and predict click probability

Model or implementation: Standard recommenders (NAML, NRMS, Fastformer, MINER)

Novel Architectural Elements

Integration of LLM-generated inferred user profiles (topics/regions) into the user vector via pooling and MLP fusion
Replacement of standard PLM encoders with parameter-efficiently fine-tuned LLMs (LLaMA) using layer caching for efficiency

Modeling

Base Model: LLaMA-7B and LLaMA-13B (Open-source); GPT-3.5 (Closed-source)

Training Method: Supervised Fine-Tuning (DIRE component) on recommendation task

Objective Functions:

Purpose: Optimize click prediction accuracy.

Formally: Standard cross-entropy loss for classification (click vs. non-click).
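One common way to realize this objective in news recommendation is a softmax cross-entropy over one clicked item and its sampled non-clicked items. The sketch below assumes that formulation with four negatives per positive, matching the negative_sampling_ratio listed under Key Hyperparameters; the summary itself does not confirm the exact form of the loss.

```python
import math

def softmax_cross_entropy(scores, positive_index=0):
    """Negative log-likelihood of the clicked item under a softmax over candidates."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[positive_index]

# One clicked item (index 0) and four sampled non-clicked items.
uniform_loss = softmax_cross_entropy([0.0, 0.0, 0.0, 0.0, 0.0])    # equals log(5)
confident_loss = softmax_cross_entropy([5.0, 0.0, 0.0, 0.0, 0.0])  # close to 0
```

The loss falls as the clicked item's score rises above the scores of the sampled negatives.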

Adaptation: Partial freezing (tuning top 1-2 layers) and LoRA (Low-Rank Adaptation)

Trainable Parameters: Top k transformer layers (k=1 or 2) plus LoRA parameters; remaining layers frozen
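A small sketch of the partial-freezing split: given the number of transformer layers and k, it lists which layer indices would be tuned and which stay frozen. The helper names are hypothetical, and the 32-layer figure is the layer count of LLaMA-7B.

```python
def trainable_layers(num_layers: int, k: int) -> list:
    """Indices of the top-k transformer layers, which are fine-tuned."""
    if not 0 <= k <= num_layers:
        raise ValueError("k must be between 0 and num_layers")
    return list(range(num_layers - k, num_layers))

def frozen_layers(num_layers: int, k: int) -> list:
    """Indices of the remaining lower layers, which stay frozen."""
    if not 0 <= k <= num_layers:
        raise ValueError("k must be between 0 and num_layers")
    return list(range(num_layers - k))

top2 = trainable_layers(32, 2)     # [30, 31]
frozen = frozen_layers(32, 2)      # indices 0..29
```

In a real training script the same index split would decide which layers have gradients enabled.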

Training Data:

MIND dataset: ~94k users, ~65k content items
Goodreads dataset: ~23k users, ~16k content items
Augmented with GPT-3.5 generated summaries and profiles

Key Hyperparameters:

learning_rate_mind: 1e-3 (1e-5 without LoRA)
learning_rate_goodreads: 1e-4
batch_size: Not reported in the paper
negative_sampling_ratio: 4
embedding_dimension: 64 (for non-LLM modules)

Compute: Single NVIDIA A100 (80GB) for LLaMA experiments; RTX 3090 for baselines. Training efficiency improved by caching lower layer outputs (reducing compute to ~6% of original).
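The caching idea can be sketched as follows: because the lower layers are frozen, their output for a given item never changes, so it can be computed once and reused in every later step. The `CachedEncoder` class and the scalar toy layers are hypothetical and only illustrate the control flow. With the top 2 of 32 layers tuned, 2/32 = 6.25% of the layers remain to be run per step, which is in line with the ~6% figure above, though the summary does not state how that figure was derived.

```python
class CachedEncoder:
    """Runs frozen lower layers once per item and reuses their output."""

    def __init__(self, frozen_layers, trainable_layers):
        self.frozen_layers = frozen_layers
        self.trainable_layers = trainable_layers
        self.cache = {}
        self.frozen_calls = 0  # counts how often a frozen layer is executed

    def _run_frozen(self, item_id, x):
        if item_id not in self.cache:
            for layer in self.frozen_layers:
                x = layer(x)
                self.frozen_calls += 1
            self.cache[item_id] = x
        return self.cache[item_id]

    def encode(self, item_id, x):
        h = self._run_frozen(item_id, x)
        # Only the trainable top layers run on every call.
        for layer in self.trainable_layers:
            h = layer(h)
        return h

# Toy stand-ins: 30 frozen layers and 2 trainable layers acting on a scalar.
encoder = CachedEncoder([lambda v: v + 1] * 30, [lambda v: v * 2] * 2)
first = encoder.encode("n1", 0)
second = encoder.encode("n1", 0)  # served from the cache
```

After both calls, the frozen layers have run only 30 times in total rather than 60.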

Comparison to Prior Work

vs. PLM-NR: Uses much larger LLMs (7B/13B vs BERT) and combines with generative augmentation from closed-source models
vs. Standard Recommenders: Replaces shallow encoders (CNN/Attention) with deep LLM representations
vs. Direct LLM Prompting: Uses LLM as an encoder/augmenter within a traditional framework rather than as a standalone scoring agent, avoiding latency/cost issues

Limitations

High computational cost for fine-tuning even with parameter-efficient methods compared to standard ID-based or small PLM models
Reliance on closed-source APIs (GPT-3.5) for data augmentation incurs monetary cost and dependency
Performance gains on book titles (Goodreads) are less drastic without LoRA, indicating sensitivity to input text richness

Reproducibility

Code: https://github.com/Jyonn/ONCE

Code and LLM-generated data are publicly available. Hyperparameters for the baselines and the proposed method are detailed.

📊 Experiments & Results

Evaluation Setup

Click-through rate prediction on news (MIND) and book (Goodreads) datasets

Benchmarks:

MIND (News Recommendation)
Goodreads (Book Recommendation)

Metrics:

AUC
MRR
nDCG@5
nDCG@10
Statistical methodology: Averaged over 5 independent runs; p-value < 0.01 reported

Key Results

Main comparison on MIND dataset showing ONCE outperforms traditional and BERT-based baselines.
Benchmark	Metric	Baseline	This Paper	Δ
MIND	AUC	65.32	68.62	+3.30
MIND	nDCG@10	40.35	44.05	+3.70
MIND	nDCG@5	31.35	38.31	+6.96
Ablation on open-source LLM size and fine-tuning depth.
MIND	AUC	67.78	68.34	+0.56
Impact of synthetic data on Cold Start vs Warm Users.
MIND	AUC	59.24	60.21	+0.97

Experiment Figures

Training curves (AUC vs Epoch) for LLaMA-based models vs ONCE

Comparison of embedding spaces for three books using GloVe, BERT, and LLaMA

Main Takeaways

Open-source LLMs (LLaMA) as encoders significantly outperform BERT-based encoders, demonstrating the value of larger scale and better pretraining
Data augmentation via closed-source LLMs (GENRE) provides consistent gains, particularly for user profiling and generating synthetic history for cold-start users
Synergy exists: Data augmentation accelerates the fine-tuning convergence of the open-source encoder (learning efficient representations faster)
LLaMA-7B is often sufficient and sometimes outperforms 13B on news titles, possibly due to easier fine-tuning on shorter text

📚 Prerequisite Knowledge

Prerequisites

Basics of content-based recommendation (user/item encoders)
Transformer architecture and attention mechanisms
Concept of Low-Rank Adaptation (LoRA)

Key Terms

DIRE: Discriminative Recommendation Framework—using open-source LLMs as trainable content encoders to extract embeddings

GENRE: Generative Recommendation Framework—using closed-source LLMs to generate synthetic data (summaries, profiles) to augment training

PLM-NR: Pretrained Language Model for News Recommendation—a baseline method using BERT-sized models as encoders

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes main weights and trains small decomposition matrices
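As a rough illustration of the standard LoRA formulation (not code from the paper), the adapted layer computes W0·x plus a scaled low-rank update (alpha / r)·B·A·x, where W0 stays frozen and only A and B are trained. The helper names, the rank of 8, and the tiny matrices below are assumptions; 4096 is the hidden size of LLaMA-7B.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W0, A, B, x, alpha, r):
    """Frozen base projection plus the scaled low-rank update."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_count(d, k, r):
    """Trainable parameters in B (d x r) and A (r x k)."""
    return r * (d + k)

W0 = [[1.0, 0.0], [0.0, 1.0]]  # frozen weight, d = k = 2
A = [[1.0, 1.0]]               # r x k with r = 1
B = [[0.0], [0.0]]             # d x r, zero-initialized so training starts at W0
y = lora_forward(W0, A, B, [3.0, 4.0], alpha=1.0, r=1)

full = 4096 * 4096                           # one full 4096 x 4096 projection
low_rank = lora_param_count(4096, 4096, 8)   # hypothetical rank 8
```

With these numbers the low-rank factors hold 65,536 parameters against 16,777,216 in the full matrix.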

Chain-based Generation: Iterative prompting where outputs from one step (e.g., user profile generation) are used as inputs for the next (e.g., synthetic content generation)

Natural Concator: A strategy of concatenating multi-field text data using natural language templates (e.g., 'news article: <title>...') rather than special separation tokens

MIND: Microsoft News Recommendation Dataset—a large-scale benchmark for news recommendation

Goodreads: A book recommendation dataset used for evaluating content-based filtering performance

Warm user: A user with more than 5 interactions in their history

Cold user: A user with 5 or fewer interactions in their history
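The warm/cold split above reduces to a single threshold check. The sketch below uses a hypothetical `user_segment` helper and makes the boundary case explicit: exactly 5 interactions counts as cold.

```python
def user_segment(history, threshold=5):
    """Label a user 'warm' if they have more than `threshold` interactions."""
    return "warm" if len(history) > threshold else "cold"

five_clicks = user_segment(["n1", "n2", "n3", "n4", "n5"])        # "cold"
six_clicks = user_segment(["n1", "n2", "n3", "n4", "n5", "n6"])   # "warm"
```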

nDCG: Normalized Discounted Cumulative Gain—a ranking metric that values correct items appearing earlier in the list
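A minimal sketch of nDCG@k with binary click labels, assuming the candidate list is already sorted by predicted score and using the common log2 position discount. The function names are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (best possible) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Labels in ranked order: 1 = clicked, 0 = not clicked.
clicked_first = ndcg_at_k([1, 0, 0, 0, 0], 5)  # 1.0
clicked_third = ndcg_at_k([0, 0, 1, 0, 0], 5)  # 1 / log2(4) = 0.5
```

Moving the single clicked item from rank 1 to rank 3 halves the score, which shows how the metric rewards correct items that appear earlier.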