LLMs for User Interest Exploration in Large-scale Recommendation Systems

📝 Paper Summary

Recommender Systems, Large Language Models (LLMs) for Recommendation

A hybrid framework combines LLM-based high-level interest planning with classic item-level recommendation models to explore novel user interests while maintaining industrial scalability.

Core Problem

Traditional recommendation systems reinforce feedback loops by recommending items similar to past behavior, limiting the discovery of novel interests and leading to content fatigue.

Why it matters:

Strong feedback loops in classic systems prevent users from discovering diverse content, reducing long-term engagement
Directly using LLMs for large-scale recommendation is prohibitively expensive in both latency and cost, and LLMs lack domain-specific knowledge of rapidly evolving item corpora
Off-the-shelf LLMs fail to capture domain-specific collaborative signals and user behavior patterns required for effective personalization

Concrete Example: A user who watches cooking videos keeps getting cooking recommendations due to feedback loops. An LLM might suggest 'travel vlogs' as a novel interest, but without grounding, it cannot efficiently map this broad concept to specific, high-quality video items in a corpus of billions.

Key Novelty

Hybrid Hierarchical Planning with Interest Clusters

Use LLMs as a high-level planner to generate 'interest clusters' (novel topics) based on user history, rather than predicting individual items directly
Ground these high-level interests by restricting a classic transformer-based recommender to only select items from the LLM-predicted clusters
Represent user history and future interests using 'cluster descriptions' (keywords) rather than item IDs to enable efficient offline LLM inference via lookup tables

Architecture

The Hybrid Hierarchical Planning framework. It illustrates the separation between High-level Language Policy (LLM) and Low-level Item Policy (Classic Model).

Evaluation Highlights

Live experiments on a commercial platform with billions of users reported increased exploration of novel interests (no statistical significance tests are reported)
Fine-tuned LLMs achieved >99% match rate in generating valid interest cluster descriptions after ~2,000 steps
Diversified data curation for fine-tuning eliminated long-tail distribution issues, ensuring broader interest exploration compared to random sampling

Breakthrough Assessment

8/10

Successfully deploys LLMs in a billion-user industrial system by solving the latency bottleneck via offline cluster planning. A practical architectural bridge between LLM reasoning and classic ID-based recommendation.

⚙️ Technical Details

Problem Definition

Setting: Sequential recommendation with a focus on exploration (predicting novel interests outside recent history)

Inputs: Sequence of user's historically consumed items (mapped to interest clusters)

Outputs: List of recommended items belonging to a novel interest cluster

Pipeline Flow

Offline: Item Clustering → LLM Fine-tuning → Bulk Inference of Cluster Transitions
Online: User History Mapping → Interest Cluster Lookup → Constrained Item Retrieval
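The online path above can be sketched as two dictionary lookups followed by a delegated retrieval call. This is a minimal illustration under stated assumptions, not the deployed system: `ITEM_TO_CLUSTER`, `TRANSITION_TABLE`, `recent_clusters`, `recommend`, and the `retrieve_in_cluster` callback are hypothetical names, the toy data is invented, and taking the clusters of the last K items as the lookup key is an assumption about how the history is summarized.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy data; every name and value here is an illustrative assumption.
ITEM_TO_CLUSTER: Dict[str, str] = {
    "v1": "home cooking",
    "v2": "baking",
    "v3": "street food travel",
}

# Table filled offline by bulk LLM inference: (K=2 recent clusters) -> novel cluster.
TRANSITION_TABLE: Dict[Tuple[str, ...], str] = {
    ("home cooking", "baking"): "street food travel",
}


def recent_clusters(history: List[str], k: int = 2) -> Tuple[str, ...]:
    """User History Mapping: clusters of the last k items that have a known cluster."""
    clusters = [ITEM_TO_CLUSTER[item] for item in history if item in ITEM_TO_CLUSTER]
    return tuple(clusters[-k:])


def recommend(
    history: List[str],
    retrieve_in_cluster: Callable[[List[str], str], List[str]],
    k: int = 2,
) -> List[str]:
    """Interest Cluster Lookup, then Constrained Item Retrieval via the callback."""
    novel_cluster = TRANSITION_TABLE.get(recent_clusters(history, k))
    if novel_cluster is None:
        return []  # no precomputed transition for this history
    return retrieve_in_cluster(history, novel_cluster)
```

No LLM call appears on the request path in this sketch; the only online work is the table lookup and whatever the item-level retriever does.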

System Modules

Item Clustering (Offline)

Group items into traffic-weighted, topically coherent clusters to reduce the planning space

Model or implementation: Hierarchical clustering on 256-dim item embeddings

High-level Language Policy (LLM)

Generate the next novel interest cluster description based on user's recent history

Model or implementation: Fine-tuned LLM (specific architecture not named, likely internal Google model)

Low-level Item Policy

Retrieve specific items that fall within the predicted novel cluster

Model or implementation: Transformer-based sequence recommender (classic ID-based model)
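One simple way to picture the constraint is as a filter applied to the recommender's scores before ranking. The sketch below assumes the sequence model has already produced a score per candidate item. `constrained_top_n`, `scores`, and `cluster_items` are hypothetical names, and the summary does not say how the production system enforces the restriction.

```python
import heapq
from typing import Dict, List, Set


def constrained_top_n(
    scores: Dict[str, float], cluster_items: Set[str], n: int = 3
) -> List[str]:
    """Return the n highest-scoring items that belong to the target cluster."""
    # Keep only candidates inside the LLM-predicted cluster (assumed membership set).
    allowed = ((item, score) for item, score in scores.items() if item in cluster_items)
    best = heapq.nlargest(n, allowed, key=lambda pair: pair[1])
    return [item for item, _ in best]
```

Items outside the predicted cluster never reach the ranking step here, regardless of how highly the recommender scores them.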

Novel Architectural Elements

Hybrid interface where LLM outputs 'constraints' (clusters) for a classic recommender rather than items
Offline pre-computation of all possible state transitions (cluster pairs → novel cluster) to bypass online LLM latency
Representation of user state using 'cluster descriptions' instead of item IDs to enable semantic reasoning by LLMs
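The offline pre-computation can be sketched as enumerating every ordered pair of cluster descriptions and storing the model's answer. In this hedged illustration, `build_transition_table` and the `predict_novel_cluster` callback (a stand-in for the fine-tuned LLM) are hypothetical names, and dropping outputs that match no known cluster is an assumption rather than a documented step. Ordered pairs over 761 clusters give 761² = 579,121 entries, which matches the number of cluster pairs reported for batch inference.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple


def build_transition_table(
    cluster_descriptions: Sequence[str],
    predict_novel_cluster: Callable[[Tuple[str, str]], str],
) -> Dict[Tuple[str, str], str]:
    """Precompute (cluster_1, cluster_2) -> novel cluster for all ordered pairs."""
    valid = set(cluster_descriptions)
    table: Dict[Tuple[str, str], str] = {}
    for pair in product(cluster_descriptions, repeat=2):
        prediction = predict_novel_cluster(pair)
        if prediction in valid:  # assumption: skip outputs that map to no cluster
            table[pair] = prediction
    return table
```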

Modeling

Base Model: Large Language Model (specific architecture not disclosed, likely PaLM-family)

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Ensure LLM generates valid cluster descriptions.

Formally: Standard language modeling loss on curated cluster transition pairs.
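Written out, a standard language-modeling (next-token cross-entropy) loss over the curated transitions would take the following form. The notation is introduced here for illustration and is not taken from the paper: x is the prompt built from the K recent cluster descriptions, y is the target cluster description with tokens y_1, ..., y_{T_y}, D is the SFT dataset, and θ are the LLM parameters. Whether the loss is computed on target tokens only, as assumed here, is not stated in this summary.

```latex
\mathcal{L}(\theta)
  = -\frac{1}{|\mathcal{D}|}
    \sum_{(x,\, y) \in \mathcal{D}}
    \sum_{t=1}^{T_y}
    \log p_\theta\!\left(y_t \mid x,\, y_{<t}\right)
```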

Training Data:

250K successful novel interest transitions from logs
Filtered to top-10 most frequent context pairs per label to ensure diversity
Final dataset: 7,610 samples (761 clusters * 10 samples each)
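The per-label filtering step can be sketched with a counter per target cluster. This is a minimal illustration assuming each logged transition is a (context pair, label) tuple; `curate` and the `Transition` alias are hypothetical names, and breaking ties by first occurrence is a side effect of `Counter.most_common`, not something the summary specifies. With 761 labels and 10 contexts each, the result has 761 × 10 = 7,610 samples.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, List, Tuple

# ((cluster_1, cluster_2), novel_cluster); the layout is an assumption.
Transition = Tuple[Tuple[str, str], str]


def curate(transitions: List[Transition], per_label: int = 10) -> List[Transition]:
    """Keep the per_label most frequent context pairs for every label."""
    counts: DefaultDict[str, Counter] = defaultdict(Counter)
    for context, label in transitions:
        counts[label][context] += 1

    curated: List[Transition] = []
    for label in sorted(counts):
        # Each label contributes at most per_label samples, however popular it is.
        for context, _ in counts[label].most_common(per_label):
            curated.append((context, label))
    return curated
```

Capping every label at the same number of samples is what flattens the popularity skew of the raw logs.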

Key Hyperparameters:

fine_tuning_steps: 3000
batch_size: 16
history_length_K: 2 (clusters)

Compute: Batch inference for 579,121 cluster pairs takes 'a few hours'

Comparison to Prior Work

vs. Classic Recommendation: Introduces explicit exploration mechanism via LLM world knowledge, breaking local feedback loops
vs. Pure LLM Recommenders: Decouples reasoning (LLM) from retrieval (Classic Model), solving the latency and corpus-freshness problems of direct LLM usage
vs. Search-based grounding: Uses personalized sequence models to retrieve items within clusters, rather than generic search engines [not cited in paper]

Limitations

Relies on offline pre-computation of all cluster transitions; the table grows as N^K for N clusters and history length K (quadratic at K=2), so it is feasible only for small K
Coarse cluster-level planning (K=2) discards fine-grained sequential signal compared to item-level models
Requires high-quality hierarchical clustering of the item corpus as a prerequisite
No statistical significance tests reported for the live experiment results

Reproducibility

No replication artifacts mentioned in the paper. The system is deployed on a proprietary industrial platform (Google/YouTube implied). Data, code, and model weights are not available.

📊 Experiments & Results

Evaluation Setup

Live A/B testing on a large-scale commercial recommendation platform

Benchmarks:

Live Industrial Platform (Video Recommendation)

Metrics:

novelty exploration (increase in novel interest consumption)
user enjoyment (active users, dwell time)
LLM match rate (percentage of outputs matching valid clusters)
recall (alignment with user behavior)
Statistical methodology: Not explicitly reported in the paper
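The LLM match rate listed above can be computed as the share of generated strings that map to a valid cluster. The sketch below assumes exact string match after trimming whitespace, which the summary does not confirm; `match_rate` and its parameter names are hypothetical.

```python
from typing import Iterable, List


def match_rate(outputs: List[str], valid_descriptions: Iterable[str]) -> float:
    """Percentage of generated outputs that exactly match a valid cluster description."""
    if not outputs:
        return 0.0
    valid = set(valid_descriptions)
    # Assumption: surrounding whitespace is ignored, everything else must match exactly.
    hits = sum(1 for text in outputs if text.strip() in valid)
    return 100.0 * hits / len(outputs)
```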

Key Results

Offline analysis of LLM fine-tuning shows the model learns to format outputs and align with user behavior.
Benchmark	Metric	Baseline	This Paper	Δ
Internal Log Data	Match Rate (%)	0	99	+99
Internal Log Data	Recall (Test Set)	Not reported in the paper	Not reported in the paper	Not reported in the paper

Experiment Figures

Distribution of generated interest clusters under different fine-tuning data strategies (random vs. diversified).

Training curves showing 'Match Rate' and 'Recall' over fine-tuning steps.

Main Takeaways

Diversified SFT data is critical: Random sampling leads to mode collapse (recommending only popular clusters), while balanced sampling ensures broad coverage of interest space.
Live experiments indicate that the hybrid approach increases both the consumption of novel interests and overall platform engagement (active users, dwell time), though no significance tests are reported.
Hierarchical planning effectively bridges the gap between LLM reasoning capabilities and the strict latency requirements of industrial systems.

📚 Prerequisite Knowledge

Prerequisites

Understanding of sequential recommendation (e.g., Transformer-based models)
Knowledge of Large Language Models (LLMs) and fine-tuning
Basic concepts of clustering and vector embeddings

Key Terms

interest clusters: Groups of topically coherent items created by clustering item embeddings; used as the unit of planning for the LLM

controlled generation: Fine-tuning the LLM to generate outputs that exactly match a predefined set of cluster descriptions (keywords), ensuring valid mapping to item space

SFT: Supervised Fine-Tuning, i.e. training the LLM on curated examples of successful user interest transitions to align it with domain-specific behaviors

feedback loop: The phenomenon where recommender systems reinforce existing user preferences by only showing familiar content, suppressing discovery

hierarchical planning: A strategy where decisions are made at multiple levels of abstraction; here, a topic cluster is chosen first (high level), then specific items within it (low level)