User Feedback Alignment for LLM-powered Exploration in Large-scale Recommendation Systems

📝 Paper Summary

Recommender Systems, LLM alignment

This paper decouples novelty generation from user preference alignment by using two specialized LLMs, one that explores new interests and one that ranks them using collective user feedback, and combines them through inference-time sampling.

Core Problem

Exploration in recommender systems is difficult because implicit feedback loops reinforce existing preferences, and aligning a single LLM to be both novel and relevant is unstable, leading to catastrophic forgetting or reward hacking.

Why it matters:

Traditional collaborative filtering reinforces feedback loops, limiting users to their established interests and hurting long-term engagement
Standard RLHF fails for exploration: the policy hacks the reward model (emitting popular words like 'cat' or 'BTS') and loses the ability to generate structured, diverse plans
Novelty and relevance are competing objectives; optimizing for one in a single model often degrades the other

Concrete Example: When using standard RLHF to align a novelty-seeking LLM, the model collapsed after 5k steps: its adherence to output format dropped from >99% to 2%, and it began spamming high-reward terms like 'toys' instead of generating valid interest clusters.

Key Novelty

Decoupled Dual-LLM Exploration with Inference Scaling

Separates the problem into two models: a 'Novelty Model' (policy) that generates diverse candidate interests, and an 'Alignment Model' (reward) trained on collective user feedback to score them
Uses inference-time scaling (generating many candidates at high temperature) to find options that satisfy both novelty and relevance, rather than forcing one model to learn both simultaneously
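The role of high temperature in the sampling step can be illustrated with a minimal sketch. The logits below are hypothetical values over four candidate interest clusters, and the helper names are illustrative rather than taken from the paper. Dividing logits by a larger temperature flattens the sampling distribution, so repeated sampling is more likely to surface distinct candidates.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a more spread-out distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical logits over four candidate interest clusters (assumption).
logits = [4.0, 2.0, 1.0, 0.5]
low_t = softmax_with_temperature(logits, 0.5)
high_t = softmax_with_temperature(logits, 2.0)

print("top-1 probability at T=0.5:", round(low_t[0], 3))
print("top-1 probability at T=2.0:", round(high_t[0], 3))
```

At the higher temperature the top candidate takes a smaller share of the probability mass, which is what makes generating many samples worthwhile.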

Architecture

Figure: the Hierarchical Planning Paradigm, showing the flow from User History -> LLM Novelty Prediction -> Novel Interest Clusters -> Backbone Recommender -> Final Items.

Evaluation Highlights

Significant gains in user satisfaction (measured by watch activity) and active user counts in live experiments on a platform with billions of users
Improved offline ranking metrics (F1@K and NDCG@K) against ground-truth user feedback compared to random baselines
Outperforms production baselines including hierarchical contextual bandits and neural linear bandits in both novelty and quality metrics

Breakthrough Assessment

8/10

Highly practical solution deployed at massive scale (YouTube). Sidesteps the 'reward hacking' problem observed with RLHF in this setting by decoupling generation and evaluation.

⚙️ Technical Details

Problem Definition

Setting: Next novel interest cluster prediction given a sequence of user's historical interest clusters

Inputs: Sequence of K historical interest clusters S_u representing the user's recent interaction history

Outputs: A predicted novel interest cluster C_n that is outside the user's current history but relevant

Pipeline Flow

Novelty Model (generates candidate clusters)
Alignment Model (scores candidates)
Selection Strategy (picks top-k clusters)
Downstream Recommender (serves items from clusters)
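The first three stages of the flow above can be sketched as a best-of-N selection loop. This is a minimal sketch under stated assumptions, not the paper's implementation: `toy_generate` and `toy_score` are hypothetical stand-ins for the Novelty Model and Alignment Model, and the cluster names and scores are made up for illustration.

```python
from typing import Callable, List, Tuple

History = Tuple[str, ...]

def best_of_n(
    history: History,
    generate: Callable[[History, int], List[str]],
    score: Callable[[History, str], float],
    n_samples: int,
    top_k: int,
) -> List[str]:
    """Sample n candidates, drop repeats and already-seen clusters, keep the top_k by score."""
    candidates = generate(history, n_samples)
    seen = set(history)  # clusters already in the history are not novel
    unique = []
    for cluster in candidates:
        if cluster not in seen:
            seen.add(cluster)
            unique.append(cluster)
    ranked = sorted(unique, key=lambda c: score(history, c), reverse=True)
    return ranked[:top_k]

def toy_generate(history: History, n: int) -> List[str]:
    # Hypothetical stand-in for the Novelty Model (assumption).
    pool = ["bouldering", "hiking", "sourdough baking",
            "hiking", "trail running", "chess openings"]
    return pool[:n]

def toy_score(history: History, cluster: str) -> float:
    # Hypothetical stand-in for the Alignment Model (assumption).
    scores = {"bouldering": 0.62, "sourdough baking": 0.18,
              "trail running": 0.71, "chess openings": 0.25}
    return scores.get(cluster, 0.0)

picked = best_of_n(("hiking", "camping"), toy_generate, toy_score,
                   n_samples=6, top_k=2)
print(picked)  # ['trail running', 'bouldering']
```

Neither model's weights change in this loop; the trade-off between novelty and relevance is settled purely by how many candidates are sampled and how they are ranked.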

System Modules

Novelty Model

Generate diverse, novel interest cluster candidates based on user history

Model or implementation: Gemini (fine-tuned on interest transitions)

Alignment Model

Predict the probability of user engagement for a given history-to-future cluster transition

Model or implementation: Gemini (LLM with linear projection head)

Selector

Select the best-of-N clusters to serve

Model or implementation: Ranking logic

Novel Architectural Elements

Decoupling of exploration (Novelty Model) and exploitation/relevance (Alignment Model) into separate LLMs to prevent objective conflict
Integration of inference-time scaling (Best-of-N) specifically for recommender system exploration

Modeling

Base Model: Gemini

Training Method: Supervised Fine-Tuning (Novelty Model) and Reward Modeling (Alignment Model)

Objective Functions:

Purpose: Train alignment model to predict user engagement.

Formally: Cross-entropy loss between prediction and aggregated user feedback score (e.g., positive playback rate).
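One plausible reading of this objective is a binary cross-entropy with a soft target, where the target is the aggregated feedback rate in [0, 1]. The summary does not give the exact formulation, so the function below is a sketch under that assumption, and its name and parameters are illustrative.

```python
import math

def soft_label_cross_entropy(predicted_prob: float, target_rate: float,
                             eps: float = 1e-7) -> float:
    """Binary cross-entropy where the target is a rate in [0, 1] instead of a hard 0/1 label."""
    p = min(max(predicted_prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target_rate * math.log(p) + (1.0 - target_rate) * math.log(1.0 - p))

# Hypothetical transition whose aggregated positive playback rate is 0.3 (assumption).
close = soft_label_cross_entropy(0.3, 0.3)
far = soft_label_cross_entropy(0.8, 0.3)
print(close < far)  # True
```

With a soft target the loss is minimized when the predicted probability equals the observed rate, so the model learns a calibrated engagement estimate rather than a hard class.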

Training Data:

Novelty Model: <8k examples of novel interest transitions mined from user history
Alignment Model: Aggregated feedback tuples ({C_1...C_K}, C_n, L), where L is the collective feedback score (e.g., like rate) for that transition
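The aggregated tuples described above could be built by grouping per-user feedback events by transition. The sketch below assumes a simple event format of (history, next cluster, positive flag) and uses the positive rate as L; the event schema and function name are hypothetical, since the summary does not describe the logging pipeline.

```python
from collections import defaultdict

def aggregate_feedback(events):
    """Group (history, next_cluster, positive) events into (history, next_cluster, rate) tuples."""
    counts = defaultdict(lambda: [0, 0])  # transition -> [positives, total]
    for history, next_cluster, positive in events:
        key = (tuple(history), next_cluster)
        counts[key][0] += int(positive)
        counts[key][1] += 1
    return [(h, c, pos / total) for (h, c), (pos, total) in counts.items()]

# Hypothetical feedback events from different users (assumption), with K = 2.
events = [
    (("hiking", "camping"), "bouldering", True),
    (("hiking", "camping"), "bouldering", False),
    (("hiking", "camping"), "bouldering", True),
    (("hiking", "camping"), "bouldering", True),
    (("hiking", "camping"), "chess openings", False),
]
tuples = aggregate_feedback(events)
for row in tuples:
    print(row)
```

Because the label is pooled across users who share the same transition, rare transitions yield noisy rates; a production pipeline would likely apply a minimum-count threshold, which is omitted here.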

Key Hyperparameters:

history_length_K: 2
inference_samples_N: 5x more predictions than baseline
alignment_training_steps_favorable: 50,000

Compute: Offline bulk inference (amortized cost); no latency impact on live serving as predictions are pre-computed

Comparison to Prior Work

vs. Hierarchical Contextual Bandit: Uses LLM world knowledge for transitions rather than just historical click correlations
vs. RLHF: Decouples the generation and reward models entirely; uses inference scaling instead of updating the policy weights to avoid mode collapse/reward hacking
vs. PIE (Personalized Interest Exploration) [not cited in paper]: PIE uses bandits for creator affinity; this method uses LLMs for semantic cluster transitions

Limitations

Relies on offline pre-computation, limiting real-time adaptation to immediate user actions
Feedback signal is aggregated collectively, potentially missing niche individual preferences
Current implementation limited to short history length (K=2)
Requires serving two separate LLMs (though amortized offline)

Reproducibility

No code or data provided. The system is deployed on a proprietary commercial platform (YouTube). Architecture is described but implementation details rely on internal infrastructure.

📊 Experiments & Results

Evaluation Setup

Live A/B testing on a commercial short-form video platform (billions of users) and offline metric evaluation

Benchmarks:

Live Production Traffic (Recommendation)

Metrics:

User Satisfaction (Watch activity / Positive playback rate)
Active User Counts
Exploration Diversity (Novelty)
Offline: F1@K
Offline: NDCG@K
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Internal Holdout Set	F1@K	Not reported in the paper	Not reported in the paper	Not reported in the paper
Live Production Traffic	User Satisfaction (Positive Playback Rate)	Not reported in the paper	Not reported in the paper	Positive gain

Experiment Figures

Offline performance metrics (F1 and NDCG) of the Alignment Model during training steps.

A scatter plot comparing 'Novelty' (x-axis) vs 'Quality' (y-axis) for different production models relative to a Bandit baseline.

Main Takeaways

Decoupling novelty and alignment prevents the 'reward hacking' observed in standard RLHF, where models collapse to predicting generic popular terms.
Inference-time scaling (generating 5x candidates and ranking) is effective for balancing competing objectives (novelty vs. relevance) without retraining the policy model.
Offline F1@K is a better proxy for online user satisfaction than NDCG for this task, as the goal is to find *any* good novel cluster (top-k set) rather than ranking them perfectly.
The system outperforms both exploration-oriented baselines (Bandits) and exploitation-oriented baselines (Sequential models) in live experiments.
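The distinction between F1@K and NDCG@K in the takeaways above can be made concrete with a small sketch. It assumes binary relevance and the standard log2 discount; the paper's exact metric definitions are not given in this summary, and the cluster labels are placeholders.

```python
import math

def f1_at_k(ranked, relevant, k):
    """Harmonic mean of precision@k and recall@k; ignores order within the top k."""
    top = ranked[:k]
    hits = sum(1 for c in top if c in relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k; rewards placing relevant items earlier."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, c in enumerate(ranked[:k]) if c in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

relevant = {"a", "b"}               # placeholder clusters with positive feedback
good_order = ["a", "b", "x", "y"]   # relevant clusters ranked first
poor_order = ["x", "y", "a", "b"]   # same set, relevant clusters ranked last

print(f1_at_k(good_order, relevant, 4), f1_at_k(poor_order, relevant, 4))
print(ndcg_at_k(good_order, relevant, 4), ndcg_at_k(poor_order, relevant, 4))
```

Both orderings receive the same F1@4 because the selected set is identical, while NDCG@4 penalizes the second ordering. This matches the intuition that a set-based metric fits a task where any good cluster in the top k is a success.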

📚 Prerequisite Knowledge

Prerequisites

Hierarchical planning in recommender systems
Reinforcement Learning from Human Feedback (RLHF)
Best-of-N sampling (Rejection Sampling)

Key Terms

Interest Cluster: A topically coherent group of items generated from item metadata and content embeddings

Hierarchical Planning: A paradigm where an LLM plans high-level user interests (clusters) while a traditional recommender retrieves specific items within those clusters

Inference-time Scaling: Improving model performance by generating multiple outputs and selecting the best one, rather than just training a better model

Alignment Model: A specific LLM trained to act as a reward model, scoring how likely a user is to engage with a predicted interest cluster

Pointwise Training: Training a model to predict a score for a single input item, as opposed to ranking a pair of items

Feedback Loop: The phenomenon where a system recommends what a user likes, the user clicks it, and the system learns to recommend only that type of content, narrowing the user's experience