Netflix Artwork Personalization via LLM Post-training

📝 Paper Summary

Recommendation Systems, Visual-Language Personalization

The paper adapts large language models to personalize movie artwork recommendations by representing user history and visual options as text, then post-training via supervised fine-tuning with reasoning and direct preference optimization.

Core Problem

Standard recommendation systems often use a 'one-size-fits-all' image for a title, failing to appeal to diverse user tastes (e.g., users preferring romance vs. action) even when the title itself is personalized.

Why it matters:

Artworks are critical decision cues for users deciding to watch or skip content
Diverse user bases have heterogeneous preferences that a single image cannot satisfy (e.g., cultural or genre-specific preferences)
Existing LLM recommendation work focuses on title selection but neglects the visual presentation layer (artwork personalization)

Concrete Example: A movie might have both intense action and romantic subplots. A user who loves romance might skip the movie if shown an action-heavy artwork, whereas they would watch it if shown an image emphasizing the characters' relationship. The proposed system predicts this preference.

Key Novelty

Text-Based Visual Personalization via LLM Post-Training

Converts the visual personalization problem into a text-based multiple choice task by captioning artwork images and summarizing user history
Uses 'Prediction with Reasoning', where a larger teacher model (Qwen/QwQ-32B) generates justifications for the ground-truth choices and those justifications supervise a smaller student model (Llama-3.1-8B-Instruct)
Applies Direct Preference Optimization (DPO) to explicitly teach the model to rank the successful artwork higher than rejected alternatives

Evaluation Highlights

+5% improvement in Inverse Propensity Score (IPS) over the Netflix production model using SFT with reasoning
+3% improvement in IPS over the Netflix production model using Direct Preference Optimization (DPO)
Zero-shot Llama 3.1 8B performs significantly better than random guessing, demonstrating inherent world knowledge utility

Breakthrough Assessment

7/10

Novel application of LLMs to the specific industrial problem of artwork personalization. Demonstrates successful transfer of reasoning capabilities to visual preference tasks, though the scope is specific to one platform's dataset.

⚙️ Technical Details

Problem Definition

Setting: Personalized ranking of visual candidates for a given item based on user history

Inputs: User history U (recent K interactions), Title X, Candidate Artworks A_{1:m}(X) represented as text captions

Outputs: Predicted optimal artwork A*

Pipeline Flow

Visual Captioning: Convert all artwork images to text captions
Prompt Construction: Combine user history, title, and artwork captions into a prompt
LLM Inference: Predict the best artwork option (potentially with reasoning)
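
The prompt-construction step can be sketched as follows. This is a minimal illustration, not the paper's template: the wording, the 1-based option numbering, and the names `ArtworkCandidate` and `build_prompt` are all assumptions, and the user history is assumed to be available already as a list of short text items.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ArtworkCandidate:
    """One candidate image for a title, already converted to a text caption."""
    caption: str


def build_prompt(history: List[str], title: str,
                 candidates: List[ArtworkCandidate]) -> str:
    """Assemble a multiple-choice prompt from history, title, and captions.

    The phrasing and the integer option labels are assumptions; the
    summary does not give the actual template.
    """
    history_block = "\n".join(f"- {item}" for item in history)
    options_block = "\n".join(
        f"{i}. {c.caption}" for i, c in enumerate(candidates, start=1)
    )
    return (
        "Recently watched titles:\n"
        f"{history_block}\n\n"
        f"Title being recommended: {title}\n\n"
        "Candidate artworks:\n"
        f"{options_block}\n\n"
        "Which artwork is this user most likely to engage with? "
        "Answer with the option number."
    )
```

Because every candidate is reduced to a caption, the number of options only changes the length of the prompt, not the model interface.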

System Modules

Caption Generator

Convert visual artwork candidates into textual descriptions for the LLM

Model or implementation: Llama-3.2-11B (fine-tuned VLM)

Recommendation Predictor

Select the most appealing artwork for the user

Model or implementation: Llama-3.1-8B-Instruct

Novel Architectural Elements

Reasoning-augmented prediction pipeline where the model first outputs a 'Reason: ...' block derived from a teacher model before making the final selection
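
A reasoning-first output has to be parsed before it can be scored. The sketch below assumes a hypothetical layout of `Reason: ...` followed by `Answer: <option number>`; only the `Reason:` prefix comes from the summary, and the `Answer:` line and the function name `parse_prediction` are illustrative.

```python
import re
from typing import Optional, Tuple

# Hypothetical output layout: "Reason: <free text>\nAnswer: <option number>".
_OUTPUT_RE = re.compile(
    r"Reason:\s*(?P<reason>.*?)\s*Answer:\s*(?P<choice>\d+)",
    re.DOTALL,
)


def parse_prediction(text: str,
                     num_candidates: int) -> Tuple[Optional[str], Optional[int]]:
    """Return (reason, 1-based choice), or (None, None) if the output is unusable."""
    match = _OUTPUT_RE.search(text)
    if match is None:
        return None, None
    choice = int(match.group("choice"))
    if not 1 <= choice <= num_candidates:
        return None, None  # the model named an option that does not exist
    return match.group("reason"), choice
```

Treating malformed or out-of-range outputs as misses keeps the accuracy computation honest when the model drifts from the expected format.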

Modeling

Base Model: Llama-3.1-8B-Instruct

Training Method: Supervised Fine-Tuning (SFT) with Reasoning Distillation and Direct Preference Optimization (DPO)

Objective Functions:

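Purpose: SFT Loss.

Formally: Standard cross-entropy loss on the target tokens (the prediction alone, or the reasoning followed by the prediction).

Purpose: DPO Loss.

Formally: $\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi(a_w \mid x)}{\pi_{\mathrm{ref}}(a_w \mid x)} - \beta \log \frac{\pi(a_l \mid x)}{\pi_{\mathrm{ref}}(a_l \mid x)}\right)\right]$, where $x$ is the prompt, $a_w$ is the chosen artwork, $a_l$ is a rejected artwork, $\pi$ is the model being trained, $\pi_{\mathrm{ref}}$ is the frozen reference model, $\sigma$ is the logistic sigmoid, and $\beta$ controls how strongly the policy is held to the reference.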
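
For a single (chosen, rejected) pair, the DPO objective above reduces to a few lines of arithmetic on sequence log-probabilities. The sketch below uses only the standard library. The default `beta=0.1` is an assumption, since the summary does not report the value used.

```python
import math


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    logp_*     : summed log-probability under the policy being trained.
    ref_logp_* : summed log-probability under the frozen reference model.
    beta       : assumed value; not reported in the summary.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch for stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen artwork's likelihood relative to the rejected one.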

Adaptation: LoRA
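
As a reminder of what LoRA changes in a layer, the sketch below computes the adapted forward pass for one weight matrix in plain Python. The rank and scaling factor used in the paper are not given in the summary, so `alpha` and `rank` here are placeholders, and `lora_forward` is an illustrative name.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: List[float],
                 alpha: float, rank: int) -> List[float]:
    """Compute (W + (alpha / rank) * B @ A) @ x without forming the full update.

    w: frozen d_out x d_in weight; a: rank x d_in; b: d_out x rank.
    Only a and b would receive gradient updates.
    """
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]
```

With `b` initialized to zeros, the adapted layer reproduces the frozen layer exactly, which is why LoRA training starts from the base model's behavior.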

Training Data:

110K user-title pairs for training
Reasoning data generated by Qwen/QwQ-32B conditioned on ground truth artwork
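
One way to turn a logged example into training records for the two post-training stages is sketched below. The field names (`prompt`, `completion`, `chosen`, `rejected`), the `Answer:` line, and the choice to pair the ground-truth artwork against every other candidate are assumptions; the summary does not say how rejected alternatives were selected.

```python
from typing import Dict, List


def make_sft_example(prompt: str, reason: str, chosen: int) -> Dict[str, str]:
    """SFT record whose target is the teacher's reasoning followed by the label."""
    return {"prompt": prompt, "completion": f"Reason: {reason}\nAnswer: {chosen}"}


def make_dpo_pairs(prompt: str, chosen: int,
                   num_candidates: int) -> List[Dict[str, str]]:
    """One preference pair per non-chosen candidate (1-based option numbers)."""
    return [
        {"prompt": prompt, "chosen": str(chosen), "rejected": str(other)}
        for other in range(1, num_candidates + 1)
        if other != chosen
    ]
```

Here `reason` stands for the justification produced by the teacher model once it has been shown the ground-truth artwork.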

Key Hyperparameters:

learning_rate: Search space: {1e-7, 5e-7, 1e-6, 5e-6, 1e-5, 1e-4}

Compute: Not reported in the paper

Comparison to Prior Work

vs. Netflix Production Model: Uses LLM reasoning and world knowledge vs. likely traditional latent factor or bandit models
vs. PALR: Focuses on visual sub-selection (artwork) rather than item selection (movie title)
vs. CLIP-based approaches [not cited in paper]: Uses LLM textual reasoning over captions rather than direct visual embedding similarity matching

Limitations

Relies on proprietary Netflix data, making external reproduction difficult
Inference latency and cost of LLM-based serving not analyzed compared to traditional lightweight rankers
Accuracy metric biased by candidate set size (addressed by IPS, but still a factor)
Depends on quality of VLM captions; poor captions could bottleneck performance

Reproducibility

Code not provided. Dataset is proprietary Netflix user data. Methodology relies on open weights models (Llama 3.1, Qwen/QwQ) but cannot be exactly reproduced without the private dataset.

📊 Experiments & Results

Evaluation Setup

Offline evaluation on held-out user-title pairs from Netflix logs

Benchmarks:

Netflix Internal Dataset (Personalized Artwork Selection) [New]

Metrics:

Inverse Propensity Score (IPS)
Accuracy
Statistical methodology: Not explicitly reported in the paper
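
The IPS metric can be illustrated with the textbook inverse-propensity estimator. The paper's exact formulation is not given in the summary, so the sketch below is the generic version, with hypothetical field names: a logged impression contributes only when the evaluated model picks the artwork that was actually shown, and its reward is divided by the probability with which the logging policy showed it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoggedImpression:
    shown: int          # index of the artwork the logging policy displayed
    propensity: float   # probability the logging policy had of showing it
    reward: float       # observed outcome, e.g. 1.0 for a play, 0.0 otherwise
    predicted: int      # index the model under evaluation would have picked


def ips_estimate(logs: List[LoggedImpression]) -> float:
    """Mean of reward / propensity over impressions where the model agrees with the log."""
    if not logs:
        raise ValueError("need at least one logged impression")
    total = 0.0
    for row in logs:
        if row.propensity <= 0.0:
            raise ValueError("propensity must be positive")
        if row.predicted == row.shown:
            total += row.reward / row.propensity
    return total / len(logs)
```

Dividing by the propensity is what makes a correct pick among many candidates count for more than a correct pick among two, which plain accuracy does not do.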

Key Results

Main results comparing post-trained LLMs against the Netflix Production Model baseline. IPS is reported as improvement over the production model, so the baseline is 0.00.
Benchmark	Method	Metric	Baseline	This Paper	Δ
Netflix Internal Dataset	SFT with reasoning	IPS (Inverse Propensity Score)	0.00	0.05	+0.05
Netflix Internal Dataset	DPO	IPS (Inverse Propensity Score)	0.00	0.03	+0.03

Ablation on output format showing sensitivity to text vs integer outputs.
Benchmark	Metric	Baseline	This Paper	Δ
Netflix Internal Dataset	Accuracy	0.224	0.244	+0.020
Netflix Internal Dataset	Accuracy	0.334	0.456	+0.122

Experiment Figures

Breakdown of model accuracy across different ground truth labels (artwork indices) for the 3B model.

Main Takeaways

Post-trained LLMs outperform production baselines, suggesting LLMs capture nuanced user preferences better than traditional models.
Reasoning distillation (teaching the model 'why' an artwork is good) yields the highest performance gains (+5%).
Model size matters for output formatting: smaller models (3B) bias toward small integers, while larger models (8B) handle text outputs more effectively.
Zero-shot performance is surprisingly strong, indicating pre-trained LLMs already possess relevant world knowledge for personalization.

📚 Prerequisite Knowledge

Prerequisites

Basics of Recommender Systems (user history, items)
Large Language Models (SFT, prompting)
Preference Optimization (DPO)

Key Terms

IPS: Inverse Propensity Score, an evaluation metric that weights each correct prediction by the inverse of the probability that the item was shown, which accounts for varying candidate set sizes (e.g., a correct guess among 40 options counts for more than one among 2)

SFT: Supervised Fine-Tuning—training a model on labeled examples of inputs and desired outputs

DPO: Direct Preference Optimization—a method to align language models to preferences by optimizing the relative likelihood of chosen vs. rejected responses without a separate reward model

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and trains small rank-decomposition matrices

Reasoning Distillation: Using a larger, more capable model to generate explanations or reasoning steps for a correct answer, then training a smaller model on these explanations

VLM: Visual Language Model—a model capable of understanding and generating text based on image inputs, used here to caption artworks