Towards Graph Foundation Models for Personalization

📝 Paper Summary

Graph Foundation Models, Industrial Recommender Systems

A domain-specific graph foundation model for personalization that combines a static heterogeneous graph neural network (HGNN) for learning general item representations with a dynamic Two-Tower model for efficient task adaptation.

Core Problem

Existing personalization approaches often build siloed solutions for different item types or struggle to adapt foundation models to dynamic, large-scale industrial catalogs where user preferences and items change frequently.

Why it matters:

Traditional siloed models fail to leverage shared information across different content types (e.g., podcasts vs. audiobooks), limiting recommendation quality.
Directly using LLMs for personalization at scale is challenging due to high latency and difficulty adapting quickly to catalog changes.
Standard GNNs often lack the generalization capabilities required to serve as a foundation model across diverse downstream tasks without frequent retraining.

Concrete Example: In an audio streaming platform, a user might listen to both podcasts and audiobooks. A siloed audiobook recommender cannot leverage the rich interaction signals from the user's podcast history to improve recommendations, whereas a unified graph model can transfer this knowledge.

Key Novelty

Static-Dynamic Decoupling for Graph Foundation Models

Splits the architecture into a 'static' foundation layer (HGNN + LLM) that learns general item embeddings from content and co-interaction graphs, and a 'dynamic' adaptation layer (Two-Tower model) that updates frequently.
Uses an LLM to featurize nodes solely based on text, allowing the graph to include any item type (podcasts, audiobooks) in a unified vector space without type-specific engineering.
De-couples content representation (handled by the heavy HGNN/LLM foundation) from user representation (handled by the lightweight Two-Tower model), enabling scalability.

Architecture

The overall Graph Foundation Model architecture, illustrating the Static Layer (Graph construction, HGNN training) and the Dynamic Layer (Two-Tower model adaptation).

Evaluation Highlights

Unified 2T model (trained on both podcasts and audiobooks) outperforms a content-specific audiobook-only baseline on Hit-Rate@10.
Ablation shows that removing the GNN component substantially degrades performance for both podcast and audiobook recommendations, indicating that structural signals add value beyond text features alone.
The 'static' HGNN foundation model remains stable over time: retraining it daily yields negligible performance gains compared to using a frozen version, which supports the static/dynamic split as an efficient design.

Breakthrough Assessment

7/10

Presents a pragmatic, scalable architecture for Graph Foundation Models in industry. While not a theoretical breakthrough in GNNs, the static/dynamic decoupling and unified cross-content representation offer a strong blueprint for real-world application.

⚙️ Technical Details

Problem Definition

Setting: Personalized recommendation ranking in a large-scale industrial setting with heterogeneous item types

Inputs: User interaction history (streams), item metadata (descriptions), and item-item co-interaction graph

Outputs: Ranked list of items for a target user

Pipeline Flow

Input Processing: Text Featurization (LLM) + Graph Construction
Static Foundation Layer: HGNN Representation Learning
Dynamic Adaptation Layer: Two-Tower (2T) Model Training

System Modules

Text Featurizer (Input Processing)

Convert item descriptions into initial node features

Model or implementation: Sentence BERT

Graph Constructor (Input Processing)

Build heterogeneous item-item graph based on co-interactions

Model or implementation: Rule-based graph construction
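
The rule-based construction can be illustrated with a minimal sketch, assuming the simplest possible rule: two items are linked when at least `min_users` users streamed both. The function name, the `min_users` threshold, and the way edge types are derived from the endpoint item types are illustrative assumptions, not details from the paper.

```python
from collections import defaultdict
from itertools import combinations


def build_cointeraction_graph(user_streams, item_types, min_users=1):
    """Return {edge_type: {(item_a, item_b): shared_user_count}}.

    user_streams: dict user_id -> iterable of item ids the user streamed.
    item_types:   dict item_id -> type string, e.g. "podcast" or "audiobook".
    min_users:    hypothetical threshold on shared users (an assumption).
    """
    counts = defaultdict(int)
    for items in user_streams.values():
        # Each unordered pair of distinct items counts once per user.
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1

    graph = defaultdict(dict)
    for (a, b), n in counts.items():
        if n < min_users:
            continue
        # Edge type comes from the endpoint types; user nodes never enter the graph.
        edge_type = "-".join(sorted((item_types[a], item_types[b])))
        graph[edge_type][(a, b)] = n
    return dict(graph)


streams = {
    "u1": ["pod_1", "pod_2", "book_1"],
    "u2": ["pod_1", "book_1"],
    "u3": ["pod_2"],
}
types = {"pod_1": "podcast", "pod_2": "podcast", "book_1": "audiobook"}
graph = build_cointeraction_graph(streams, types)
```

Users only contribute counts here and are discarded afterwards, which matches the summary's point that the foundation graph is item-item only.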

Foundation HGNN

Learn general-purpose item embeddings capturing content and structure

Model or implementation: GraphSAGE (Heterogeneous variant)
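
The summary names a heterogeneous GraphSAGE variant but gives no layer equations. As a rough sketch, one mean-aggregator layer with a separate weight matrix per edge type (a common way to extend GraphSAGE to typed graphs, assumed here rather than taken from the paper) could look like this. All names and the toy weights are hypothetical.

```python
def mean(vectors, dim):
    """Element-wise mean of a list of vectors; zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def hetero_sage_layer(features, neighbors, W_self, W_rel):
    """One mean-aggregator layer over a typed item-item graph.

    features:  dict node -> input vector (e.g. a text embedding).
    neighbors: dict edge_type -> dict node -> list of neighbour nodes.
    W_self:    weight matrix applied to the node's own vector.
    W_rel:     dict edge_type -> weight matrix for that type's neighbour mean.
    """
    dim = len(next(iter(features.values())))
    out = {}
    for node, x in features.items():
        h = matvec(W_self, x)
        for edge_type, adj in neighbors.items():
            agg = mean([features[n] for n in adj.get(node, [])], dim)
            msg = matvec(W_rel[edge_type], agg)
            h = [a + b for a, b in zip(h, msg)]
        out[node] = [max(0.0, v) for v in h]  # ReLU non-linearity
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]  # toy weights; real ones would be learned
features = {"pod_1": [1.0, 0.0], "pod_2": [0.0, 1.0], "book_1": [1.0, 1.0]}
neighbors = {
    "podcast-podcast": {"pod_1": ["pod_2"], "pod_2": ["pod_1"]},
    "audiobook-podcast": {
        "book_1": ["pod_1", "pod_2"],
        "pod_1": ["book_1"],
        "pod_2": ["book_1"],
    },
}
W_rel = {"podcast-podcast": identity, "audiobook-podcast": identity}
embeddings = hetero_sage_layer(features, neighbors, identity, W_rel)
```

Because the layer depends only on a node's features and its neighbours' features, it can be applied to an item added after training, which is the inductive property the summary highlights for dynamic catalogs.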

Unified 2T Model

Adapt embeddings for specific personalization tasks (ranking)

Model or implementation: Two-Tower Neural Network
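
The retrieval step of a two-tower model reduces to a dot product between the two tower outputs followed by a top-k sort. The sketch below shows only that step. The towers are passed in as plain functions, and the two stand-in towers (a history mean and an identity map) are hypothetical placeholders for the learned networks the paper trains.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rank_items(user_features, item_embeddings, user_tower, item_tower, k=10):
    """Score each candidate by the dot product of tower outputs; return top-k ids."""
    u = user_tower(user_features)
    scores = {item: dot(u, item_tower(emb)) for item, emb in item_embeddings.items()}
    # Sort by descending score, breaking ties by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]


def mean_history_tower(history_vectors):
    """Hypothetical user tower: average the embeddings of streamed items."""
    dim = len(history_vectors[0])
    return [sum(v[i] for v in history_vectors) / len(history_vectors) for i in range(dim)]


def identity_tower(embedding):
    """Hypothetical item tower: pass the foundation embedding through unchanged."""
    return embedding


# Podcasts and audiobooks share one embedding space and one item tower.
items = {"pod_1": [1.0, 0.0], "pod_2": [0.0, 1.0], "book_1": [0.9, 0.1]}
history = [items["pod_1"]]
top = rank_items(history, items, mean_history_tower, identity_tower, k=2)
```

The item tower never inspects the item type, so an audiobook can be ranked for a user whose history contains only podcasts.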

Novel Architectural Elements

Differentiation between 'static' (HGNN) and 'dynamic' (2T) layers to balance representation quality with training frequency
Type-agnostic item tower in the 2T model that relies on unified LLM+HGNN embeddings rather than type-specific features
Graph construction using purely item-item co-interactions (excluding user nodes) to keep the foundation layer generic

Modeling

Base Model: GraphSAGE (HGNN) + Sentence BERT (LLM) + Custom Two-Tower MLP

Training Method: Two-stage training: self-supervised pre-training of the HGNN, followed by supervised training of the 2T model with a ranking loss

Objective Functions:

Purpose: Refine HGNN node representations to reflect structural proximity.

Formally: Self-supervised link-prediction loss (similar to GraphSAGE/PinSage)

Purpose: Optimize the 2T model for recommendation accuracy.

Formally: Standard recommendation loss (implied, likely softmax or pairwise ranking, exact loss not specified)
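
For the first objective (HGNN link prediction), the summary only says the loss is "similar to GraphSAGE/PinSage". As one possible reading, the sketch below implements the negative-sampling loss from the original GraphSAGE paper for a single positive edge. PinSage uses a max-margin variant instead, and which form the authors use is not stated, so treat this as an assumption.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def link_prediction_loss(z_src, z_pos, z_negs):
    """Negative-sampling loss for one linked pair and its sampled negatives.

    z_src:  embedding of the source item.
    z_pos:  embedding of an item linked to the source in the graph.
    z_negs: embeddings of sampled items assumed not linked to the source.
    """
    pos_term = log_sigmoid(dot(z_src, z_pos))
    neg_term = sum(log_sigmoid(-dot(z_src, z_n)) for z_n in z_negs)
    return -(pos_term + neg_term)


aligned = link_prediction_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
misaligned = link_prediction_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
```

The loss is lower when linked items have similar embeddings and sampled negatives have dissimilar ones, which is what "structural proximity" means in this objective.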

Training Data:

10M users, 3.5M podcasts, 250K audiobooks from Spotify
Training: 90 days of interaction data
Evaluation: Hold-out set of streams from the last 14 days
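
A time-based split of this kind might be sketched as follows. The summary does not say whether the 14-day hold-out lies inside or after the 90-day window. This sketch assumes the training window ends the day before the hold-out window starts, and the function and parameter names are hypothetical.

```python
from datetime import date, timedelta


def temporal_split(events, end_date, train_days=90, holdout_days=14):
    """Split (user_id, item_id, day) events into train and hold-out lists.

    Assumption: the hold-out covers the last `holdout_days` calendar days up to
    and including `end_date`; training covers the `train_days` days before it.
    """
    holdout_start = end_date - timedelta(days=holdout_days - 1)
    train_start = holdout_start - timedelta(days=train_days)
    train = [e for e in events if train_start <= e[2] < holdout_start]
    holdout = [e for e in events if holdout_start <= e[2] <= end_date]
    return train, holdout


events = [
    ("u1", "pod_1", date(2024, 3, 1)),    # inside the training window
    ("u1", "book_1", date(2024, 4, 20)),  # inside the hold-out window
    ("u2", "pod_2", date(2023, 12, 1)),   # older than the training window
]
train, holdout = temporal_split(events, end_date=date(2024, 4, 30))
```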

Compute: Not reported in the paper

Comparison to Prior Work

vs. ULTRA: Proposed model is domain-specific (personalization) rather than task-specific (KG completion).
vs. Standard 2T: Uses a single Unified Item Tower agnostic to content type, powered by HGNN embeddings, rather than separate towers for podcasts/audiobooks.
vs. Standard GNNs: Decouples the structural learning (static) from the user preference learning (dynamic), allowing infrequent retraining of the heavy graph component.
vs. Xie et al. (Graph-aware LM) [cited]: Similar pre-training concept, but focuses on constructing a specific personalization foundation model rather than general graph-LM pre-training.

📊 Experiments & Results

Evaluation Setup

Next-item recommendation on a large-scale audio streaming platform

Benchmarks:

Spotify Internal Dataset (Item Recommendation (Podcast & Audiobook)) [New]

Metrics:

Hit-Rate@10 (HR@10)
Statistical methodology: Not explicitly reported in the paper

Key Results

Generalization experiments comparing the Unified 2T model against content-specific baselines and ablations.
Benchmark	Metric	Baseline	This Paper	Δ
Spotify Internal Dataset	HR@10	0.149	0.158	+0.009
Spotify Internal Dataset	HR@10 (Audiobooks)	0.088	0.142	+0.054
Spotify Internal Dataset	HR@10 (Podcasts)	0.113	0.169	+0.056
Spotify Internal Dataset	HR@10 (Audiobooks)	0.142	0.141	-0.001
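
The Δ column reports absolute differences in HR@10. Relative changes can be easier to compare across rows with different baselines. The short script below recomputes both from the table values. It is a convenience calculation on the reported numbers, not a result from the paper.

```python
# (metric label, baseline HR@10, this paper's HR@10), copied from the table above.
rows = [
    ("HR@10", 0.149, 0.158),
    ("HR@10 (Audiobooks)", 0.088, 0.142),
    ("HR@10 (Podcasts)", 0.113, 0.169),
    ("HR@10 (Audiobooks)", 0.142, 0.141),
]


def lift(baseline, ours):
    """Return (absolute delta, relative change in percent)."""
    delta = ours - baseline
    return round(delta, 3), round(100.0 * delta / baseline, 1)


for metric, baseline, ours in rows:
    delta, pct = lift(baseline, ours)
    print(f"{metric:<20} {delta:+.3f}  ({pct:+.1f}%)")
```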

Main Takeaways

The Unified model generalizes better than single-content models, showing that cross-domain signals (podcasts helping audiobooks) are valuable.
The GNN component is critical; removing it and learning content vectors from scratch drastically reduces hit rate.
The foundation layer (HGNN) is temporally stable; infrequent retraining does not hurt performance, validating the static/dynamic architectural split for scalability.

📚 Prerequisite Knowledge

Prerequisites

Graph Neural Networks (GNNs) and Heterogeneous Graphs
Two-Tower (2T) Recommendation Architectures
Large Language Models (LLMs) for text embedding
Self-supervised link prediction

Key Terms

HGNN: Heterogeneous Graph Neural Network, a GNN designed to handle graphs with multiple types of nodes (items) and edges (relations)

Two-Tower Model (2T): A recommendation architecture with separate neural networks (towers) for users and items, typically combining their outputs via a dot product to predict relevance

Foundation Model (FM): A large-scale model trained on vast data that can be adapted to many downstream tasks; here applied to graph data as a Graph Foundation Model (GFM)

Static Layer: The heavy, infrequently updated part of the model (HGNN) that learns general-purpose item representations

Dynamic Layer: The lightweight, frequently updated part of the model (2T) that adapts representations to specific tasks and recent user behavior

Inductive capability: The ability of a model to generate embeddings for new nodes (items) not seen during training, crucial for dynamic catalogs

Co-interaction signals: Edges created between two items if they have been interacted with (e.g., streamed) by the same user

Hit-Rate@K: A metric measuring the proportion of test cases where the target item appears in the top K recommendations
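
The Hit-Rate@K definition above translates directly into code. This is a minimal sketch assuming one held-out target item per test case. The function name and input layout are illustrative.

```python
def hit_rate_at_k(recommendations, targets, k=10):
    """Fraction of test cases whose target item appears in the top-k list.

    recommendations: dict case_id -> ranked list of item ids (best first).
    targets:         dict case_id -> the single held-out item for that case.
    """
    if not targets:
        return 0.0
    hits = sum(
        1
        for case, target in targets.items()
        if target in recommendations.get(case, [])[:k]
    )
    return hits / len(targets)


recs = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
held_out = {"u1": "b", "u2": "x"}
score = hit_rate_at_k(recs, held_out, k=10)
```

A case with no recommendation list counts as a miss, so the denominator is always the number of test cases.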