SPAR: Personalized Content-Based Recommendation via Long Engagement Attention

📝 Paper Summary

Content-based Recommendation, News Recommendation, Book Recommendation

SPAR efficiently extracts user interests from long engagement histories by combining session-based language model encoding with sparse attention mechanisms and global summaries generated by large language models.

Core Problem

Pretrained Language Models (PLMs) struggle to process very long user engagement histories (often >5K tokens) due to quadratic attention complexity and token limits, leading to loss of fine-grained interest signals.

Why it matters:

Platforms like Google News or Reddit generate massive user histories that exceed standard model capacities, forcing systems to truncate data or lose cross-item context
Existing methods that encode items separately and average them fail to capture the complex, sequential evolution of user interests
Effective personalization requires maintaining standalone user/item embeddings for efficient retrieval while still capturing deep semantic interactions

Concrete Example: A user's history might contain 50 news articles (approx. 5K tokens). A standard BERT model (512 token limit) must either truncate 90% of the history or encode each article in isolation, missing the connection between a tech article read yesterday and a related financial article read today.
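
The arithmetic in this example can be checked with a short sketch. The 5K-token history and the 512-token limit come from the example; the helper name `truncation_stats` is hypothetical, and the session count assumes sessions are simply packed up to the encoder limit.

```python
import math

def truncation_stats(history_tokens, encoder_limit):
    """Fraction of a history lost to truncation, and sessions needed to keep all of it."""
    kept = min(history_tokens, encoder_limit)
    truncated_fraction = 1 - kept / history_tokens
    # Assumption: sessions are packed up to the encoder limit with no overlap.
    sessions_needed = math.ceil(history_tokens / encoder_limit)
    return truncated_fraction, sessions_needed

frac, sessions = truncation_stats(history_tokens=5000, encoder_limit=512)
print(f"truncated: {frac:.1%}, sessions needed: {sessions}")
# prints "truncated: 89.8%, sessions needed: 10"
```

Under these assumptions, roughly ten encoder passes cover the whole history instead of discarding about 90% of it.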

Key Novelty

Post-fusion with Sparse Poly-Attention for Recommendation (SPAR)

Encodes user history in sessions using a PLM to handle length, then aggregates them using a 'poly-attention' mechanism that projects thousands of tokens into a compact set of interest vectors
Applies strict sparsity rules (local window, global landmarks, random sampling) to the attention mechanism, allowing the model to attend to very long sequences without memory overflow or entropy collapse
Augments the raw history with a natural language summary of the user's global interests generated by an LLM (Llama-2), providing a high-level semantic anchor
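
A minimal sketch of the three sparsity rules, written as the set of token positions each attention code may look at. This is an illustration under stated assumptions, not the paper's implementation: the function name `build_sparse_mask`, the choice of the first token of each session as a global landmark, and the even spacing of local windows across codes are all hypothetical.

```python
import random

def build_sparse_mask(n_codes, n_tokens, window, session_len, n_random, seed=0):
    """Return a list of sets; entry c holds the token positions code c may attend to."""
    rng = random.Random(seed)
    # Assumption: the first token of every session acts as a global landmark.
    landmarks = set(range(0, n_tokens, session_len))
    last_start = max(0, n_tokens - window)
    # Assumption: local windows are spread evenly over the sequence, one per code.
    stride = last_start // max(1, n_codes - 1)
    allowed = []
    for c in range(n_codes):
        start = min(c * stride, last_start)
        local = set(range(start, min(start + window, n_tokens)))
        # Random sampling draws only from positions not already covered.
        rest = [t for t in range(n_tokens) if t not in local and t not in landmarks]
        sampled = set(rng.sample(rest, min(n_random, len(rest))))
        allowed.append(local | landmarks | sampled)
    return allowed

mask = build_sparse_mask(n_codes=4, n_tokens=100, window=10, session_len=25, n_random=5)
print([len(m) for m in mask])  # each code sees far fewer than 100 positions
```

Each code then computes attention over at most `window + number_of_landmarks + n_random` positions, which is what keeps memory bounded as the history grows.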

Architecture

Overview of the SPAR framework showing the data flow from user history texts to final relevance score.

Evaluation Highlights

Outperforms SOTA method UNBERT by +1.48 AUC on the MIND news recommendation dataset
Achieves +1.15 AUC improvement over UNBERT on the Goodreads book recommendation dataset
Maintains superior performance even as user history length increases to 60 items, whereas baselines like MINER and UNBERT plateau or degrade

Breakthrough Assessment

7/10

Solid architectural advancement solving the specific 'long context' problem in content recommendation. Effectively combines modern LLM summarization with efficient attention mechanisms to beat strong PLM baselines.

⚙️ Technical Details

Problem Definition

Setting: Content-based CTR (Click-Through Rate) prediction

Inputs: User engagement history sequence E (titles, abstracts) and a candidate item e_j

Outputs: Relevance score s_{i,j} indicating probability of user i clicking item j

Pipeline Flow

Data Prep: Group user history into sessions + Generate LLM Summary
Shared Encoder: Encode history sessions and candidate item via PLM
User Side: User History Summarizing (UHS) -> User Interest Extracting (UIE)
Item Side: Candidate Content Summarizing (CCS)
Interaction: Weighted Dot Product -> Score
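
The data-prep step of grouping history into sessions can be sketched as a greedy packing of items under a token budget. The function name `group_into_sessions` and the whitespace token count are assumptions for illustration; a real pipeline would count tokens with the PLM's own tokenizer, and the paper may define sessions differently.

```python
def group_into_sessions(items, max_tokens=512):
    """Greedily pack item texts into sessions that stay within max_tokens."""
    sessions, current, used = [], [], 0
    for text in items:
        n = len(text.split())  # crude stand-in for a subword tokenizer count
        if current and used + n > max_tokens:
            sessions.append(current)
            current, used = [], 0
        # An item longer than max_tokens still gets a session of its own here;
        # truncating it is left to the encoder.
        current.append(text)
        used += n
    if current:
        sessions.append(current)
    return sessions

history = ["a b c", "d e", "f g h i", "j"]
print(group_into_sessions(history, max_tokens=5))
# prints [['a b c', 'd e'], ['f g h i', 'j']]
```

Items inside one session are encoded together, so the encoder can relate them to each other, while the number of tokens per PLM call stays bounded.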

System Modules

LLM Profiler

Generate a natural language summary of global user interests from history

Model or implementation: Llama-2-70B-Chat

Content Encoder

Encode text tokens into dense vectors

Model or implementation: RoBERTa-base (Shared)

User History Summarizing (UHS) (Aggregation)

Aggregate token embeddings from all sessions into k content embeddings

Model or implementation: Sparse Poly-Attention Layer

User Interest Extracting (UIE) (Aggregation)

Distill content embeddings into final user interest vectors

Model or implementation: Poly-Attention Layer

Candidate Content Summarizing (CCS) (Aggregation)

Generate multiple representations for the candidate item

Model or implementation: Poly-Attention Layer
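
The aggregation modules above share one idea: a small set of query vectors (codes) pools a long sequence of embeddings into a few vectors. The sketch below shows the plain form of that pooling in pure Python. It is a simplified illustration: in a trained model the codes are learned parameters, and the paper's layers may apply additional projections that are omitted here. The `allowed` argument is a hypothetical hook for a sparse variant, restricting each code to a subset of positions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def poly_attention(codes, tokens, allowed=None):
    """Pool n token vectors into len(codes) vectors, one attention-weighted sum per code."""
    dim = len(tokens[0])
    pooled = []
    for c, code in enumerate(codes):
        # Positions this code may attend to (every position in the dense case).
        positions = sorted(allowed[c]) if allowed is not None else list(range(len(tokens)))
        scores = [sum(q * h for q, h in zip(code, tokens[t])) for t in positions]
        weights = softmax(scores)
        pooled.append([
            sum(w * tokens[t][d] for w, t in zip(weights, positions))
            for d in range(dim)
        ])
    return pooled

codes = [[1.0, 0.0], [0.0, 1.0]]                # 2 codes, fixed here for illustration
tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # 3 token embeddings of dimension 2
print(poly_attention(codes, tokens))             # 2 pooled vectors of dimension 2
```

The output size depends only on the number of codes, not on the sequence length, which is why stacking such layers can compress thousands of tokens into a handful of interest vectors.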

Novel Architectural Elements

Hierarchical aggregation pipeline: Session-PLM -> Sparse Poly-Attention (UHS) -> Poly-Attention (UIE)
Integration of LLM-generated summaries directly into the embedding sequence input
Three-way sparse attention strategy (local+global+random) specifically applied to codebook attention for recommendation

Modeling

Base Model: RoBERTa-base

Training Method: End-to-end Supervised Learning

Objective Functions:

Purpose: Maximize similarity between user and positive item embeddings while minimizing similarity to negatives.

Formally: NCE (Noise Contrastive Estimation) loss.
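
The summary does not spell out the formula, so the sketch below uses the softmax-style form commonly used for this objective in news recommendation: the negative log-probability of the positive item among itself and its sampled negatives. Treat the exact form and the function name `nce_loss` as assumptions.

```python
import math

def nce_loss(pos_score, neg_scores):
    """-log( exp(s_pos) / (exp(s_pos) + sum_j exp(s_neg_j)) ), computed stably."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max before exponentiating to avoid overflow
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - pos_score

# One positive and four negatives, matching the MIND negative sampling ratio of 4.
print(nce_loss(2.0, [0.5, 0.1, -0.3, 0.0]))
```

When all five scores are equal the loss is log(5), the value for a uniform guess; it approaches zero as the positive score pulls away from the negatives.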

Training Data:

MIND (Small): 50k users, 161k logs
Goodreads: 200k users, 1.9M logs

Key Hyperparameters:

CCS_codebook_size: 4
UHS_local_window_size: 512 (MIND), 256 (Goodreads)
user_history_length: 60 items
embedding_dimension: 200
negative_sampling_ratio: 4 (MIND), 2 (Goodreads)

Compute: On 8x A100 GPUs, SPAR trains in 3 hours/epoch versus 8.5 hours/epoch for SPAR-Longformer

Comparison to Prior Work

vs. UNBERT/MINER: SPAR aggregates token-level embeddings from history rather than just [CLS] tokens, preserving fine-grained details
vs. Longformer: SPAR uses session-based encoding plus sparse poly-attention, which trains about 2.8x faster (3 vs. 8.5 hours/epoch) and performs better
vs. UniTRec: SPAR supports pre-computable standalone embeddings, whereas UniTRec uses a candidate-aware decoder that prevents efficient retrieval

Limitations

Relies solely on textual features, ignoring other modalities or ID-based features
The RoBERTa-base encoder may be too slow for real-time inference compared to lightweight CNN/RNN models
LLM summaries may contain hallucinations or biases which could propagate to recommendations

Reproducibility

The paper does not link to a code release. The datasets (MIND, Goodreads) are public. The use of Llama-2-70B is described, but the model weights and prompts are not explicitly linked.

📊 Experiments & Results

Evaluation Setup

Predict future clicks/ratings based on history

Benchmarks:

MIND (Small) (News Recommendation)
Goodreads (Book Recommendation)

Metrics:

AUC
MRR
nDCG@5
nDCG@10
Statistical methodology: t-test (p < 0.02 for MIND, p < 0.05 for Goodreads)
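
The ranking metrics listed above can be computed for a single impression (one user's candidate list with click labels) roughly as follows. This is a sketch with hypothetical function names; the MRR variant shown averages reciprocal ranks over all clicked items, which is one common convention, and the paper's exact evaluation script may differ.

```python
import math

def auc(labels, scores):
    """Probability that a clicked item is scored above a non-clicked one (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def _ranked_labels(labels, scores):
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]

def mrr(labels, scores):
    """Mean of 1/rank over clicked items, with rank 1 being the top-scored item."""
    ranked = _ranked_labels(labels, scores)
    return sum(l / (rank + 1) for rank, l in enumerate(ranked)) / sum(ranked)

def ndcg_at_k(labels, scores, k):
    """DCG of the top-k ranking divided by the DCG of an ideal ranking."""
    ranked = _ranked_labels(labels, scores)
    ideal = sorted(labels, reverse=True)
    dcg = sum(l / math.log2(r + 2) for r, l in enumerate(ranked[:k]))
    idcg = sum(l / math.log2(r + 2) for r, l in enumerate(ideal[:k]))
    return dcg / idcg

labels = [1, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.1]
print(auc(labels, scores), mrr(labels, scores), ndcg_at_k(labels, scores, 5))
```

Per-impression values are typically averaged over all impressions in the test set; each impression needs at least one clicked and one non-clicked item for these functions to be defined.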

Key Results

SPAR achieves state-of-the-art performance on both news and book recommendation benchmarks compared to strong PLM-based baselines.
Benchmark	Metric	Baseline	This Paper	Δ
MIND (Small)	AUC	71.73	73.21	+1.48
MIND (Small)	nDCG@10	42.92	44.01	+1.09
Goodreads	AUC	61.40	62.55	+1.15
Goodreads	MRR	73.34	73.97	+0.63
Ablation studies demonstrate the critical importance of the User History Summarizing (UHS) layer and the sparse attention mechanism.
MIND (Small)	AUC	72.15	73.21	+1.06
MIND (Small)	AUC	72.70	73.21	+0.51
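
As a quick consistency check, the Δ column can be recomputed from the Baseline and This Paper columns. The numbers below are copied from the rows above; the helper name `recompute_deltas` is hypothetical.

```python
# (benchmark, metric, baseline, this_paper, reported_delta), copied from the table rows
rows = [
    ("MIND (Small)", "AUC", 71.73, 73.21, 1.48),
    ("MIND (Small)", "nDCG@10", 42.92, 44.01, 1.09),
    ("Goodreads", "AUC", 61.40, 62.55, 1.15),
    ("Goodreads", "MRR", 73.34, 73.97, 0.63),
    ("MIND (Small)", "AUC", 72.15, 73.21, 1.06),  # ablation row
    ("MIND (Small)", "AUC", 72.70, 73.21, 0.51),  # ablation row
]

def recompute_deltas(rows):
    """Difference between the two score columns, rounded to the table's two decimals."""
    return [round(ours - base, 2) for _, _, base, ours, _ in rows]

for row, delta in zip(rows, recompute_deltas(rows)):
    status = "ok" if delta == row[4] else "mismatch"
    print(row[0], row[1], f"+{delta:.2f}", status)
```

All six reported differences match the recomputed values.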

Experiment Figures

Performance (AUC) comparison between SPAR, MINER, and UNBERT as the length of user engagement history increases (10 to 60 items).

Main Takeaways

SPAR consistently outperforms both traditional neural methods (NAML, NRMS) and PLM-based methods (UNBERT, MINER) across metrics.
The User History Summarizing (UHS) layer is the most critical component; its removal causes the largest drop in ablation studies.
Sparse attention improves both efficiency and performance relative to full attention on long sequences; the performance gain is associated with lower entropy in the attention distribution.
The model scales well with history length, showing increasing gaps over baselines as user history grows from 10 to 60 items.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Self-Attention)
Pretrained Language Models (BERT/RoBERTa)
Content-based filtering concepts
Attention mechanisms (Poly-attention/Codebook attention)

Key Terms

PLM: Pretrained Language Model—models like BERT or RoBERTa trained on massive text corpora to understand language semantics

Poly-attention: Also known as codebook-based attention; a mechanism that uses a set of learnable query vectors (codes) to extract multiple distinct representations from a sequence

LLM: Large Language Model—generative models like Llama-2 used here to summarize user history

AUC: Area Under the ROC Curve—a metric measuring the ability of a classifier to distinguish between positive (clicked) and negative (not clicked) items

NCE loss: Noise Contrastive Estimation loss—a training objective that teaches the model to distinguish the true target item from randomly sampled negative items

Session-based encoding: Breaking a long sequence into smaller chunks (sessions) to be encoded separately, reducing computational complexity

Attention Sparsity: Restricting the attention mechanism to look only at specific tokens (e.g., neighbors or global markers) rather than all tokens, reducing compute cost

nDCG: Normalized Discounted Cumulative Gain—a ranking metric that gives more credit for correctly ranking highly relevant items at the top of the list