EmbSum: Leveraging the Summarization Capabilities of Large Language Models for Content-Based Recommendations

📝 Paper Summary

User Modeling, Content-Based Recommendation

EmbSum improves content recommendations by encoding users and items into multiple embedding vectors and by training the user encoder on an auxiliary task: reproducing LLM-generated summaries of the user's interests.

Core Problem

Existing content-based recommenders either truncate user history to fit memory limits (losing long-term interests) or encode items independently (losing interaction context between items).

Why it matters:

Truncating history to ~1K tokens prevents systems from understanding a user's comprehensive long-term preferences
Online calculation strategies (concatenating user and candidate items) prevent efficient offline pre-computation, making inference too slow for real-world scale
Independent encoding fails to capture how a user's interest in one item (e.g., 'NBA') relates to another (e.g., 'Sneakers') within their history

Concrete Example: A user has browsed 60 news items, totaling 7,440 tokens. Standard BERT-based models truncate this to 512 or 1024 tokens, ignoring early history. EmbSum encodes all 60 items in chunks and fuses them by learning to generate a text summary (e.g., 'Interests: Sci-Fi and Cooking') supervised by an LLM.
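The chunking step in this example can be sketched as follows. This is a minimal illustration, assuming a greedy grouping of consecutive items under a per-session token budget. The function name, the 512-token budget, and the uniform 124 tokens per item (chosen only so that 60 items total 7,440 tokens) are assumptions for illustration, not details taken from the paper.

```python
from typing import List


def chunk_into_sessions(item_token_counts: List[int],
                        max_tokens_per_session: int) -> List[List[int]]:
    """Greedily group consecutive item indices into sessions under a token budget."""
    sessions: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx, n_tokens in enumerate(item_token_counts):
        # Close the current session when adding this item would exceed the budget.
        # An item longer than the budget still gets a session of its own.
        if current and used + n_tokens > max_tokens_per_session:
            sessions.append(current)
            current, used = [], 0
        current.append(idx)
        used += n_tokens
    if current:
        sessions.append(current)
    return sessions


# Hypothetical history: 60 items of 124 tokens each (7,440 tokens in total).
history = [124] * 60
sessions = chunk_into_sessions(history, max_tokens_per_session=512)
covered_items = [idx for session in sessions for idx in session]
```

Every item ends up in exactly one session, so nothing from the early history is dropped. Each session would then be encoded independently before the fusion step.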

Key Novelty

EmbSum (Embedding and Summarization)

User Poly-Embedding (UPE): Instead of a single vector, users are represented by multiple vectors derived via poly-attention to capture diverse interest facets
LLM-Supervised Summarization: Uses a large model (Mixtral) to generate 'gold' summaries of user history, then trains a smaller model's decoder to reproduce these summaries as an auxiliary task to force better representation learning

Architecture

The dual-branch architecture of EmbSum. Top branch: User history encoding via sessions, leading to UPE (Poly-Embedding) and a Summarization Decoder. Bottom branch: Candidate content encoding leading to CPE. Right side: Interaction via attention matching.

Evaluation Highlights

Outperforms SoTA UNBERT by +0.22 AUC on MIND dataset while using ~50% fewer parameters (61M vs 125M)
Achieves highest ranking accuracy (MRR 38.58) on MIND, surpassing MINER (38.10) and UniTRec (37.62)
Generating multiple embeddings for items (CPE) improves AUC by +3.78 on MIND compared to single-vector item representation

Breakthrough Assessment

7/10

Effective combination of parameter-efficient architecture (T5-small) with LLM supervision to beat heavier baselines. Incremental but consistent gains; strong practical value due to offline inference capability.

⚙️ Technical Details

Problem Definition

Setting: Click-Through Rate (CTR) prediction for content recommendation

Inputs: User history sequence E (k engaged items) and a candidate item e_j

Outputs: Relevance score s_ij indicating likelihood of engagement

Pipeline Flow

Session Encoding (T5 Encoder processes history chunks)
Summarization Branch (T5 Decoder generates text summary)
Embedding Branch (Poly-Attention generates UPE and CPE vectors)
Interaction (Attention matching between UPE and CPE)
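The final interaction step can be illustrated with a toy scorer. This sketch assumes a parameter-free form of attention matching: every user vector is compared with every item vector by dot product, and the pairwise scores are combined with softmax weights. The actual model likely uses learned parameters in this step; the function names and the toy vectors below are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relevance_score(upe: List[Vector], cpe: List[Vector]) -> float:
    """Attention-weighted sum of all user-vector / item-vector dot products."""
    pair_scores = [dot(u, c) for u in upe for c in cpe]
    weights = softmax(pair_scores)  # stand-in for learned attention weights
    return sum(w * s for w, s in zip(weights, pair_scores))


# Toy example: a user with two interest vectors and two single-vector candidates.
user_vectors = [[1.0, 0.0], [0.0, 1.0]]
aligned_item = [[1.0, 0.0]]
opposed_item = [[-1.0, 0.0]]
aligned = relevance_score(user_vectors, aligned_item)
opposed = relevance_score(user_vectors, opposed_item)
```

Because the user and item vectors only meet in this last step, both sets of vectors can be computed offline and cached, which is the efficiency argument made for this design.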

System Modules

Session Encoder

Encodes user history chunks (sessions) independently to handle long sequences

Model or implementation: T5-small Encoder (shared)

Summarization Decoder

Fuses session embeddings by generating a natural language summary of interests (Auxiliary Task)

Model or implementation: T5-small Decoder

Poly-Attention Layer

Projects encoded features into multiple distinct embedding vectors

Model or implementation: Learnable context codes (matrix W)
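A poly-attention layer of this kind can be sketched in a few lines. The version below is a simplification that assumes plain dot-product scoring between each context code and each token vector; the paper's formulation may include an extra projection or nonlinearity. The codes are fixed toy values here, whereas in the model they would be learned rows of W.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def poly_attention(tokens: List[Vector], codes: List[Vector]) -> List[Vector]:
    """Return one attention-pooled vector per context code."""
    dim = len(tokens[0])
    pooled = []
    for code in codes:
        # Score every token against this code, then normalize over tokens.
        scores = [sum(c * t for c, t in zip(code, tok)) for tok in tokens]
        weights = softmax(scores)
        # The weighted average of token vectors is this code's embedding.
        pooled.append([sum(w * tok[d] for w, tok in zip(weights, tokens))
                       for d in range(dim)])
    return pooled


# Toy input: two token vectors and three context codes (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0]]
codes = [[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]]
embeddings = poly_attention(tokens, codes)
```

The number of output vectors equals the number of codes, so the UPE and CPE sizes correspond to how many codes each branch uses.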

Novel Architectural Elements

CPE (Content Poly-Embedding): Applying poly-attention to candidate items to generate multiple vectors per item, rather than the standard single [CLS] vector
Decoder-as-Fusion: Using the T5 decoder's summarization objective specifically to fuse independently encoded session chunks into a global user representation

Modeling

Base Model: T5-small (61M parameters)

Training Method: End-to-end training with multi-task loss (CTR prediction + Summarization)

Objective Functions:

Purpose: Distinguish clicked items from non-clicked items.

Formally: NCE loss (negative log-likelihood of the positive item's score, normalized against the exponentiated scores of the positive and negative items)

Purpose: Force the model to capture global user interests by generating a summary.

Formally: Standard Language Modeling loss (negative log-likelihood of generating target summary tokens)
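The two objectives can be combined as in the sketch below. It assumes the common softmax form of the NCE loss used in news recommendation, a mean token-level negative log-likelihood for the summary, and a weighted sum `ctr_loss + lambda * summarization_loss`. The function names and the way lambda enters the total are assumptions; only the value 0.05 comes from the hyperparameters reported for the paper.

```python
import math
from typing import List


def nce_loss(pos_score: float, neg_scores: List[float]) -> float:
    """-log( exp(s+) / (exp(s+) + sum_j exp(s-_j)) ), computed stably."""
    all_scores = [pos_score] + list(neg_scores)
    m = max(all_scores)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in all_scores))
    return log_denominator - pos_score


def summarization_loss(target_token_probs: List[float]) -> float:
    """Mean negative log-likelihood of the target summary tokens."""
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)


def total_loss(pos_score: float, neg_scores: List[float],
               target_token_probs: List[float], lam: float = 0.05) -> float:
    # Assumed combination: CTR loss plus a down-weighted auxiliary loss.
    return nce_loss(pos_score, neg_scores) + lam * summarization_loss(target_token_probs)


# Toy values: one clicked item, three sampled negatives, a four-token summary.
ctr = nce_loss(2.0, [0.5, 0.1, -1.0])
aux = summarization_loss([0.9, 0.6, 0.8, 0.7])
loss = total_loss(2.0, [0.5, 0.1, -1.0], [0.9, 0.6, 0.8, 0.7])
```

With a small weight such as 0.05, the summarization term acts as a regularizer on the user representation rather than dominating the click objective.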

Training Data:

MIND-small (94K users, 65K news)
Goodreads (50K users, 330K books)
User summaries generated by Mixtral-8x22B-Instruct for supervision

Key Hyperparameters:

learning_rate: 5e-4
batch_size: 128
epochs: 10
lambda (loss weight): 0.05
UPE_size: 32
CPE_size: 4
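For anyone attempting a reimplementation, the reported settings can be collected in a single configuration object. The class and field names below are hypothetical; only the values are taken from the list above and from the stated base model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbSumConfig:
    """Hypothetical container for the reported hyperparameters."""
    backbone: str = "t5-small"
    learning_rate: float = 5e-4
    batch_size: int = 128
    epochs: int = 10
    summarization_loss_weight: float = 0.05  # "lambda" in the list above
    upe_size: int = 32  # number of user embedding vectors
    cpe_size: int = 4   # number of item embedding vectors


config = EmbSumConfig()
```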

Compute: Parameters: 61M (T5-small). Training time/GPU not explicitly reported.

Comparison to Prior Work

vs. MINER: EmbSum applies poly-embeddings to *items* (CPE) as well, not just users
vs. UniTRec: EmbSum processes long history via session chunks + summarization fusion rather than truncating
vs. UNBERT: EmbSum allows offline pre-computation of embeddings; UNBERT requires online PLM inference

Limitations

Dependency on LLM-generated summaries for training (requires expensive preprocessing)
Relies on text content only (title/abstract), ignoring other modalities
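Performance gains over the strongest baselines are small in absolute terms (+0.22 to +0.48 points on AUC and MRR) despite the architectural novelty, and no statistical significance testing is reported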

Reproducibility

The paper does not indicate that code is available. Implementation details (learning rate, batch size, UPE and CPE sizes) are listed. Reproduction requires using Mixtral-8x22B-Instruct to generate synthetic summaries for the training set (prompts provided in Figure 2).

📊 Experiments & Results

Evaluation Setup

Click-through rate prediction on offline datasets

Benchmarks:

MIND-small (News Recommendation)
Goodreads (Book Recommendation)

Metrics:

AUC
MRR
nDCG@5
nDCG@10

Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison against state-of-the-art baselines on MIND and Goodreads datasets.

Benchmark	Metric	Baseline	This Paper	Δ
MIND	AUC	71.73	71.95	+0.22
Goodreads	AUC	61.40	61.64	+0.24
MIND	MRR	38.10	38.58	+0.48

Ablation studies determining the contribution of specific components.

Benchmark	Metric	Baseline	This Paper	Δ
MIND	AUC	68.17	71.95	+3.78
MIND	AUC	71.43	71.95	+0.52
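
As a quick arithmetic check, the Δ column and the parameter comparison quoted in the evaluation highlights (61M vs 125M) can be recomputed from the reported numbers. The variable names are illustrative only.

```python
# (benchmark, metric, baseline, this paper, reported delta)
rows = [
    ("MIND", "AUC", 71.73, 71.95, 0.22),
    ("Goodreads", "AUC", 61.40, 61.64, 0.24),
    ("MIND", "MRR", 38.10, 38.58, 0.48),
    ("MIND", "AUC", 68.17, 71.95, 3.78),
    ("MIND", "AUC", 71.43, 71.95, 0.52),
]

# Recompute each delta as (this paper - baseline), rounded to two decimals.
recomputed = [round(ours - base, 2) for _, _, base, ours, _ in rows]

# Fraction of parameters saved relative to the 125M-parameter baseline.
param_reduction = 1 - 61 / 125
```

The recomputed deltas agree with the table, and the parameter reduction comes out at about 51%, consistent with the "~50% fewer parameters" claim.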

Experiment Figures

Sensitivity analysis of hyperparameters: Summarization loss weight (lambda), CPE size, and UPE size.

Main Takeaways

Parameter Efficiency: Surpasses BERT-base models (125M params) using only T5-small (61M params).
Item Poly-Embeddings matter: Representing candidate items with multiple vectors (CPE) is critical, contributing more to performance than the summarization loss itself.
Summarization as Supervision: Training the decoder to summarize history improves the encoder's ability to fuse session information, even if the summary isn't used at inference time.

📚 Prerequisite Knowledge

Prerequisites

Transformer Encoder-Decoder architectures (specifically T5)
Attention mechanisms (Self-attention, Poly-attention)
Contrastive Learning (NCE Loss)

Key Terms

UPE: User Poly-Embedding—representing a user with multiple vectors to capture diverse interests (e.g., one vector for 'sports', one for 'cooking')

CPE: Content Poly-Embedding—representing a single item (like a news article) with multiple vectors to capture its different aspects

Poly-attention: An attention mechanism that extracts m global feature vectors from a sequence using m learnable query codes

NCE Loss: Noise Contrastive Estimation—a loss function that trains models to distinguish positive samples (real user clicks) from negative samples (non-clicks)

PLM: Pretrained Language Model—models like BERT or T5 trained on large text corpora

Mixtral: A sparse mixture-of-experts Large Language Model used here to generate ground-truth summaries for training supervision

T5: Text-to-Text Transfer Transformer—an encoder-decoder model architecture used as the backbone for EmbSum

AUC: Area Under the ROC Curve—a metric measuring the ability of the model to distinguish between clicked and non-clicked items

nDCG: Normalized Discounted Cumulative Gain—a ranking metric that gives higher weight to correct items appearing at the top of the list
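
The ranking metrics defined above can be computed for a single impression as in the sketch below. It assumes binary click labels, takes MRR as the reciprocal rank of the first clicked item (benchmark scripts may instead average over all clicked items), and uses the standard log2 discount for nDCG. Function names are illustrative, and the AUC function assumes at least one click and one non-click.

```python
import math
from typing import List


def auc(labels: List[int], scores: List[float]) -> float:
    """Fraction of (clicked, non-clicked) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ranked_labels(labels: List[int], scores: List[float]) -> List[int]:
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def mrr(labels: List[int], scores: List[float]) -> float:
    """Reciprocal rank of the first clicked item (0.0 if nothing was clicked)."""
    for rank, label in enumerate(ranked_labels(labels, scores), start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(labels: List[int], scores: List[float], k: int) -> float:
    def dcg(ordered: List[int]) -> float:
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(ordered[:k]))
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(ranked_labels(labels, scores)) / ideal if ideal > 0 else 0.0


# Toy impression: four candidates, two of which were clicked.
labels = [1, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.5]
impression_auc = auc(labels, scores)
impression_mrr = mrr(labels, scores)
impression_ndcg5 = ndcg_at_k(labels, scores, k=5)
```

Dataset-level numbers such as those in the results table are typically obtained by averaging these per-impression values.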