End-to-end Training for Recommendation with Language-based User Profiles

📝 Paper Summary

Language-based User Profiles, LLM-based Recommendation

LangPTune optimizes LLMs to generate language-based user profiles by treating the downstream recommender's performance as a reward signal, enabling end-to-end training that outperforms zero-shot approaches.

Core Problem

Existing language-based recommendation methods rely on zero-shot or few-shot LLM inference to generate user profiles, which are not optimized for the specific recommendation objective and often yield suboptimal performance.

Why it matters:

Language-based profiles offer transparency and scrutability that opaque embedding vectors lack, allowing users to understand and steer recommendations
Current methods fail to align the natural language profile generation with the actual ranking capability of the downstream system
There is a trade-off between the interpretability of text profiles and the accuracy of dense embeddings; bridging this gap is crucial for adoption

Concrete Example: A zero-shot LLM might summarize a user's history as 'likes sci-fi movies', which is human-readable but too generic for a recommender to distinguish between 'Star Wars' and 'Interstellar'. LangPTune trains the LLM to generate a profile like 'prefers space operas with high action', which the recommender can use to rank items more effectively.

Key Novelty

Reinforcement Learning for System Optimization (RLSO)

Treats the recommendation system (decoder) as an environment that provides feedback (ranking quality) to the LLM (encoder)
Iteratively updates the LLM to generate profiles that maximize downstream ranking performance using a reinforcement learning approach similar to RLHF but with system feedback instead of human preferences

Architecture

The inference pipeline of LangPTune. A profile encoder (LLM) takes item history to generate a text profile, which a recommender decoder uses to rank items.

Evaluation Highlights

LangPTune with Llama-3-8B-it outperforms the best zero-shot baseline by 17.5% relative on the pixel dataset (Recall@10: 0.0890 to 0.1046)
Matches or exceeds the performance of state-of-the-art embedding-based methods (SASRec) on 2 out of 3 datasets while maintaining interpretability
User studies confirm that profiles optimized via RLSO maintain human readability and interpretability comparable to zero-shot profiles

Breakthrough Assessment

8/10

First end-to-end training pipeline for language-based user profiles that successfully bridges the gap between interpretability and recommendation performance.

⚙️ Technical Details

Problem Definition

Setting: Sequential recommendation using natural language profiles

Inputs: User interaction history h (sequence of items)

Outputs: Ranked list of items r

Pipeline Flow

Profile Encoder (LLM) generates profile p from history h
Recommender Decoder computes similarity between p and item metadata D
Decoder produces ranked list r
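The three steps above can be sketched as a minimal, runnable toy. Everything here is an assumption made for illustration: `embed` is a hashed bag-of-words stand-in for a real embedding model, `generate_profile` is a stub where an LLM call would go, and `rank_items` uses a plain dot product of normalized vectors as the similarity.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a text embedding model (hypothetical): hashed bag of words."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[sum(ord(c) for c in token) % dim] += 1.0  # deterministic toy hash
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def generate_profile(history: list[str]) -> str:
    """Placeholder for the LLM profile encoder: returns p given h.
    A real system would prompt an LLM here; this stub only joins the titles."""
    return " ".join(history)

def rank_items(profile: str, item_metadata: list[str]) -> list[int]:
    """Decoder step: score each item in D against p and return indices, best first."""
    p_vec = embed(profile)
    scores = np.array([embed(meta) @ p_vec for meta in item_metadata])
    return [int(i) for i in np.argsort(-scores, kind="stable")]

history = ["space opera action"]
catalog = ["romantic comedy drama", "space opera action adventure", "cooking documentary"]
ranking = rank_items(generate_profile(history), catalog)
print(ranking)  # item indices, most similar first
```

The point of the sketch is the interface: the only thing passed from encoder to decoder is a text string, which is what keeps the intermediate representation readable.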

System Modules

Profile Encoder

Generate a natural language description of user preferences based on interaction history

Model or implementation: Gemma-2B-it or Llama-3-8B-it

Recommender Decoder

Rank items based on the semantic similarity between the generated profile and item metadata

Model or implementation: Mxbai-embed-large-v1

Novel Architectural Elements

Joint optimization loop where the encoder (LLM) is trained via RL using rewards from the decoder, while the decoder is trained via contrastive learning on the encoder's outputs

Modeling

Base Model: Llama-3-8B-it or Gemma-2B-it (Encoder), Mxbai-embed-large-v1 (Decoder)

Training Method: Alternating optimization: Contrastive Learning for Decoder, RLSO for Encoder

Objective Functions:

Purpose: Optimize the encoder to maximize recommendation reward while staying close to the reference policy.

Formally: Squared loss objective derived from the inverted relationship between ranking quality and policy probability.
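One standard way a squared loss of this kind arises is sketched below. This is an assumption based on the usual KL-regularized RL derivation, not a transcription of the paper's equations. The symbols are illustrative: $\pi_\theta$ is the encoder, $\pi_{\mathrm{ref}}$ the reference policy, $r(h,p)$ the ranking reward for profile $p$ given history $h$, $\beta$ the KL weight, and $Z(h)$ a normalizer. The closed-form optimum of the regularized objective is inverted to express the reward through the policy, and differencing two sampled profiles $p, p'$ cancels $Z(h)$.

```latex
\begin{aligned}
&\max_{\pi}\; \mathbb{E}_{p \sim \pi(\cdot \mid h)}\big[r(h,p)\big]
  - \beta\,\mathrm{KL}\big(\pi(\cdot \mid h)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid h)\big) \\
&\Rightarrow\; \pi^{*}(p \mid h)
  = \frac{\pi_{\mathrm{ref}}(p \mid h)\,\exp\!\big(r(h,p)/\beta\big)}{Z(h)} \\
&\Rightarrow\; r(h,p)
  = \beta \log\frac{\pi^{*}(p \mid h)}{\pi_{\mathrm{ref}}(p \mid h)} + \beta \log Z(h) \\
&\mathcal{L}(\theta)
  = \mathbb{E}_{h,\,p,\,p'}\Big[\Big(
      \beta\Big[\log\frac{\pi_\theta(p \mid h)}{\pi_{\mathrm{ref}}(p \mid h)}
      - \log\frac{\pi_\theta(p' \mid h)}{\pi_{\mathrm{ref}}(p' \mid h)}\Big]
      - \big(r(h,p) - r(h,p')\big)\Big)^{2}\Big]
\end{aligned}
```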

Purpose: Optimize the decoder to align profile and item embeddings.

Formally: InfoNCE contrastive loss.
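For the decoder objective, a minimal NumPy sketch of InfoNCE with in-batch negatives may help. It assumes that row i of `profile_emb` is the positive match for row i of `item_emb`, and the temperature value is an arbitrary illustrative choice; neither detail is taken from the paper.

```python
import numpy as np

def info_nce(profile_emb: np.ndarray, item_emb: np.ndarray,
             temperature: float = 0.05) -> float:
    """InfoNCE with in-batch negatives: row i of profile_emb pairs with row i of item_emb."""
    # Cosine similarity via L2-normalized rows.
    p = profile_emb / np.linalg.norm(profile_emb, axis=1, keepdims=True)
    v = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    logits = (p @ v.T) / temperature
    # Subtract the row max before exponentiating for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy where the correct "class" for row i is column i.
    return float(-np.mean(np.diag(log_softmax)))

aligned = np.eye(3)
shuffled = np.roll(np.eye(3), 1, axis=0)
print(info_nce(aligned, aligned), info_nce(aligned, shuffled))
```

The loss is small when each profile is closest to its own item and large when the pairing is scrambled, which is the behavior the contrastive step relies on.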

Adaptation: LoRA (Low-Rank Adaptation) for the LLM encoder

Training Data:

Datasets: Amazon-M2 (Movies), pixel (Art), steam (Games)
Split: Chronological split (80/10/10 for train/val/test)
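A chronological 80/10/10 split can be sketched as follows. The summary does not say whether the split is global or per user, so this example assumes a single global ordering by timestamp; the `(user, item, timestamp)` tuple layout and the function name are hypothetical.

```python
def chronological_split(interactions, train_frac=0.8, val_frac=0.1):
    """Split (user, item, timestamp) tuples by time: oldest to train, newest to test."""
    ordered = sorted(interactions, key=lambda row: row[2])
    n = len(ordered)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]

# Ten toy interactions with out-of-order timestamps.
toy = [("u1", f"item{t}", t) for t in [5, 2, 9, 0, 7, 1, 8, 3, 6, 4]]
train, val, test = chronological_split(toy)
print(len(train), len(val), len(test))
```

Splitting by time rather than at random keeps future interactions out of the training set, which matters for sequential recommendation.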

Key Hyperparameters:

learning_rate: 1e-5 (Encoder), 1e-5 (Decoder)
batch_size: 64 (Encoder), 256 (Decoder)
beta_KL: 0.1
lora_r: 16
lora_alpha: 32
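The listed hyperparameters can be gathered into a small configuration object. The class and field names below are hypothetical; only the values come from the list above. The `lora_scaling` property reflects the standard LoRA convention of scaling the low-rank update by alpha / r.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters listed above."""
    encoder_lr: float = 1e-5
    decoder_lr: float = 1e-5
    encoder_batch_size: int = 64
    decoder_batch_size: int = 256
    beta_kl: float = 0.1
    lora_r: int = 16
    lora_alpha: int = 32

    @property
    def lora_scaling(self) -> float:
        """Standard LoRA scaling factor applied to the low-rank update: alpha / r."""
        return self.lora_alpha / self.lora_r

config = TrainConfig()
print(config.lora_scaling)
```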

Compute: Single A100 (80GB) GPU for training

Comparison to Prior Work

vs. P5: LangPTune generates explicit interpretable profiles rather than just item IDs
vs. SASRec: LangPTune uses natural language intermediate representations, offering interpretability
vs. LLM-ZeroShot: LangPTune explicitly trains the profile generator for the recommendation objective [not cited in paper but implied baseline class]
vs. TallRec: LangPTune focuses on profile generation optimization rather than tuning the LLM as a direct recommender [not cited in paper]

Limitations

Inference latency is higher than pure ID-based methods due to LLM generation
Requires high-quality item metadata (titles, descriptions) to work effectively
Dependency on the quality of the base LLM for initial profile coherence

Reproducibility

Code: https://github.com/ZhaolinGao/LangPTune

The code is publicly available at the repository above. Datasets are public. Hyperparameters are detailed in the paper.

📊 Experiments & Results

Evaluation Setup

Sequential recommendation on three public datasets

Benchmarks:

Amazon-M2 (Movie recommendation)
pixel (Digital art recommendation)
steam (Video game recommendation)

Metrics:

Recall@10
NDCG@10
MRR
Statistical methodology: Not explicitly reported in the paper
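The three ranking metrics listed above can be computed per user as in the sketch below, using binary relevance. The function names are illustrative, and the paper's exact evaluation code may differ in details such as tie handling or candidate sampling.

```python
import math

def recall_at_k(ranked, relevant, k=10):
    """Fraction of the user's relevant items that appear in the top-k."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    relevant = set(relevant)
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is ranked."""
    relevant = set(relevant)
    for pos, item in enumerate(ranked):
        if item in relevant:
            return 1.0 / (pos + 1)
    return 0.0

ranked = ["a", "b", "c", "d"]
relevant = {"c"}
print(recall_at_k(ranked, relevant, k=3), ndcg_at_k(ranked, relevant, k=3),
      reciprocal_rank(ranked, relevant))
```

MRR is then the mean of `reciprocal_rank` over all evaluated users, and the reported Recall@10 and NDCG@10 are the corresponding per-user means.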

Key Results

LangPTune significantly outperforms zero-shot and few-shot language-based baselines across all datasets.

Benchmark	Metric	Baseline	This Paper	Δ
pixel	Recall@10	0.0890	0.1046	+0.0156
steam	NDCG@10	0.0617	0.0754	+0.0137

LangPTune achieves comparable performance to state-of-the-art ID-based embedding methods (SASRec).

Benchmark	Metric	Baseline (SASRec)	This Paper	Δ
pixel	Recall@10	0.0969	0.1046	+0.0077
Amazon-M2	NDCG@10	0.1385	0.1264	-0.0121
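The absolute deltas in the rows above can be converted to relative changes with a short calculation. The helper name is illustrative; the numbers are copied from the table rows.

```python
def relative_change(baseline: float, ours: float) -> float:
    """Relative change of `ours` with respect to `baseline`."""
    return (ours - baseline) / baseline

rows = [
    ("pixel", "Recall@10", "language baseline", 0.0890, 0.1046),
    ("steam", "NDCG@10", "language baseline", 0.0617, 0.0754),
    ("pixel", "Recall@10", "SASRec", 0.0969, 0.1046),
    ("Amazon-M2", "NDCG@10", "SASRec", 0.1385, 0.1264),
]
for dataset, metric, baseline_name, base, ours in rows:
    print(f"{dataset} {metric} vs {baseline_name}: {relative_change(base, ours):+.1%}")
```

The first row reproduces the 17.5% relative gain quoted in the evaluation highlights, and the last row shows LangPTune trailing SASRec by roughly 9% relative on Amazon-M2.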

Experiment Figures

Illustration of the RLSO training process. Sampling multiple profiles, getting rewards from the decoder, and updating the encoder.
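The reward step in this process can be illustrated with precomputed embeddings. This is a sketch under stated assumptions: the summary says only that the reward reflects ranking quality, so the reciprocal rank of a held-out target item is used here as one plausible choice, and the function name and array shapes are hypothetical.

```python
import numpy as np

def profile_rewards(profile_embs: np.ndarray, item_embs: np.ndarray,
                    target_idx: int) -> np.ndarray:
    """Reward each of K sampled profiles by the reciprocal rank of the target item."""
    scores = profile_embs @ item_embs.T          # (K, N) profile-item similarities
    target_scores = scores[:, [target_idx]]      # (K, 1) score of the held-out item
    # 1-based rank: one plus the number of items scored strictly above the target.
    ranks = 1 + (scores > target_scores).sum(axis=1)
    return 1.0 / ranks

item_embs = np.eye(3)                            # three toy items
profile_embs = np.array([[1.0, 0.0, 0.0],        # profile closest to item 0
                         [0.0, 1.0, 0.0]])       # profile closest to item 1
rewards = profile_rewards(profile_embs, item_embs, target_idx=0)
best, worst = int(np.argmax(rewards)), int(np.argmin(rewards))
print(rewards, best, worst)
```

Pairs of sampled profiles with different rewards, such as `best` and `worst` here, are the kind of signal an encoder update could be driven by.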

Main Takeaways

End-to-end training (RLSO) bridges the performance gap between interpretable language profiles and opaque embedding vectors.
LangPTune consistently outperforms other language-based approaches (zero-shot, few-shot, prompt tuning).
Performance is robust across different backbone LLMs (Gemma-2B, Llama-3-8B), with larger models generally performing better.
Qualitative analysis and user studies show that the optimized profiles remain human-readable and interpretable, despite being trained for machine performance.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Contrastive Learning
Recommender Systems (Sequential Recommendation)
Large Language Models

Key Terms

RLSO: Reinforcement Learning for System Optimization—the proposed algorithm that optimizes the LLM based on feedback (rewards) from the recommendation system

Profile Encoder: An LLM that maps user interaction history to a natural language profile

Recommender Decoder: A model (often embedding-based) that takes a user profile and item metadata to generate a ranked list of items

Mxbai: A state-of-the-art text embedding model used as the backbone for the recommender decoder

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items

MRR: Mean Reciprocal Rank, the average over users of 1/rank of the first relevant item in the ranked list (higher is better)

Recall@K: The fraction of a user's relevant (held-out) items that appear in the top-K recommendations

InfoNCE: A contrastive loss function used to learn representations by maximizing agreement between positive pairs and minimizing it between negative pairs

KL divergence: Kullback-Leibler divergence, a measure of how one probability distribution differs from a reference distribution; here it is the penalty (weighted by beta_KL) that keeps the tuned encoder close to the reference policy