Language-Based User Profiles for Recommendation

📝 Paper Summary

Recommender Systems, Large Language Models (LLMs) for Recommendation

The paper proposes replacing high-dimensional latent vectors with human-readable natural language summaries as user profiles, generated and processed by Large Language Models (LLMs) for transparent recommendation.

Core Problem

Conventional recommendation methods like matrix factorization represent users as high-dimensional vectors, which are unintelligible to humans, difficult to edit (steer), and often perform poorly in cold-start settings.

Why it matters:

Lack of transparency requires post-hoc explanation methods rather than offering intrinsic interpretability
Users cannot directly correct or update their profiles to change recommendations (lack of steerability)
Standard methods struggle to make accurate predictions when user interaction history is sparse (cold-start)

Concrete Example: A matrix factorization model might represent a user as a vector `[0.1, -0.5, ...]` which is meaningless to the user. In contrast, this system generates text like 'User enjoys sci-fi movies but dislikes horror,' which the user can read and potentially edit.

Key Novelty

Language-Based Factorization Model (LFM)

Replaces latent vector embeddings with a compact natural language summary of the user's interests
Uses an Encoder LLM to synthesize rating history into a text profile
Uses a Decoder LLM to read the text profile and perform downstream tasks like rating prediction or pairwise preference

Architecture

Illustration of the Language-Based Factorization Model (LFM) pipeline.

Evaluation Highlights

LFM performs competitively with direct LLM prediction (no summary) across rating, preference, and choice tasks, showing that compact text profiles capture necessary information
LFM outperforms standard Matrix Factorization (NMF) in cold-start settings (sparse user history)
LFM provides better reliability (parse success rate) than direct LLM prediction, particularly with Llama 2 13B

Breakthrough Assessment

7/10

Offers a significant shift in representation learning (from latent vectors to text) with strong potential for interpretability and steerability, though the zero-shot approach currently trails fully trained methods that can exploit background data.

⚙️ Technical Details

Problem Definition

Setting: Recommendation based on past user interactions (ratings)

Inputs: User rating history (set of items and scores)

Outputs: Predicted rating for a new item, or preference choice between two items

Pipeline Flow

Encoder LLM (History → Text Profile)
Decoder LLM (Text Profile + Test Item → Prediction)
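
The two-stage flow above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `LLM` callable type, the function names `encode_profile` and `decode_rating`, the `max_words` parameter, the prompt wording, and the `fake_llm` stand-in are all hypothetical (the actual prompts are in the paper's Appendix C and are not reproduced here).

```python
from typing import Callable, Sequence, Tuple

# Hypothetical type: any function that maps a prompt string to generated text.
LLM = Callable[[str], str]
Rating = Tuple[str, float]  # (item title, score)


def encode_profile(llm: LLM, history: Sequence[Rating], max_words: int = 50) -> str:
    """Encoder step: summarize a rating history into a short text profile."""
    lines = "\n".join(f"- {title}: {score}" for title, score in history)
    prompt = (
        f"Here are a user's movie ratings:\n{lines}\n"
        f"Summarize this user's preferences in at most {max_words} words."
    )
    return llm(prompt).strip()


def decode_rating(llm: LLM, profile: str, item: str) -> str:
    """Decoder step: predict a response to a new item from the profile alone."""
    prompt = (
        f"User profile: {profile}\n"
        f"Predict this user's rating for the movie '{item}'. Answer with a number."
    )
    return llm(prompt).strip()


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs without any dependencies."""
    if prompt.startswith("User profile:"):
        return " 4 "
    return "User enjoys sci-fi movies but dislikes horror."


history = [("Blade Runner", 5.0), ("The Matrix", 4.5), ("Saw", 1.0)]
profile = encode_profile(fake_llm, history)       # History -> Text Profile
prediction = decode_rating(fake_llm, profile, "Arrival")  # Profile + Item -> Prediction
print(profile)
print(prediction)
```

Note that the decoder never sees the raw history, only the profile string. That is what makes the profile the single, editable point of control: a user who changes the text changes what the decoder conditions on.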

System Modules

Encoder

Summarize user rating history into a natural language description of preferences

Model or implementation: Llama 2 (7B/13B) or Sakura-SOLAR 10.7B

Decoder

Predict user response to new items based on the generated text profile

Model or implementation: Llama 2 (7B/13B) or Sakura-SOLAR 10.7B

Novel Architectural Elements

Replacement of latent vector user embeddings with generated natural language text

Modeling

Base Model: Llama 2 7B, Llama 2 13B, Sakura-SOLAR 10.7B

Key Hyperparameters:

temperature: 0.6
top_p: 0.9
top_k: 50
repetition_penalty: 1.2
nmf_factors: 15
nmf_epochs: 10
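
As a rough illustration, the listed values could be collected into a configuration like the one below. The four sampling names coincide with keyword arguments accepted by text generation in Hugging Face Transformers, but the summary does not say which library was used, so that mapping is an assumption. The `do_sample` flag is also an assumption that does not appear in the source.

```python
# Sampling settings as listed above.
SAMPLING = {
    "temperature": 0.6,
    "top_p": 0.9,
    "top_k": 50,
    "repetition_penalty": 1.2,
}

# Settings for the NMF baseline as listed above.
NMF_BASELINE = {"nmf_factors": 15, "nmf_epochs": 10}

# Assumption (not from the source): in Hugging Face Transformers, sampling has to
# be enabled explicitly for temperature/top_p/top_k to have any effect.
generate_kwargs = {"do_sample": True, **SAMPLING}
print(generate_kwargs)
```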

Compute: Inference was run in float16 or bfloat16 precision; runtimes are logged in Appendix A of the paper (not fully detailed in the main text)

Comparison to Prior Work

vs. LLM-Direct: LFM forces an intermediate human-readable bottleneck (profile text) for interpretability
vs. NMF: LFM draws on the semantic knowledge of LLMs rather than interaction-matrix statistics alone, and it is zero-shot
vs. P5/ICAE: LFM prioritizes human-readability of the user representation over purely latent optimization

Limitations

Zero-shot approach prevents learning from background data (other users' histories)
LLM outputs can be unreadable or unparseable ('reliability' issue)
Suffers from integer-prediction bias: the LLM tends to predict whole numbers such as 4 or 5 rather than fractional values such as 4.3
High latency compared to matrix factorization (though profile generation can be offline)
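
To make the reliability limitation concrete, here is one way a parse-success rate could be measured. This is a sketch under stated assumptions: the parsing rule (first number in the output), the accepted range of 0.5 to 5.0, and the function names `parse_rating` and `reliability` are hypothetical and may differ from what the paper does.

```python
import re
from typing import Optional, Sequence

# Hypothetical parsing rule: take the first number that appears in the output.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_rating(text: str, low: float = 0.5, high: float = 5.0) -> Optional[float]:
    """Return the first number in `text` if it lies in [low, high], else None."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    value = float(match.group())
    return value if low <= value <= high else None


def reliability(outputs: Sequence[str]) -> float:
    """Fraction of model outputs that parse to a valid rating."""
    parsed = [parse_rating(o) for o in outputs]
    return sum(p is not None for p in parsed) / len(outputs)


outputs = [
    "4",
    "I would rate this 3.5 out of 5.",
    "As an AI, I cannot say.",   # no number at all
    "Rating: 11",                # number outside the assumed range
]
print([parse_rating(o) for o in outputs])
print(reliability(outputs))
```

Under this rule, two of the four sample outputs parse, so the reliability is 0.5. A stricter or looser parser would give a different score for the same outputs, so reliability figures are only comparable when the parsing rule is held fixed.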

Reproducibility

Source code will be released on GitHub (the repository is currently private). Prompts are provided in Appendix C. The dataset is the standard MovieLens Tag Genome Dataset 2021.

📊 Experiments & Results

Evaluation Setup

Movie recommendation using MovieLens Tag Genome Dataset 2021

Benchmarks:

MovieLens Tag Genome 2021 (Rating Prediction, Pairwise Preference, Pairwise Choice)

Metrics:

RMSE (Root Mean Square Error)
MAE (Mean Absolute Error)
Error Rate (for pairwise tasks)
Reliability (Parse success rate)
Statistical methodology: Not explicitly reported in the paper
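
The error metrics listed above have standard definitions, sketched below in plain Python. The sample data is made up for illustration, and treating the pairwise error rate as the fraction of mismatched choices is an assumption about how the paper scores those tasks.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error: square root of the mean squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def error_rate(actual: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of pairwise decisions where the prediction differs from the truth."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


# Illustrative data only: true ratings with half-point values, integer-only predictions.
y_true = [4.5, 3.0, 2.5, 5.0]
y_pred = [4.0, 3.0, 3.0, 5.0]
print(rmse(y_true, y_pred))
print(mae(y_true, y_pred))
print(error_rate(["A", "B", "A", "B"], ["A", "B", "B", "B"]))
```

The sample data also illustrates the integer-prediction bias noted under Limitations: a predictor restricted to whole numbers is off by at least 0.5 on every half-point rating, which puts a floor under its RMSE and MAE.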

Key Results

Reliability results comparing the ability of models to produce parseable outputs.

Benchmark	Metric	Baseline	This Paper	Δ
MovieLens (Reliability)	Reliability Score	See Figure 2	See Figure 2	-

Performance comparisons on prediction tasks. (Note: Precise numeric values are not explicitly tabulated in the text, only plotted in Figures 3-6. Qualitative descriptions are used below based on the text's analysis of the plots.)

Benchmark	Metric	Baseline	This Paper	Δ
MovieLens	Error Rate / RMSE	High error (visual from Fig 3/6)	Lower error (visual from Fig 3/6)	-
MovieLens	Rating Prediction Error	Lower error (visual from Fig 3)	Higher error (visual from Fig 3)	-
MovieLens	Performance with Background Data	Greatly improved performance	Constant performance	-

Experiment Figures

Test error rates for rating prediction (RMSE/MAE), pairwise preference, and choice prediction across varying user history sizes.

Comparison of LFM vs NMF with increasing amounts of background (cross-user) data.

Main Takeaways

Natural language profiles (LFM) are competitive with direct LLM prediction, suggesting text summaries capture the necessary signal for recommendation.
LFM outperforms conventional Matrix Factorization in cold-start settings but falls behind when abundant background data allows NMF to learn better embeddings.
Profile length (50 vs 200 words) has minimal impact on rating prediction accuracy, suggesting short summaries are sufficient.
Zero-shot LLMs exhibit bias (e.g., integer-only predictions) and reliability issues (unparseable text) that limit performance compared to fine-tuned or trained baselines.

📚 Prerequisite Knowledge

Prerequisites

Matrix Factorization for recommendation
Large Language Models (Prompting, Zero-shot learning)
Cold-start problem in recommender systems

Key Terms

LFM: Language-based Factorization Model—the proposed architecture where user profiles are natural language text rather than vectors

Matrix Factorization: A conventional technique that decomposes a user-item interaction matrix into two lower-dimensional matrices (user and item embeddings)

NMF: Non-negative Matrix Factorization, a type of matrix factorization in which the entries of both factor matrices are constrained to be non-negative

Cold-start: The scenario where a recommender system has little to no data about a user or item, making prediction difficult

Zero-shot: Using a model to perform a task without any specific training examples for that task, relying only on its pre-trained knowledge

RMSE: Root Mean Square Error, the square root of the mean squared difference between predicted and observed values; it penalizes large errors more heavily than MAE

MAE: Mean Absolute Error—a metric measuring the average magnitude of errors in a set of predictions

Steerability: The ability for a user to directly influence or control the system's output (e.g., by editing their profile text)
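
For readers who prefer notation, the Matrix Factorization and NMF entries above can be written as follows. The symbols are illustrative and not taken from the paper; reading the `nmf_factors` hyperparameter as the number of latent factors k is an assumption.

```latex
% R: m x n user-item rating matrix; k: number of latent factors (k much smaller than m, n)
R \approx U V^{\top}, \qquad U \in \mathbb{R}^{m \times k}, \; V \in \mathbb{R}^{n \times k}

% Predicted rating of user u for item i: inner product of their embeddings
\hat{r}_{ui} = \mathbf{u}_u^{\top} \mathbf{v}_i = \sum_{f=1}^{k} U_{uf} V_{if}

% NMF adds the constraint that every factor entry is non-negative
U_{uf} \ge 0, \quad V_{if} \ge 0 \quad \text{for all } u, i, f
```

In LFM, the row of U that represents a user is replaced by a natural language profile, and the inner product is replaced by a Decoder LLM call.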