TEARS: Textual Representations for Scrutable Recommendations

📝 Paper Summary

Scrutable Recommender Systems · Collaborative Filtering with LLMs

TEARS aligns LLM-generated textual user summaries with collaborative filtering embeddings using optimal transport, enabling users to edit their preferences in natural language while maintaining high recommendation performance.

Core Problem

Traditional recommender systems use high-dimensional numeric embeddings that are opaque (hard to interpret) and offer limited control to users, who can typically only influence recommendations through coarse interactions like clicks.

Why it matters:

Users lack transparency into why items are recommended, as numeric latent vectors are uninterpretable.
Correcting bad recommendations is tedious; users must consume/rate more items hoping for a change rather than directly editing their profile.
Existing scrutable methods (like tag clouds) impose high cognitive load or sacrifice performance compared to black-box models.

Concrete Example: A user wants to stop receiving horror movie recommendations. In standard systems, they must laboriously rate horror movies negatively. In TEARS, they can simply delete 'Horror' from their textual summary or add 'I dislike scary movies' to immediately update the recommendations.

Key Novelty

TExtuAl Representations for Scrutable recommendations (TEARS)

Replaces or augments opaque numeric user embeddings with natural language summaries generated by an LLM based on user history.
Aligns the textual embedding space with a powerful black-box collaborative filtering space using Optimal Transport (OT) to ensure text edits meaningfully impact recommendations.
Allows a tunable mix (convex combination) of text-based and behavior-based representations, letting users trade off between pure controllability and maximum historical accuracy.

Architecture

The TEARS framework illustrating the dual-encoder setup. It shows how user history is processed by a black-box VAE and user summaries are processed by a text encoder. The two latent representations are aligned via Optimal Transport and combined (via alpha) before decoding.

Evaluation Highlights

TEARS-RecVAE outperforms the standard RecVAE baseline on the ML-1M dataset (NDCG@100: 0.444 vs 0.434), suggesting that adding aligned text representations can improve ranking quality rather than trading it away for controllability.
In a 'flip' controllability task (swapping favorite/least-favorite genres), TEARS achieves a 99.7% success rate in shifting recommendations, compared to 0.0% for a genre-tag baseline.
Using Optimal Transport for alignment yields far better controllability (99.7% Flip Ratio) than the contrastive-loss (24.7%) or JS-divergence (31.3%) alternatives.

Breakthrough Assessment

8/10

Successfully bridges the gap between high-performance black-box recommenders and user-controllable text interfaces. The use of Optimal Transport for alignment is a strong methodological contribution.

⚙️ Technical Details

Problem Definition

Setting: Collaborative filtering for implicit feedback, predicting user preferences from historical interactions and textual summaries.

Inputs: User-item feedback matrix X and generated natural language user summaries S.

Outputs: Predicted relevance scores for items (recommendations) Y.

Pipeline Flow

Summary Generation: LLM creates text summary from user history
Encoding: Text Encoder (T5) and VAE Encoder map summary and history to latent space
Alignment: Optimal Transport aligns text latents with VAE latents
Decoding: Shared decoder predicts items from combined latents
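The four steps above can be sketched as a toy forward pass. This is a minimal illustration, not the authors' implementation: random linear maps stand in for the trained T5 encoder, VAE encoder, and decoder; the summary is assumed to be already embedded as a feature vector; and the choice that alpha weights the text side is an assumption, since the summary does not say which side it multiplies.

```python
import numpy as np

rng = np.random.default_rng(0)

N_ITEMS, TEXT_DIM, LATENT_DIM = 8, 6, 4  # toy sizes, not from the paper

# Hypothetical stand-ins for the trained modules (random linear maps).
W_text = rng.normal(size=(TEXT_DIM, LATENT_DIM))  # role of the text encoder head
W_hist = rng.normal(size=(N_ITEMS, LATENT_DIM))   # role of the VAE (CF) encoder
W_dec = rng.normal(size=(LATENT_DIM, N_ITEMS))    # role of the shared decoder


def encode_text(summary_features):
    """Map a pre-embedded summary to a latent mean."""
    return summary_features @ W_text


def encode_history(interactions):
    """Map a binary interaction vector to a latent mean."""
    return interactions @ W_hist


def mix_latents(z_text, z_hist, alpha):
    """Convex combination; alpha weights the text side here (an assumption)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * z_text + (1.0 - alpha) * z_hist


def decode(z):
    """Softmax over items, as in multinomial-likelihood VAEs."""
    logits = z @ W_dec
    logits = logits - logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def recommend(summary_features, interactions, alpha, k=3):
    """Return the indices of the top-k unseen items."""
    z = mix_latents(encode_text(summary_features),
                    encode_history(interactions), alpha)
    scores = decode(z)
    scores = np.where(interactions > 0, -np.inf, scores)  # hide seen items
    return np.argsort(-scores)[:k]
```

With alpha set to 1 the recommendations depend only on the (editable) summary; with alpha set to 0 they depend only on interaction history.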

System Modules

Summary Generator

Generates a natural language summary of user preferences based on rating history and item metadata.

Model or implementation: GPT-4-turbo (or LLaMA 3.1-405b for reproducibility)

Text Encoder (Q_s) (Representation Learning)

Encodes the textual summary into a latent Gaussian distribution.

Model or implementation: T5-base with LoRA adapters + MLP head

CF Encoder (Q_r) (Representation Learning)

Encodes historical interaction vector into a latent Gaussian distribution (standard VAE encoder).

Model or implementation: Encoder from backbone VAE (e.g., RecVAE, MultVAE)

Shared Decoder (D)

Reconstructs user preferences (item probabilities) from the latent representation.

Model or implementation: Decoder from backbone VAE

Novel Architectural Elements

Hybrid latent space construction via Optimal Transport alignment between a textual encoder (T5) and a collaborative filtering VAE encoder.
Inference-time mixing mechanism allowing users to control alpha (the interpolation coefficient) to shift between text-driven and history-driven recommendations.

Modeling

Base Model: T5-base (for text encoding) and various VAE backbones (RecVAE, MultVAE, MacridVAE)

Training Method: Joint training of Text Encoder and Decoder with OT regularization

Objective Functions:

Purpose: Minimize reconstruction error of preferences.
Formally: standard VAE multinomial log-likelihood loss for both text (s) and rating (r) representations.

Purpose: Align text embeddings with rating embeddings.
Formally: 2-Wasserstein distance between Gaussian distributions N(mu_r, sigma_r) and N(mu_s, sigma_s).

Purpose: Regularize the text latent space.
Formally: KL-divergence between Z_s and a standard Gaussian prior N(0, I).
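The three objective terms can be written down compactly under a few stated assumptions. The sketch below assumes diagonal covariances (so sigma is a vector of standard deviations), uses the squared 2-Wasserstein distance, which then has a simple closed form, and combines the terms with hypothetical weights `lambda_ot` and `beta_kl`; the summary does not report how the paper weights them.

```python
import numpy as np


def multinomial_nll(logits, x):
    """Negative multinomial log-likelihood of interactions x under softmax(logits)."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -float((x * log_probs).sum())


def w2_squared_diag(mu_r, sigma_r, mu_s, sigma_s):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    For diagonal covariances this reduces to
    ||mu_r - mu_s||^2 + ||sigma_r - sigma_s||^2, with sigma the std dev.
    """
    return float(((mu_r - mu_s) ** 2).sum() + ((sigma_r - sigma_s) ** 2).sum())


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    var = sigma ** 2
    return float(0.5 * (var + mu ** 2 - 1.0 - np.log(var)).sum())


def total_loss(logits_s, logits_r, x, mu_r, sigma_r, mu_s, sigma_s,
               lambda_ot=1.0, beta_kl=1.0):
    """Weighted sum of the three terms; the weights are hypothetical knobs."""
    recon = multinomial_nll(logits_s, x) + multinomial_nll(logits_r, x)
    ot = w2_squared_diag(mu_r, sigma_r, mu_s, sigma_s)
    kl = kl_to_standard_normal(mu_s, sigma_s)
    return recon + lambda_ot * ot + beta_kl * kl
```

The alignment term vanishes exactly when the text and rating Gaussians coincide, which is the property that lets an edit to the summary move the user to a meaningful point in the collaborative filtering latent space.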

Adaptation: LoRA (Low-Rank Adaptation) on T5-base
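Low-rank adaptation keeps a pretrained weight matrix frozen and learns a small additive update factored as B·A. The NumPy sketch below illustrates the idea on a single toy layer; the dimensions, rank, and scaling constant are illustrative assumptions rather than the paper's settings, and `lora_alpha` is the LoRA scaling constant, unrelated to the interpolation coefficient alpha.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank, lora_alpha = 16, 16, 4, 8  # toy values, not from the paper

W_frozen = rng.normal(size=(d_out, d_in))      # pretrained weight, kept frozen
A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection, zero-initialised


def lora_forward(x, W, A, B, scale):
    """y = W x + scale * B (A x); only A and B would receive gradients."""
    return W @ x + scale * (B @ (A @ x))


x = rng.normal(size=d_in)
scale = lora_alpha / rank
y0 = lora_forward(x, W_frozen, A, B, scale)  # equals W_frozen @ x while B is zero
```

Because B starts at zero, the adapted layer initially reproduces the pretrained one, and the number of trainable values (`A.size + B.size`) is smaller than the size of the full weight matrix whenever the rank is small relative to the layer width.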

Trainable Parameters: T5 LoRA parameters, MLP projection head, Decoder weights (CF encoder weights are frozen)

Training Data:

ML-1M (Movies), Amazon-Books (Books)
User summaries generated via GPT-4-turbo

Key Hyperparameters:

alpha: 0.5 (during training)
summary_length: 200 words
max_items_for_summary: 50
latent_dimension: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. GERS: TEARS uses rich natural language summaries allowing fine-grained control (plots, themes) vs. coarse genre tags.
vs. P5 / LLM-based Recommenders: TEARS aligns text with a high-performance CF latent space rather than using the LLM directly for scoring, avoiding latency/cost issues and retaining CF accuracy.
vs. Concurrent work (Zhou et al., 2024): TEARS focuses on the editing/controllability aspect via OT alignment rather than just profile quality analysis.

Limitations

Reliance on proprietary LLMs (GPT-4) for high-quality summary generation creates a dependency and cost.
Summaries are generated based on a limited history (max 50 items), potentially missing long-tail preferences.
Requires re-encoding the text summary whenever the user makes an edit (though this is faster than retraining the whole model).
Evaluation of controllability relies on simulated user tasks rather than real-user studies.

Reproducibility

Code: https://github.com/Emilianopp/TEARS

User summaries generated by GPT-4-turbo are provided, and LLaMA 3.1-405b summaries were also generated for reproducibility. Hyperparameters for specific baselines (MultVAE, RecVAE) follow the original papers.

📊 Experiments & Results

Evaluation Setup

Top-N recommendation and controllability simulation.

Benchmarks:

ML-1M (Movie Recommendation)
Amazon-Books (Book Recommendation)

Metrics:

NDCG@100 (Ranking quality)
Recall@100
Flip Ratio (Controllability)
Targeted Help Ratio (Controllability)
Correction Ratio (Controllability)
Statistical methodology: Not explicitly reported in the paper
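For reference, the two ranking metrics can be computed per user as follows. This is a minimal sketch that assumes binary relevance and the textbook definitions; the paper's evaluation code may differ in details such as how Recall is normalised.

```python
import numpy as np


def ndcg_at_k(scores, relevant, k):
    """NDCG@k with binary relevance.

    scores: 1-D array of predicted scores, one per item.
    relevant: set of held-out item indices the user interacted with.
    """
    top_k = np.argsort(-scores)[:k]
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    gains = np.array([1.0 if i in relevant else 0.0 for i in top_k])
    dcg = float((gains * discounts[: len(top_k)]).sum())
    ideal_hits = min(k, len(relevant))
    idcg = float(discounts[:ideal_hits].sum())
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(scores, relevant, k):
    """Fraction of the relevant items that appear in the top-k list."""
    top_k = np.argsort(-scores)[:k]
    hits = sum(1 for i in top_k if i in relevant)
    return hits / len(relevant) if relevant else 0.0
```

Dataset-level figures such as NDCG@100 are then typically the mean of the per-user values over the test users.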

Key Results

Performance comparisons show TEARS often matches or exceeds the performance of strong black-box VAE baselines by incorporating text information.
Benchmark	Metric	Baseline	This Paper	Δ
ML-1M	NDCG@100	0.434	0.444	+0.010
Amazon-Books	NDCG@100	0.169	0.170	+0.001
ML-1M	NDCG@100	0.126	0.444	+0.318

Controllability experiments demonstrate the effectiveness of user edits in changing recommendations.
Benchmark	Metric	Baseline	This Paper	Δ
ML-1M (Simulation)	Flip Ratio (%)	0.0	99.7	+99.7
ML-1M (Simulation)	Targeted Help Ratio	0.0	39.0	+39.0

Ablation on alignment loss confirms Optimal Transport (OT) is critical for controllability.
Benchmark	Metric	Baseline	This Paper	Δ
ML-1M	Flip Ratio (%)	24.7	99.7	+75.0

Experiment Figures

Impact of the interpolation coefficient (alpha) on performance (NDCG) and controllability (Flip Ratio).

Main Takeaways

Aligning text embeddings with strong CF embeddings via Optimal Transport allows TEARS to match or slightly exceed the performance of black-box models (e.g., NDCG@100 of 0.444 vs 0.434 on ML-1M) while adding interpretability.
TEARS provides superior controllability compared to tag-based systems (GERS) and alternative alignment methods (Contrastive, JS-Divergence), especially for large preference shifts ('Flip' task).
Pure LLM-based recommendations (Zero-shot GPT-4) perform substantially worse than the hybrid VAE approach, justifying the need for the TEARS architecture.
Text summaries allow for targeted corrections (fixing specific item ranks) that are impossible with coarse genre-based controls.

📚 Prerequisite Knowledge

Prerequisites

Collaborative Filtering (CF) and Variational Autoencoders (VAEs)
Latent variable models
Natural Language Processing (NLP) with LLMs
Optimal Transport (Wasserstein distance)

Key Terms

scrutable: Understandable and editable by users; in this context, referring to user representations in natural language.

VAE: Variational Autoencoder—a generative model used here to learn latent representations of user preferences from interaction data.

Optimal Transport (OT): A mathematical framework for finding the most efficient way to transform one probability distribution into another; used here to align text embeddings with collaborative filtering embeddings.

convex combination: A linear combination of vectors whose coefficients are non-negative and sum to 1; used here to mix text-based and history-based user embeddings.

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that prioritizes correct recommendations at the top of the list.

RecVAE: A specific high-performance Variational Autoencoder architecture for collaborative filtering.

LLM: Large Language Model; used here to generate the user summaries (GPT-4-turbo or LLaMA 3.1-405b), which a T5-base text encoder then maps into the latent space.

Flip Ratio: A custom metric measuring the percentage of times the system successfully recommends items from a previously disliked genre after the user edits their profile to like it.

Targeted Help Ratio: A custom metric measuring how often a specific target item's rank improves after a user edits their summary to describe that item.
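As a rough illustration of the Targeted Help Ratio, the sketch below counts how often a target item's rank improves between the pre-edit and post-edit score vectors. It follows the one-sentence definition above under stated assumptions: rank 1 is best, ties do not count against the item, and the result is reported as a percentage; the paper's exact protocol may differ.

```python
import numpy as np


def item_rank(scores, item):
    """1-based rank of `item` when items are sorted by descending score."""
    return int((scores > scores[item]).sum()) + 1


def targeted_help_ratio(scores_before, scores_after, targets):
    """Share of users (in %) whose target item moves to a better rank after the edit.

    scores_before, scores_after: arrays of shape (n_users, n_items).
    targets: one target item index per user.
    """
    improved = [
        item_rank(after, t) < item_rank(before, t)
        for before, after, t in zip(scores_before, scores_after, targets)
    ]
    return 100.0 * float(np.mean(improved))
```

For example, if one of two simulated users sees the target item climb and the other sees no change, the function returns 50.0.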