Text2Tracks: Prompt-based Music Recommendation via Generative Retrieval

📝 Paper Summary

Generative Retrieval, Conversational Recommendation, Music Recommendation

Text2Tracks frames music recommendation as a generative retrieval task where a language model directly generates semantic track identifiers derived from collaborative filtering embeddings, outperforming text-based and integer-based ID strategies.

Core Problem

Standard LLM-based recommendation generates track titles (text) autoregressively, which is slow (many tokens), requires entity resolution to find the actual track ID, and fails to capture semantic similarity between items.

Why it matters:

Textual titles are inefficient for retrieval because decoding steps scale linearly with title length
Entity resolution is error-prone when multiple tracks share the same title or artist
Naive text generation does not leverage the rich collaborative filtering signals available in music recommendation datasets

Concrete Example: If a user asks for 'upbeat rock', a standard LLM might generate 'Bohemian Rhapsody by Queen'. This requires generating ~10 tokens and then a separate system to map that string to a database ID. If the title is misspelled or ambiguous, the lookup fails.
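The lookup failure described in the example can be illustrated with a toy exact-match resolver. This is only a sketch: the catalog rows, the track IDs, and the `resolve` helper are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical catalog rows: (track_id, title, artist). All IDs are made up.
CATALOG = [
    ("trk_001", "Bohemian Rhapsody", "Queen"),
    ("trk_002", "Bohemian Rhapsody", "Queen"),  # e.g. a live or remastered version
    ("trk_003", "Don't Stop Me Now", "Queen"),
]


def build_title_index(catalog):
    """Map a lowercased 'title by artist' string to every matching track ID."""
    index = defaultdict(list)
    for track_id, title, artist in catalog:
        index[f"{title} by {artist}".lower()].append(track_id)
    return index


def resolve(generated_text, index):
    """Exact-match entity resolution; returns [] when nothing matches."""
    return index.get(generated_text.strip().lower(), [])


index = build_title_index(CATALOG)
ambiguous = resolve("Bohemian Rhapsody by Queen", index)   # two candidate IDs
misspelled = resolve("Bohemian Rapsody by Queen", index)   # typo, no match
print(ambiguous, misspelled)
```

The first call returns two candidates, so a separate disambiguation step is still needed; the second returns nothing because of a single dropped letter.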

Key Novelty

Generative Retrieval with Semantic Track IDs (Text2Tracks)

Replaces text titles with learned semantic IDs: short sequences of tokens derived from collaborative filtering embeddings (vectors representing listening patterns)
Discretizes continuous track embeddings into hierarchical tokens using sparse coding, so tracks with similar listening patterns share ID prefixes
Fine-tunes a single transformer to map natural language prompts directly to these concise track IDs, skipping external entity resolution
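To make the prefix-sharing idea concrete, here is a minimal sketch that turns an embedding into a short token sequence with greedy matching pursuit, one common way to compute a sparse code. It is an assumption for illustration, not the paper's algorithm: the summary does not specify the exact procedure, the dictionary below is hand-built rather than learned, and the 4-atom, 4-dimensional setup is a toy (the reported configuration uses a dictionary size of 256 and a coding size of 3).

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def semantic_id(embedding, dictionary, coding_size=3):
    """Greedy matching pursuit over unit-norm atoms.

    At each step, pick the atom most aligned with the current residual,
    emit its index as the next ID token, and subtract its contribution.
    """
    residual = list(embedding)
    tokens = []
    for _ in range(coding_size):
        scores = [dot(residual, atom) for atom in dictionary]
        best = max(range(len(dictionary)), key=scores.__getitem__)
        coeff = scores[best]
        residual = [r - coeff * a for r, a in zip(residual, dictionary[best])]
        tokens.append(best)
    return tokens


# Toy dictionary: the four axis-aligned unit vectors (hypothetical, not learned).
DICTIONARY = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

track_a = [0.9, 0.4, 0.1, 0.0]  # made-up CF embedding
track_b = [0.8, 0.5, 0.0, 0.1]  # close to track_a
track_c = [0.0, 0.1, 0.9, 0.3]  # a different listening pattern

id_a = semantic_id(track_a, DICTIONARY)
id_b = semantic_id(track_b, DICTIONARY)
id_c = semantic_id(track_c, DICTIONARY)
print(id_a, id_b, id_c)
```

In this toy setup the two nearby embeddings receive IDs that agree on their first two tokens, while the third track starts with a different token, which is the property the decoder can exploit.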

Architecture

The Text2Tracks training and inference pipeline.

Evaluation Highlights

Semantic IDs (cf-based) outperform standard artist-name-track-name text generation by ~48% in Hits@10
Reduces decoding steps by ~7.5x compared to generating full text titles, significantly speeding up inference
Text2Tracks outperforms strong dense retrieval baselines (Bi-Encoder) and sparse baselines (BM25) on the prompt-to-track retrieval task

Breakthrough Assessment

7/10

Strong application of generative retrieval to the music domain with a novel ID discretization strategy. Generative retrieval itself is known, but adapting it to IDs derived from collaborative filtering embeddings directly addresses the latency and entity resolution issues of text-title generation.

⚙️ Technical Details

Problem Definition

Setting: Generative track retrieval, i.e., mapping a natural language query Q to a subset of relevant track IDs {t1, ..., tm} from a collection T

Inputs: Natural language query Q (concatenation of user utterances in conversational setting)

Outputs: A sequence of tokens representing the unique identifier of a relevant music track

Pipeline Flow

Query Processing (Concatenate utterances)
Language Model Backbone (T5-based encoder-decoder)
Diversified Beam Search Decoding

System Modules

Query Processor

Formats the input dialogue or prompt into a single query string

Model or implementation: Deterministic string concatenation
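A minimal sketch of such a deterministic query processor follows. The function name `build_query` and the single-space separator are assumptions; the summary only states that utterances are concatenated.

```python
def build_query(utterances, separator=" "):
    """Join the non-empty utterances, in order, into one query string."""
    cleaned = [u.strip() for u in utterances if u and u.strip()]
    return separator.join(cleaned)


query = build_query(["Something for a road trip", "  more upbeat rock please ", ""])
print(query)
```

Truncation to the 64-token input limit would happen later, at tokenization time, and is not handled here.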

Generative Retriever

Maps the natural language query directly to track identifiers

Model or implementation: T5-Base (220M parameters)

Decoder

Generates the final list of track IDs while ensuring variety

Model or implementation: Diversified Beam Search

Novel Architectural Elements

Integration of sparse-coding-based discretization (Semantic IDs) directly into the target vocabulary of a seq2seq retriever for music tracks
Utilization of collaborative filtering embeddings as the source space for learning these semantic IDs

Modeling

Base Model: T5-Base (220M parameters)

Training Method: Supervised Fine-Tuning (Seq2Seq)

Objective Functions:

Purpose: Minimize the difference between generated tokens and ground truth track IDs.

Formally: Standard Cross-Entropy Loss over the sequence of target ID tokens.
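Written out, a standard sequence-level cross-entropy of this kind takes the following form. The notation is an assumption introduced here for illustration: D is the set of (query, track ID) training pairs, y = (y_1, ..., y_L) is the token sequence of the target track ID, and θ denotes the model parameters.

```latex
\mathcal{L}(\theta)
  = - \sum_{(Q,\, y) \in \mathcal{D}} \; \sum_{j=1}^{L}
      \log p_{\theta}\!\left( y_j \mid y_{<j},\, Q \right)
```

Each ID token is predicted conditioned on the query and on the ID tokens generated before it, so a shorter ID means fewer terms in the inner sum and fewer decoding steps at inference.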

Training Data:

Dataset of 100k playlists with title/description as queries
Filtered to English language
Split: 100k playlists for train, 5k for validation, 10k for test
Catalog size: 27,872 unique tracks

Key Hyperparameters:

batch_size: 256
learning_rate: 1e-3
optimizer: Adafactor
max_input_length: 64 tokens
max_output_length: 20 tokens
beam_size: 20
num_return_sequences: 10
dictionary_size (s): 256
coding_size (c): 3
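For reference, the reported values can be collected into a single configuration object. The class name, field names, and the two sanity checks are illustrative assumptions; only the values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text2TracksConfig:
    # Values as listed in the summary; the container itself is hypothetical.
    batch_size: int = 256
    learning_rate: float = 1e-3
    optimizer: str = "Adafactor"
    max_input_length: int = 64    # tokens
    max_output_length: int = 20   # tokens
    beam_size: int = 20
    num_return_sequences: int = 10
    dictionary_size: int = 256    # s
    coding_size: int = 3          # c

    def validate(self):
        """Basic consistency checks (assumed, not stated in the paper)."""
        if self.num_return_sequences > self.beam_size:
            raise ValueError("cannot return more sequences than beams")
        if self.coding_size > self.max_output_length:
            raise ValueError("ID length exceeds the maximum output length")


config = Text2TracksConfig()
config.validate()
print(config)
```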

Compute: Not reported in the paper

Comparison to Prior Work

vs. DSI: Text2Tracks specifically optimizes ID creation using collaborative filtering signals rather than hierarchical k-means on document text
vs. Naive LLM Recommendation: Generates learned IDs instead of raw text, avoiding entity resolution steps
vs. Dense Retrieval (Bi-Encoder): Generative approach stores item knowledge in parameters rather than an external vector index

Limitations

Requires retraining the model or updating the ID vocabulary when new tracks are added (cold start problem for new items)
Evaluated on a relatively small catalog subset (~28k tracks) compared to real-world scale (millions)
Proprietary dataset makes direct comparison or reproduction difficult
Relies on existing collaborative filtering embeddings, assuming they are available and high-quality

Reproducibility

No code or model weights provided. The dataset is proprietary (playlist data). The ID discretization algorithm (sparse coding) is described mathematically but no reference implementation is linked.

📊 Experiments & Results

Evaluation Setup

Retrieving relevant tracks given a playlist title/description as a prompt

Benchmarks:

Proprietary Playlist Dataset (Prompt-based track retrieval) [New]

Metrics:

Hits@1
Hits@10
MRR@10
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparison of different ID strategies within the Text2Tracks framework. 'learned-cf' refers to the proposed semantic IDs based on collaborative filtering.
Playlist Dataset	Hits@10	11.12	16.48	+5.36
Playlist Dataset	Hits@10	0.19	16.48	+16.29
Playlist Dataset	Hits@10	11.00	16.48	+5.48
Comparison against standard retrieval baselines.
Playlist Dataset	Hits@10	12.80	16.48	+3.68
Playlist Dataset	Hits@10	9.20	16.48	+7.28
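The Δ column reports absolute differences in Hits@10. A short calculation converts them into relative gains; the baseline values are copied from the rows above in order, and the helper name `relative_gain` is an assumption.

```python
PROPOSED_HITS_AT_10 = 16.48
# Baseline Hits@10 values, in the order the rows appear in the table.
BASELINE_HITS_AT_10 = [11.12, 0.19, 11.00, 12.80, 9.20]


def relative_gain(baseline, proposed):
    """Relative improvement of `proposed` over `baseline`, in percent."""
    return 100.0 * (proposed - baseline) / baseline


for baseline in BASELINE_HITS_AT_10:
    absolute = PROPOSED_HITS_AT_10 - baseline
    relative = relative_gain(baseline, PROPOSED_HITS_AT_10)
    print(f"{baseline:5.2f} -> {PROPOSED_HITS_AT_10:.2f}: "
          f"+{absolute:.2f} absolute, +{relative:.1f}% relative")
```

The first row works out to roughly a 48% relative gain, which is consistent with the ~48% figure quoted in the Evaluation Highlights.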

Experiment Figures

Visual explanation of the three categories of ID strategies: Content-based (text), Integer-based, and Learned (semantic).

Main Takeaways

Semantic IDs derived from collaborative filtering (CF) embeddings are the most effective representation, significantly outperforming text-based and random integer IDs.
The method cuts the number of decoding steps by ~7.5x compared to generating full text titles, because semantic IDs are much shorter; fewer steps translate into faster inference.
Generative Retrieval (Text2Tracks) outperforms traditional Dense (Bi-Encoder) and Sparse (BM25) retrieval methods in this domain.
Text-based embeddings for ID generation perform similarly to raw text generation, suggesting that the unique value comes from the behavioral signals in CF embeddings.

📚 Prerequisite Knowledge

Prerequisites

Generative Retrieval (GR) concepts
Collaborative Filtering (CF) embeddings
Vector Quantization / Discretization techniques

Key Terms

Generative Retrieval: A search paradigm where a model generates the identifier of a document/item directly, rather than matching a query vector against a database index

Collaborative Filtering: A recommendation technique that predicts user preference based on patterns of items co-occurring in user history (e.g., playlists)

Semantic IDs: Item identifiers constructed such that similar items have similar IDs (e.g., sharing prefixes), allowing the model to learn relationships between items

Sparse Coding: A representation learning method where data is approximated as a sparse linear combination of a small set of basis vectors (dictionary)

Hits@k: A metric counting the proportion of queries for which at least one relevant item appears in the top-k retrieved results

MRR: Mean Reciprocal Rank, a metric that evaluates ranking quality by averaging the reciprocal of the rank of the first relevant item

Diversified Beam Search: A decoding algorithm that penalizes generating similar sequences to ensure the set of recommended items covers different aspects of the request
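The Hits@k and MRR definitions above can be sketched in a few lines. The function names and the toy rankings are illustrative assumptions; the paper's evaluation code is not available.

```python
def hits_at_k(ranked, relevant, k):
    """1.0 if at least one of the top-k ranked items is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant item within the top k, else 0.0."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_metric(metric, runs, k):
    """Average a per-query metric over (ranked_list, relevant_set) pairs."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Toy queries: (ranked track IDs, set of relevant track IDs). Made-up data.
runs = [
    (["a", "b", "c"], {"b"}),        # first relevant item at rank 2
    (["d", "e", "f"], {"x"}),        # no relevant item retrieved
    (["g", "h", "i"], {"g", "i"}),   # first relevant item at rank 1
]

hits1 = mean_metric(hits_at_k, runs, 1)
hits3 = mean_metric(hits_at_k, runs, 3)
mrr3 = mean_metric(reciprocal_rank_at_k, runs, 3)
print(hits1, hits3, mrr3)
```

On the toy data, one of three queries has a hit at rank 1 and two of three have a hit within the top 3, while MRR@3 averages the reciprocal ranks 1/2, 0, and 1.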