NoteLLM: A Retrievable Large Language Model for Note Recommendation

📝 Paper Summary

Item-to-Item (I2I) Recommendation
LLMs for Recommendation

NoteLLM jointly trains a Large Language Model to compress notes into embeddings for retrieval and generate hashtags/categories, enhancing item-to-item recommendation through multi-task learning.

Core Problem

Existing item-to-item (I2I) recommendation methods typically use BERT-based models that underutilize key conceptual cues like hashtags and categories, treating them merely as text rather than core semantic summaries.

Why it matters:

Hashtags and categories represent the central ideas of user-generated notes, which are crucial for determining content relatedness but are often diluted in long text
Standard BERT-based embeddings fail to capture the generative connection between a note's content and its condensed summary (hashtag), missing a strong signal for relevance
LLMs have superior language understanding but are rarely used for I2I retrieval due to the challenge of adapting generative models to produce dense vector representations for millions of items

Concrete Example: A note about a trip might mention 'Marina Bay Sands' and 'Merlion Park'. A standard model might match this to general travel posts. However, the hashtag '#Singapore' explicitly defines the core topic. If the model cannot generate or predict this hashtag, it might miss the strong connection to other notes explicitly tagged '#Singapore'.

Key Novelty

Unified Generative-Contrastive Learning for Note Compression

Compresses an entire note into a single virtual token (via a prompt) that serves as the note's dense embedding for retrieval
Jointly trains the LLM on two tasks: (1) Contrastive learning to pull embeddings of co-occurring notes closer, and (2) Generative learning to produce valid hashtags/categories from the compressed token
The generative task forces the compressed token to retain high-level semantic concepts (like topics), while the contrastive task injects collaborative user preference signals

Architecture

Figure: the NoteLLM framework, showing the prompt structure, the LLM processing, and the two training branches: Generative-Contrastive Learning (GCL) and Collaborative Supervised Fine-tuning (CSFT).

Evaluation Highlights

+15.1% relative improvement in Recall@1 (11.23 → 12.93) over the online BERT-based baseline in offline experiments on the Xiaohongshu dataset
+12.8% relative improvement in AUC (0.602 → 0.679) in online A/B testing on the Xiaohongshu platform compared to the previous production system
Outperforms standard sentence embedding models (e.g., SimCSE, Sentence-BERT) by significant margins on precision and recall metrics

Breakthrough Assessment

7/10

Novel application of LLMs to item-to-item recommendation via token compression. Strong industrial results (Xiaohongshu) validate the approach, though the core idea combines existing contrastive and generative techniques.

⚙️ Technical Details

Problem Definition

Setting: Item-to-Item (I2I) note recommendation where a target note is used to retrieve relevant notes from a pool

Inputs: A target note n_i containing title, hashtag, category, and content

Outputs: A ranked list of top-k similar notes from the candidate pool
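
The retrieval step implied by this setting can be sketched as a nearest-neighbour search over note embeddings. This is a minimal illustration, not the paper's serving code: cosine similarity as the scoring function, the brute-force scan, the function names, and the toy 2-d vectors are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_similar(target_id, embeddings, k):
    """Return the k note ids most similar to target_id, best first."""
    target = embeddings[target_id]
    scored = [
        (cosine(target, emb), note_id)
        for note_id, emb in embeddings.items()
        if note_id != target_id  # never recommend the target note itself
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [note_id for _, note_id in scored[:k]]

# Toy embeddings with hypothetical values; real ones come from the [EMB] token.
embeddings = {
    "n1": [1.0, 0.0],
    "n2": [0.9, 0.1],
    "n3": [0.0, 1.0],
    "n4": [0.7, 0.7],
}
ranked = top_k_similar("n1", embeddings, k=2)
```

At the scale of millions of notes, a production system would more plausibly use an approximate nearest-neighbour index than a full scan; the summary does not say which one is used.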

Pipeline Flow

Prompt Construction: Input note → Note Compression Prompt
LLM Encoder: Prompt → [EMB] token hidden state
Contrastive Branch: [EMB] state → GCL Loss (match related notes)
Generative Branch: [EMB] state → CSFT Loss (generate hashtags/categories)

System Modules

Note Compression Prompt

Wrap note content in a template that requests compression and hashtag generation

Model or implementation: Template-based string formatter
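
A template-based formatter of this kind might look like the sketch below. The exact wording of the paper's prompt is not given in this summary, so the instruction text, the field order, and the choice to hide the field being generated from the input are all assumptions; only the [EMB] token and the two generation tasks come from the summary.

```python
EMB_TOKEN = "[EMB]"  # special token named in the summary

def build_prompt(note, task):
    """Format a note dict into a compression prompt for one generation task.

    The wording is hypothetical; only the overall shape follows the summary:
    note text, then the [EMB] slot, then a request for hashtags or a category.
    """
    if task not in ("hashtags", "category"):
        raise ValueError(f"unknown task: {task}")
    # Assumption: the field the model must generate is left out of the input.
    fields = [name for name in ("title", "hashtags", "category", "content")
              if name != task]
    lines = [f"{name}: {note[name]}" for name in fields]
    return (
        "Compress the following note into one word.\n"
        + "\n".join(lines)
        + f'\nThe compression word is: "{EMB_TOKEN}". '
        + f"Generate the {task} of this note: "
    )

note = {
    "title": "Three days in the city",
    "hashtags": "#Singapore",
    "category": "Travel",
    "content": "Marina Bay Sands and Merlion Park",
}
prompt = build_prompt(note, "hashtags")
```

Placing the [EMB] slot before the generation request means any generated hashtag or category has to be predicted from a sequence in which the compressed token summarizes the note.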

LLM Backbone

Process the prompt to produce a hidden representation for the [EMB] token and generate text

Model or implementation: LLaMA 2 (7B, 13B) or Ziya-13B

Recommendation Head (GCL)

Project [EMB] hidden state to embedding space and compute contrastive loss

Model or implementation: Linear Projection Layer

Generation Head (CSFT)

Generate hashtags or categories from the compressed state

Model or implementation: LLM Language Modeling Head

Novel Architectural Elements

Dual-task prompt architecture where a single special token [EMB] acts as both the bottleneck for generation and the embedding for retrieval

Modeling

Base Model: LLaMA 2 (7B, 13B) and Ziya-13B

Training Method: Multi-task learning combining Contrastive Learning and Supervised Fine-Tuning (Instruction Tuning)

Objective Functions:

Purpose: Pull embeddings of co-occurring notes together and push others apart.
Formally: InfoNCE-style loss L_cl = - log(exp(sim(n_i, n_i+)/tau) / sum(exp(sim(n_i, n_neg)/tau)))

Purpose: Train model to generate correct hashtags/categories.
Formally: Standard language modeling loss L_gen = - sum(log P(o_t | o_<t, i))

Purpose: Combine both objectives.
Formally: L = L_cl + alpha * L_gen
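
The three objectives can be sketched on plain Python lists as follows. This is an illustration under stated assumptions rather than the authors' implementation: cosine similarity stands in for sim, the denominator includes the positive pair alongside the negatives (the common InfoNCE form), the value tau=0.1 is hypothetical, and the generation loss takes per-token probabilities that a real model would produce.

```python
import math

def cosine(u, v):
    """Cosine similarity, used here as the sim(.,.) function (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss for one anchor; tau=0.1 is a hypothetical default."""
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, neg) / tau for neg in negatives]
    # log-sum-exp with the max subtracted for numerical stability
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

def generation_loss(token_probs):
    """Negative log-likelihood: -sum_t log P(o_t | o_<t, i)."""
    return -sum(math.log(p) for p in token_probs)

def total_loss(l_cl, l_gen, alpha=1.0):
    """Combined objective L = L_cl + alpha * L_gen."""
    return l_cl + alpha * l_gen

# A well-aligned positive gives a much smaller loss than a misaligned one.
loss_aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In batched training the negatives are typically the other notes in the same batch, which is why the contrastive term benefits from a reasonably large batch size.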

Adaptation: Full fine-tuning (implied by context of adapting LLM for embeddings)

Trainable Parameters: Not explicitly specified as PEFT or full, but context suggests updating LLM weights

Training Data:

Dataset: Xiaohongshu (industrial social media platform)
Data split: Training set (behavior data from one week), Testing set (next day's data)
Co-occurrence construction: Pairs (n_A, n_B) where a user views note A and then clicks note B, with each pair weighted by user activity.
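
One way to build such weighted pairs is sketched below. The summary only says pairs are "weighted by user activity", so the specific rule used here (each user's clicks are scaled by one over that user's click count, so heavy users do not dominate) is an assumption, as are the function name and the event tuple layout.

```python
from collections import Counter, defaultdict

def cooccurrence_scores(events):
    """Aggregate view-then-click events into weighted note pairs.

    events: list of (user_id, viewed_note, clicked_note) tuples.
    Returns a dict mapping (viewed_note, clicked_note) to a score.
    """
    clicks_per_user = Counter(user for user, _, _ in events)
    scores = defaultdict(float)
    for user, viewed, clicked in events:
        # Assumed weighting: every user contributes a total weight of 1.
        scores[(viewed, clicked)] += 1.0 / clicks_per_user[user]
    return dict(scores)

# Hypothetical log: u1 clicks twice, u2 clicks once.
events = [
    ("u1", "A", "B"),
    ("u1", "A", "C"),
    ("u2", "A", "B"),
]
scores = cooccurrence_scores(events)
# Related notes for A, strongest first.
related_to_a = sorted(
    (pair[1] for pair in scores if pair[0] == "A"),
    key=lambda clicked: scores[("A", clicked)],
    reverse=True,
)
```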

Key Hyperparameters:

batch_size: 120
learning_rate: 5e-5
epochs: 2
max_length: 1024
learning_rate_scheduler: cosine
warmup_ratio: 0.03
weight_decay: 0.001
alpha (loss weight): 1.0
r (ratio of hashtag tasks): 0.5 (half of each batch for hashtag generation, half for category generation)
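
To make two of these settings concrete, the sketch below shows how a cosine schedule with warmup_ratio 0.03 and the task ratio r could be applied. The linear shape of the warmup, the decay all the way to zero, and the total step count are assumptions; the summary lists only the scheduler name and the ratio values.

```python
import math

def learning_rate(step, total_steps, peak_lr=5e-5, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Task split inside one batch, using the listed batch_size and r.
batch_size = 120
r = 0.5
n_hashtag = int(batch_size * r)      # samples used for hashtag generation
n_category = batch_size - n_hashtag  # samples used for category generation

# total_steps=1000 is a hypothetical value for illustration only.
schedule = [learning_rate(s, total_steps=1000) for s in (0, 29, 30, 515, 1000)]
```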

Compute: 8 NVIDIA A100 (40G) GPUs

Comparison to Prior Work

vs. BERT/SimCSE: NoteLLM uses an LLM backbone (LLaMA-2) instead of BERT and integrates generation tasks to inform the embedding
vs. Unsupervised methods: NoteLLM leverages behavioral co-occurrence signals (collaborative filtering implicit in data) rather than just semantic similarity
vs. Prompt-BERT [not cited in paper]: Similar idea of using prompts for embeddings, but NoteLLM adds the generative auxiliary task (hashtags) which is key to its performance

Limitations

Dependency on large-scale behavioral data for co-occurrence label construction (cold start problem for new notes without clicks)
High computational cost for training and inference compared to BERT-based models due to LLM size (7B+ params)
Effectiveness relies on the quality of user-generated hashtags, which can be noisy or missing

Reproducibility

Code: https://github.com/W-Shiwei/NoteLLM

Code is publicly available at https://github.com/W-Shiwei/NoteLLM. The dataset is proprietary (Xiaohongshu) and is likely not released because of privacy and its industrial nature; the paper describes its results as 'Extensive validations on real scenarios'.

📊 Experiments & Results

Evaluation Setup

Offline evaluation using Recall@k on real-world data and Online A/B testing

Benchmarks:

Xiaohongshu Dataset (I2I Recommendation) [New]

Metrics:

Recall@10
Recall@50
Recall@1
AUC (for online test)

Statistical methodology: Not explicitly reported in the paper

Key Results

Main offline comparison shows NoteLLM significantly outperforming baselines on the Xiaohongshu dataset.
Benchmark	Metric	Baseline	This Paper	Δ
Xiaohongshu (Offline)	Recall@1	11.23	12.93	+1.70
Xiaohongshu (Offline)	Recall@10	27.46	32.14	+4.68
Xiaohongshu (Offline)	Recall@50	46.22	52.32	+6.10
Xiaohongshu (Online A/B Test)	AUC	0.602	0.679	+0.077

Ablation studies demonstrate the contribution of each component (CSFT and GCL).
Benchmark	Metric	Baseline	This Paper	Δ
Xiaohongshu (Offline)	Recall@10	30.15	32.14	+1.99
Xiaohongshu (Offline)	Recall@10	4.56	32.14	+27.58
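
The Δ column reports absolute differences, while the percentages in the Evaluation Highlights are relative gains. The short check below recomputes the relative gains from the main-comparison rows; the helper name is ours and the numbers are copied from the table.

```python
def relative_gain(baseline, new):
    """Relative improvement of new over baseline, in percent."""
    return (new - baseline) / baseline * 100.0

# (baseline, this paper) pairs copied from the main comparison rows.
rows = {
    "Recall@1": (11.23, 12.93),
    "Recall@10": (27.46, 32.14),
    "Recall@50": (46.22, 52.32),
    "AUC": (0.602, 0.679),
}
gains = {name: round(relative_gain(b, n), 1) for name, (b, n) in rows.items()}
# gains == {"Recall@1": 15.1, "Recall@10": 17.0, "Recall@50": 13.2, "AUC": 12.8}
```

The Recall@1 and AUC values match the +15.1% and +12.8% figures quoted in the highlights.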

Experiment Figures

Comparison of different input cues for note recommendation (Title, Content, Hashtag/Category) and how they overlap.

Main Takeaways

Integrating hashtag generation (CSFT) with contrastive learning (GCL) significantly boosts recommendation performance, confirming that learning to summarize topics helps embedding quality.
LLMs (LLaMA-2) vastly outperform BERT-based baselines for this task, likely due to better semantic understanding and capacity.
The method is robust across different LLM backbones (LLaMA-2-7B, 13B, Ziya-13B), with larger models generally performing better.
Ablations show that GCL is the primary driver of performance, but CSFT provides a critical boost by refining the semantic focus of the embeddings.

📚 Prerequisite Knowledge

Prerequisites

Contrastive Learning (SimCLR style)
Large Language Models (LLMs) and Instruction Tuning
Item-to-Item (I2I) Recommendation basics

Key Terms

I2I recommendation: Item-to-Item recommendation—recommending items similar to a specific target item the user is currently viewing

GCL: Generative-Contrastive Learning—a training approach combining contrastive loss (distinguishing related items) and generative loss (producing text/hashtags)

CSFT: Collaborative Supervised Fine-tuning—fine-tuning the model to generate hashtags/categories, termed 'Collaborative' because it enhances the embedding used for recommendation

Note Compression Prompt: A specific prompt template designed to instruct the LLM to compress input text into a single special token ([EMB]) before generating output

co-occurrence mechanism: A method to define 'related' notes based on user behavior: if users frequently view Note A then click Note B, they are considered related

virtual token: A special token (e.g., [EMB]) inserted into the sequence whose hidden state is treated as the dense vector representation of the entire input

AUC: Area Under the ROC Curve—a metric measuring the ability of a classifier to distinguish between positive and negative samples

Recall@k: The proportion of relevant items found in the top-k recommendations
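
The two metrics defined above can be computed as in this minimal sketch. The function names and toy data are illustrative, and the pairwise AUC formulation is a standard equivalent of the area under the ROC curve rather than anything specific to the paper.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of ranked."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of samples,
    which is fine for small illustrations.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy data with hypothetical values.
recall = recall_at_k(["a", "b", "c", "d"], {"b", "d"}, k=2)
area = auc([0.9, 0.4], [0.5, 0.1])
```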