Beyond Single Labels: Improving Conversational Recommendation through LLM-Powered Data Augmentation

📝 Paper Summary

Conversational Recommender Systems (CRS) Data Augmentation

This paper augments CRS training data by using LLMs to identify semantically relevant items often missed as false negatives, then employs a two-stage training strategy to balance this semantic augmentation with collaborative filtering signals.

Core Problem

Conversational recommender systems suffer from the 'false negative issue,' where relevant items are incorrectly treated as negative samples during training because the user was never exposed to them.

Why it matters:

Standard training pairs (query, single recommended item) ignore other valid items the user would like, reducing recommendation quality
Simply expanding labels with LLMs risks over-prioritizing semantic relevance while ignoring crucial collaborative information (e.g., item popularity, user trends)
Prior work in traditional recommender systems addresses false negatives, but few approaches exist specifically for the conversational setting

Concrete Example: If a user asks for 'silly cop movies', the training data might only label one specific movie as positive. The system treats all other silly cop movies as negative (irrelevant), preventing the model from learning to recommend them in future similar contexts.

Key Novelty

Two-Stage Semantic-Collaborative Data Augmentation

First, use an LLM to retrieve and score multiple semantically relevant items for a dialogue context, ignoring collaborative signals to avoid bias
Second, train the recommender in two stages: pre-train on these diverse synthetic semantic labels, then fine-tune on the original data to reintegrate collaborative information (like popularity trends)

Architecture

Overview of the proposed approach: Data Synthesis (retrieval + scoring) and Two-Stage Model Training.

Evaluation Highlights

Outperforms the corresponding unaugmented baselines on ReDial and INSPIRED; on ReDial, Recall@50 gains range from +0.008 to +0.028 absolute (roughly +4.5% to +18.9% relative) across the results listed under Key Results
Consistent improvements across multiple base recommender architectures (KGSF, KBRD, UniCRS) and datasets
Demonstrates robustness in user simulation experiments using iEvaLM

Breakthrough Assessment

7/10

Offers a solid, practical solution to the false negative problem in CRS by effectively balancing LLM semantic knowledge with traditional collaborative signals, showing consistent gains.

⚙️ Technical Details

Problem Definition

Setting: Conversational Recommendation where the goal is to recommend items given a dialogue context

Inputs: Dialogue context C (history of utterances) and a set of items I

Outputs: A ranked list of items recommended to the user

Pipeline Flow

Data Synthesis Group: Semantic Retriever → Relevance Scorer
Training Group: Pre-training (Synthetic Data) → Fine-tuning (Real Data)

System Modules

Semantic Retriever (Data Synthesis)

Retrieve potentially relevant items based on semantic similarity between dialogue context and item descriptions

Model or implementation: LLM-based text encoder (off-the-shelf)
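The retrieval step can be illustrated with a minimal sketch, assuming that the dialogue context and each item description have already been embedded as vectors and that similarity is measured by cosine similarity (the summary does not name the similarity function). The embeddings, item names, and the helper names cosine and retrieve_top_k are hypothetical and not taken from the paper's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_top_k(context_vec, item_vecs, k):
    """Return the ids of the k items most similar to the context embedding."""
    scored = [(cosine(context_vec, vec), item_id) for item_id, vec in item_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]

# Hand-made toy embeddings (hypothetical); a real system would obtain
# them from a text encoder applied to item descriptions.
item_vecs = {
    "cop_comedy_a": [0.9, 0.8, 0.1],
    "cop_comedy_b": [0.8, 0.9, 0.0],
    "war_drama": [0.1, 0.0, 0.9],
}
context_vec = [1.0, 1.0, 0.0]  # stands in for "silly cop movies"

top_items = retrieve_top_k(context_vec, item_vecs, k=2)
print(top_items)
```

Both comedy items are retrieved even though a dataset would typically label only one of them as the positive item for this context.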

Relevance Scorer (Data Synthesis)

Filter retrieved items by assigning fine-grained relevance scores

Model or implementation: Gemma2-9b (trained on GPT-4 generated triples)
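The filtering step might look like the following sketch, which keeps retrieved items whose relevance score reaches the 3.5 threshold listed under Key Hyperparameters. The scores are made up, the inclusive comparison (>=) is an assumption, and filter_by_relevance is a hypothetical helper name.

```python
RELEVANCE_THRESHOLD = 3.5  # threshold value quoted in this summary

def filter_by_relevance(scored_candidates, threshold=RELEVANCE_THRESHOLD):
    """Keep (item, score) pairs whose score meets the threshold.

    Assumption: the comparison is inclusive (>=); the summary does not
    say whether a score exactly at the threshold is kept.
    """
    return [(item, score) for item, score in scored_candidates if score >= threshold]

# Hypothetical scorer outputs for one dialogue context.
scored = [("cop_comedy_a", 4.6), ("cop_comedy_b", 3.5), ("war_drama", 1.2)]

synthetic_positives = filter_by_relevance(scored)
print(synthetic_positives)
```

The surviving items serve as additional positive labels for the dialogue context in the synthetic dataset.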

Recommender (Base Model)

The core CRS model being improved (e.g., KGSF, KBRD, UniCRS)

Model or implementation: Various architectures (KGSF, KBRD, UniCRS)

Novel Architectural Elements

Two-stage training pipeline explicitly designed to decouple semantic learning (Stage 1) from collaborative learning (Stage 2)
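The ordering of the two stages can be sketched as a small driver function. This is only a schematic under stated assumptions: train_epoch is a hypothetical callback standing in for one pass of whatever optimizer the base recommender uses, and the epoch counts are placeholders rather than values from the paper.

```python
def two_stage_train(model, synthetic_data, real_data, train_epoch,
                    pretrain_epochs=1, finetune_epochs=1):
    """Run pre-training on synthetic labels, then fine-tuning on real labels."""
    # Stage 1: learn semantic relevance from the LLM-augmented labels.
    for _ in range(pretrain_epochs):
        train_epoch(model, synthetic_data, stage="pretrain")
    # Stage 2: reintroduce collaborative signal from the original labels.
    for _ in range(finetune_epochs):
        train_epoch(model, real_data, stage="finetune")
    return model

# Toy stand-ins (hypothetical): the "training step" only records what it saw.
history = []

def record_epoch(model, data, stage):
    history.append((stage, len(data)))

synthetic_data = [("ctx1", "cop_comedy_a"), ("ctx1", "cop_comedy_b"), ("ctx2", "heist_film")]
real_data = [("ctx1", "cop_comedy_a")]

two_stage_train({}, synthetic_data, real_data, record_epoch,
                pretrain_epochs=2, finetune_epochs=1)
print(history)
```

Keeping the stages in separate loops makes the decoupling explicit: the synthetic labels never mix with the original labels inside a single pass.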

Modeling

Base Model: Gemma2-9b (for relevance scoring), plus various CRS backbones (KGSF, KBRD, UniCRS)

Training Method: Two-stage Supervised Learning: Pre-training on augmented data, Fine-tuning on original data

Objective Functions:

Purpose: Pre-train on synthetic data to learn semantic relevance.

Formally: Standard cross-entropy loss $\mathcal{L}_{\text{pre}} = -\sum_{i,j} y_{ij} \log P(i,j)$ on the synthetic dataset.

Purpose: Fine-tune on real data to integrate collaborative information.

Formally: Cross-entropy loss $\mathcal{L}_{\text{ft}} = -\sum_{i,j} y_{ij} \log P(i,j)$ on the real dataset.

Purpose: Balance semantic and collaborative signals during fine-tuning using label smoothing.

Formally: $\mathcal{L} = (1-\alpha)\,\mathcal{L}_{\text{ft}} + \alpha\, D_{\mathrm{KL}}(\hat{y} \,\|\, P)$, where $\hat{y}$ are soft labels from the pre-trained model.
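The combined fine-tuning objective above can be evaluated for a single example with the sketch below. The distributions and the alpha value are invented for illustration (the summary notes that alpha is not reported), and the function names cross_entropy, kl_divergence, and fine_tune_loss are hypothetical.

```python
import math

def cross_entropy(target, probs, eps=1e-12):
    """-sum(y * log P) for one example; eps guards against log(0)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same items."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def fine_tune_loss(hard_label, soft_label, probs, alpha):
    """(1 - alpha) * L_ft + alpha * D_KL(y_hat || P)."""
    return (1 - alpha) * cross_entropy(hard_label, probs) + alpha * kl_divergence(soft_label, probs)

hard_label = [1.0, 0.0, 0.0]   # original single positive item
soft_label = [0.6, 0.3, 0.1]   # soft labels from the pre-trained model (made up)
probs = [0.7, 0.2, 0.1]        # current model predictions (made up)

loss = fine_tune_loss(hard_label, soft_label, probs, alpha=0.3)  # alpha chosen arbitrarily
print(loss)
```

With alpha set to 0 the objective reduces to plain cross-entropy on the real labels; larger values pull the predictions toward the pre-trained model's soft labels.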

Key Hyperparameters:

relevance_threshold: 3.5
alpha (smoothing weight): value not explicitly reported in the paper (appears only as a variable in Eq. 4)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard Training (KGSF/KBRD/UniCRS): The proposed method adds an LLM-based data augmentation phase that recovers false negatives, followed by a two-stage training procedure designed to handle the new labels.
vs. Wei et al. (2024) [Graph-based FNS mitigation]: The proposed method first prioritizes semantic relevance without collaborative information to avoid bias, whereas Wei et al. use both from the start. Collaborative information is reintegrated later through the second training stage.

Limitations

Reliance on LLMs for data augmentation introduces dependency on the quality of the LLM and potential hallucinations.
The two-stage process increases training complexity and time compared to single-stage training.
Effectiveness depends on the quality of semantic descriptions available for items.
Optimal threshold for relevance scoring (3.5) may be dataset-dependent.

Reproducibility

Code: https://github.com/xu1110/FNSCRS

The paper details the use of GPT-4 to generate training data for the relevance scorer and of Gemma2-9b as the scorer itself. Hyperparameters for the base recommenders (KGSF, KBRD, UniCRS) generally follow their original implementations.

📊 Experiments & Results

Evaluation Setup

Evaluated on recommendation performance within multi-turn dialogues.

Benchmarks:

ReDial (conversational movie recommendation)
INSPIRED (conversational movie recommendation)
iEvaLM (user simulator for CRS evaluation)

Metrics:

Recall@1
Recall@10
Recall@50
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on ReDial dataset showing consistent improvements across different backbone models.
ReDial	Recall@50	0.148	0.176	+0.028
ReDial	Recall@50	0.158	0.183	+0.025
ReDial	Recall@50	0.178	0.186	+0.008
Performance on INSPIRED dataset.
INSPIRED	Recall@50	0.082	0.103	+0.021
INSPIRED	Recall@50	0.187	0.198	+0.011
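For readers who prefer relative gains, the absolute differences in the table can be converted with a few lines of arithmetic. The numbers below are copied from the rows above; relative_gain is a hypothetical helper name, and the rows are listed in table order because the table does not name the backbone for each row.

```python
# (benchmark, baseline Recall@50, augmented Recall@50), in table order.
results = [
    ("ReDial", 0.148, 0.176),
    ("ReDial", 0.158, 0.183),
    ("ReDial", 0.178, 0.186),
    ("INSPIRED", 0.082, 0.103),
    ("INSPIRED", 0.187, 0.198),
]

def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline

# Relative gains in percent, rounded to one decimal place.
gains = [round(100 * relative_gain(base, ours), 1) for _, base, ours in results]

for (benchmark, base, ours), gain in zip(results, gains):
    print(f"{benchmark}: {base:.3f} -> {ours:.3f} (+{gain}% relative)")
```

On these rows the relative gains work out to roughly 18.9%, 15.8%, and 4.5% on ReDial and 25.6% and 5.9% on INSPIRED.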

Experiment Figures

A conceptual illustration of the False Negative Issue in CRS. It shows a user asking for 'silly cop movies' and the system only labeling one movie as positive, treating other valid candidates as negative.

Main Takeaways

The proposed data augmentation method consistently improves recommendation performance (Recall) across multiple CRS backbones (KGSF, KBRD, UniCRS).
Two-stage training successfully balances the benefits of semantic augmentation (handling false negatives) with the necessity of collaborative information.
The method is robust and effective on both human-annotated datasets (ReDial, INSPIRED) and simulator-based evaluation (iEvaLM).
Retrieving semantically relevant items first without collaborative bias helps cover a wider range of items (long-tail) before re-introducing popularity signals.

📚 Prerequisite Knowledge

Prerequisites

Conversational Recommender Systems (CRS)
Large Language Models (LLMs) for retrieval/scoring
Collaborative Filtering principles
False Negative Sampling issues

Key Terms

CRS: Conversational Recommender Systems, which elicit user preferences through multi-turn dialogue to make recommendations

False Negative Samples (FNS): Items that a user would actually like but are treated as negative examples during training because they were not explicitly interacted with

Collaborative Information: Patterns of user behavior and item popularity derived from interaction history (e.g., 'users who bought X also bought Y'), distinct from semantic content

Semantic Relevance: The conceptual similarity between a user's request (text) and an item's description (text), independent of popularity

Label Smoothing: A regularization technique where hard targets (0 or 1) are replaced with softer probabilities to prevent overfitting and aid generalization

Chain-of-Thought: A prompting strategy where the LLM is asked to generate intermediate reasoning steps before producing a final answer

Recall@K: A metric measuring the proportion of relevant items found in the top-K recommendations
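As a worked illustration of the Recall@K definition, the sketch below computes the metric for individual ranked lists and averages it over test cases. The rankings and item names are invented, and recall_at_k and mean_recall_at_k are hypothetical helper names; the sketch assumes each ranked list contains no duplicate items.

```python
def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant)

def mean_recall_at_k(examples, k):
    """Average Recall@K over (ranked_items, relevant_items) pairs."""
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in examples) / len(examples)

# Two toy test cases, each with a single ground-truth item.
examples = [
    (["a", "b", "c"], ["b"]),
    (["d", "e", "f"], ["f"]),
]

print(mean_recall_at_k(examples, k=1))
print(mean_recall_at_k(examples, k=2))
print(mean_recall_at_k(examples, k=3))
```

When every test case has exactly one ground-truth item, as in these toy cases, the metric reduces to the share of cases whose target item appears in the top K.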