LLM-I2I: Boost Your Small Item2Item Recommendation Model with Large Language Model

📝 Paper Summary

Data-centric Recommendation, Generative Data Augmentation, Large Language Model Enhanced Recommender Systems (LLM-ERS)

LLM-I2I enhances lightweight Item-to-Item recommendation models by using an LLM-based generator to synthesize long-tail interactions and an LLM-based discriminator to filter noisy data.

Core Problem

Traditional I2I models struggle with data sparsity for long-tail items and are sensitive to noise in user interaction logs, leading to inaccurate similarity calculations.

Why it matters:

Long-tail items (e.g., 20% of Amazon Toys items have <5 purchases) are rarely recommended due to insufficient interaction history, hurting revenue and discovery
Raw click data contains accidental clicks and noise; training on unfiltered synthetic data often degrades performance due to distribution drift and hallucinations
Deploying massive deep models is often too computationally expensive for real-time industrial retrieval, making enhancement of existing lightweight I2I models highly desirable
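To make the long-tail notion concrete, the sketch below counts interactions per item and flags those under a cut-off. The cut-off of 5 mirrors the Amazon Toys example above; the function name, the (user, item) tuple format, and the toy data are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def long_tail_items(interactions: Iterable[Tuple[str, str]], min_count: int = 5) -> Set[str]:
    """Return items with fewer than `min_count` interactions.

    Items that never appear in `interactions` are not visible here and
    would need to be added from a catalog.
    """
    counts = Counter(item for _, item in interactions)
    return {item for item, n in counts.items() if n < min_count}


# Hypothetical log: one popular item (6 purchases) and one niche item (2 purchases).
interactions = [(f"u{k}", "popular_robot") for k in range(6)]
interactions += [("u0", "niche_puzzle"), ("u1", "niche_puzzle")]

tail = long_tail_items(interactions)
all_items = {item for _, item in interactions}
tail_share = len(tail) / len(all_items)  # fraction of observed items that are long-tail
```

On a real dataset, `tail_share` is the kind of statistic quoted in the first bullet above.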

Concrete Example: A user clicks a niche toy by accident. A standard I2I model treats this as a strong signal, recommending similar irrelevant toys. Simply adding LLM-synthesized data might generate 'hallucinated' interactions that never happened. LLM-I2I generates potential interests (solving sparsity) but then uses a discriminator to reject the accidental click or bad synthetic pair (solving noise).

Key Novelty

LLM-Enhanced Data Augmentation Pipeline (Generate + Discriminate)

Couples a generative LLM (fine-tuned to predict next items) with a discriminative LLM (fine-tuned to verify user-item compatibility)
Uses the generator to create synthetic interactions specifically for long-tail items, effectively filling in the sparse interaction matrix
Uses the discriminator as a gatekeeper to filter out both noisy real clicks and low-quality synthetic hallucinations before training the downstream I2I model

Architecture

The overall LLM-I2I framework pipeline, showing the flow from original data to the LLM Generator, then to the LLM Discriminator, and finally fusing data for the I2I model.

Evaluation Highlights

+6.02% Recall Number (RN) and +1.22% GMV in online A/B testing on AliExpress (large-scale e-commerce platform)
Significantly outperforms baselines on Amazon datasets: +19.97% Recall@10 improvement for the Swing algorithm on Toys and Games
Boosts long-tail item recommendations: +93.88% Recall@10 for BPR on sparse data, demonstrating effectiveness where data is scarce

Breakthrough Assessment

7/10

Solid industrial application of LLMs for data augmentation. The 'generate-then-discriminate' pattern is robust, and the reported online gains on a major platform like AliExpress are significant.

⚙️ Technical Details

Problem Definition

Setting: Item-to-Item (I2I) Recommendation via Data Augmentation

Inputs: User historical behavior sequence Y_u = {item_1, ..., item_t}

Outputs: Augmented dataset containing high-quality synthetic user-item pairs for training downstream I2I models

Pipeline Flow

Data Generator (LLM synthesizes potential user-item interactions)
Data Discriminator (LLM filters generated & real data)
Downstream I2I Training (Train standard lightweight model on refined data)
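The three stages above can be sketched as a plain data flow. This is a minimal illustration, not the paper's code: `toy_generator` and `toy_discriminator` are hypothetical rule-based stand-ins for the two fine-tuned LLMs, and the 0.5 threshold and the small catalog are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (user_id, item_id)


def augment(
    real_pairs: List[Pair],
    histories: Dict[str, List[str]],
    generate: Callable[[List[str]], List[str]],
    score: Callable[[List[str], str], float],
    threshold: float = 0.5,  # assumed cut-off, not a value from the paper
) -> List[Pair]:
    """Generate synthetic pairs, then keep only pairs the discriminator accepts."""
    synthetic = [(u, item) for u, hist in histories.items() for item in generate(hist)]
    candidates = list(dict.fromkeys(real_pairs + synthetic))  # dedupe, keep order
    return [(u, i) for u, i in candidates if score(histories.get(u, []), i) >= threshold]


# Hypothetical catalog used by the stand-in models below.
CATEGORY = {
    "lego_set": "bricks",
    "brick_plate": "bricks",
    "rare_minifig": "bricks",
    "garden_hose": "garden",
}


def toy_generator(history: List[str]) -> List[str]:
    # Propose unseen items sharing a category with something in the history.
    seen = {CATEGORY[i] for i in history}
    return [i for i, c in CATEGORY.items() if c in seen and i not in history]


def toy_discriminator(history: List[str], item: str) -> float:
    # Confidence = fraction of the other history items in the same category.
    others = [h for h in history if h != item]
    if not others:
        return 0.0
    return sum(CATEGORY[h] == CATEGORY[item] for h in others) / len(others)


histories = {"u1": ["lego_set", "brick_plate", "garden_hose"]}
real_pairs = [("u1", i) for i in histories["u1"]]
refined = augment(real_pairs, histories, toy_generator, toy_discriminator)
```

In this toy run the out-of-pattern real click (`garden_hose`) is dropped and the synthetic long-tail pair (`rare_minifig`) is kept; `refined` is what would be handed to the downstream I2I trainer.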

System Modules

LLM-based Data Generator

Generate synthetic user-item interactions to enrich sparse data, specifically targeting long-tail items

Model or implementation: Llama-2-7B-Chat (fine-tuned)

LLM-based Data Discriminator

Evaluate the confidence/quality of user-item pairs to remove noise

Model or implementation: Llama-2-7B-Chat (fine-tuned)

Downstream I2I Model

Learn item similarities using the augmented dataset for real-time serving

Model or implementation: Various backbones (Swing, YoutubeDNN, BPR, BM25)
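As an illustration of what a lightweight backbone computes, here is a sketch of a Swing-style similarity. A commonly cited form of the score sums, over every pair of users (u, v) who both interacted with items i and j, the term 1 / (smoothing + |I_u ∩ I_v|), so user pairs with little else in common contribute more. The exact variant and smoothing value used in the paper are not given in this summary, so both are assumptions here, as is the data layout.

```python
from itertools import combinations
from typing import Dict, Set


def swing_similarity(
    user_items: Dict[str, Set[str]],
    i: str,
    j: str,
    smoothing: float = 1.0,  # assumed value; tuned in practice
) -> float:
    """Sum 1 / (smoothing + |I_u & I_v|) over user pairs that share both i and j."""
    common_users = sorted(u for u, items in user_items.items() if i in items and j in items)
    score = 0.0
    for u, v in combinations(common_users, 2):
        overlap = len(user_items[u] & user_items[v])
        score += 1.0 / (smoothing + overlap)
    return score


# Hypothetical interaction sets.
user_items = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c", "d"},
    "u3": {"a", "c", "d"},
}
sim_ab = swing_similarity(user_items, "a", "b")  # pair (u1, u2), overlap of 2 items
sim_ac = swing_similarity(user_items, "a", "c")  # pair (u2, u3), overlap of 3 items
```

Because the score depends only on co-occurrence counts, extra synthetic (user, item) pairs for a sparse item directly add user pairs to the sum, which is why augmentation can help this kind of model.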

Novel Architectural Elements

Long-tail aware loss weighting in the generator to force focus on sparse items
Two-stage pipeline where a discriminator strictly filters the generator's output before it touches the downstream model

Modeling

Base Model: Llama-2-7B-Chat

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Encourage generator to predict long-tail items more often.

Formally: $\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} w_t \log P(\text{item}_t \mid \text{context}_t)$, where $w_t = \alpha$ if $\text{item}_t$ is a long-tail item and $w_t = \beta$ otherwise.
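The weighted objective above can be checked numerically with a few lines of Python. This is a minimal sketch under simplifying assumptions: each target position is treated as one item with a single predicted probability (a real LLM loss is computed over tokens), and the function name and inputs are hypothetical. The default weights are the alpha and beta values listed under Key Hyperparameters.

```python
import math
from typing import Sequence


def long_tail_weighted_nll(
    target_probs: Sequence[float],
    is_long_tail: Sequence[bool],
    alpha: float = 4.0,  # long-tail weight
    beta: float = 1.0,   # short-tail weight
) -> float:
    """Average weighted negative log-likelihood over T target positions."""
    if len(target_probs) != len(is_long_tail):
        raise ValueError("inputs must have the same length")
    total = 0.0
    for p, tail in zip(target_probs, is_long_tail):
        weight = alpha if tail else beta
        total += weight * math.log(p)
    return -total / len(target_probs)


probs = [0.5, 0.5]  # model assigns probability 0.5 to each true next item
loss_head = long_tail_weighted_nll(probs, [False, False])  # both targets are popular
loss_tail = long_tail_weighted_nll(probs, [True, False])   # first target is long-tail
```

With identical predictions, the loss is 2.5 times larger when one of the two targets is long-tail, so gradient updates are pulled toward getting sparse items right.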

Training Data:

Amazon Review Dataset (Beauty, Sports, Toys)
AliExpress Dataset (AEDS) - 10B interactions

Key Hyperparameters:

alpha (long-tail weight): 4.0
beta (short-tail weight): 1.0
batch_size: 16
learning_rate: 5e-5
max_input_length: 1024

Compute: NVIDIA A100 GPU

Comparison to Prior Work

vs. LLM-CF: LLM-CF focuses on generating samples via CoT but lacks a dedicated discriminator to filter hallucinated/noisy synthetic data; LLM-I2I adds this filtering step.
vs. Pure Generative Augmentation: LLM-I2I explicitly weights the generation loss to favor long-tail items, preventing the LLM from only reinforcing popular biases.
vs. Heuristic Augmentation (e.g., random masking): LLM-I2I uses semantic understanding of items/users rather than random noise [not cited in paper].

Limitations

Relies on expensive LLM inference for data generation/discrimination offline (though online serving is lightweight)
Effectiveness depends heavily on the quality of the base LLM's world knowledge regarding specific e-commerce items
Discriminator training requires negative sampling which may introduce bias if not carefully managed

Reproducibility

Code availability is not provided in the paper. The method relies on proprietary industrial data (AEDS) for the main results, though public Amazon datasets are also used. Hyperparameters for the loss function (alpha=4.0) are explicitly stated.

📊 Experiments & Results

Evaluation Setup

Offline evaluation on public/industrial datasets + Online A/B testing

Benchmarks:

Amazon Review Dataset (ARD) (Sequential Recommendation / I2I Retrieval)
AliExpress Dataset (AEDS) (Large-scale E-commerce Recommendation)

Metrics:

Recall@K
NDCG@K
Gross Merchandise Value (GMV) [Online]
Recall Number (RN) [Online]
Statistical methodology: Not explicitly reported in the paper
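The two offline metrics listed above can be sketched as follows. This assumes binary relevance and the common definitions (recall divided by the number of relevant items, DCG with a log2 discount); the paper's exact evaluation protocol may differ in details such as the recall denominator.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """DCG of the top-k list divided by the DCG of an ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["a", "b", "c", "d"]  # hypothetical recommendation list
relevant = {"b", "e"}          # hypothetical held-out items
recall = recall_at_k(ranked, relevant, k=3)  # one of two relevant items is retrieved
ndcg = ndcg_at_k(ranked, relevant, k=3)      # the hit sits at rank 2, not rank 1
```

Recall@K ignores where in the top K a hit lands, while NDCG@K rewards hits nearer the top, which is why the two are reported together.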

Key Results

LLM-I2I consistently improves the performance of various backbone models (Swing, YoutubeDNN, BPR, BM25) on the Amazon Toys and Games dataset.
Benchmark	Metric	Baseline	This Paper	Δ
Amazon Toys & Games	Recall@10	0.0616	0.0739	+0.0123
Amazon Toys & Games	Recall@10	0.0679	0.0739	+0.0060
Amazon Toys & Games	Recall@10	0.0103	0.0210	+0.0107
Online A/B testing on AliExpress shows real-world business impact.
Benchmark	Metric	Baseline	This Paper	Δ
AliExpress Live Traffic	Recall Number (RN)	100.00	106.02	+6.02
AliExpress Live Traffic	GMV	100.00	101.22	+1.22
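The Δ column reports absolute differences, while the Evaluation Highlights quote relative gains. The short sketch below converts one to the other using the rows above; the helper name is hypothetical, and the rows are listed without backbone labels because the table does not give them.

```python
def relative_gain_pct(baseline: float, improved: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (improved - baseline) / baseline * 100.0


# (baseline, this paper) pairs copied from the Recall@10 rows.
offline_rows = [(0.0616, 0.0739), (0.0679, 0.0739), (0.0103, 0.0210)]
offline_gains = [round(relative_gain_pct(b, v), 2) for b, v in offline_rows]

# Online rows are indexed to a baseline of 100, so Δ already reads as a percentage.
online_rn_gain = round(relative_gain_pct(100.00, 106.02), 2)
```

The first row works out to about 19.97%, which matches the Recall@10 gain quoted for Swing in the highlights.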

Experiment Figures

Impact of synthetic data confidence levels and quantity on model performance (Swing algorithm).

Main Takeaways

Discriminator is critical: Performance drops if synthetic data is used without filtering, confirming that LLM hallucinations hurt model training.
Long-tail efficacy: The method provides disproportionately large gains in sparse-data scenarios (e.g., +93% to +144% recall on long-tail items), which supports the weighted loss function.
Universal enhancement: LLM-I2I boosts widely different algorithmic backbones (Graph-based Swing, Deep YoutubeDNN, Matrix Factorization BPR) without modifying their architecture.

📚 Prerequisite Knowledge

Prerequisites

Collaborative Filtering (Item-to-Item algorithms)
Supervised Fine-Tuning (SFT) of LLMs
Long-tail distribution in recommender systems

Key Terms

I2I: Item-to-Item recommendation—algorithms that recommend items similar to those a user has interacted with (e.g., 'People who bought X also bought Y')

Swing: An industrial I2I algorithm that calculates item similarity based on the structure of user-item bipartite graphs, specifically focusing on co-occurrence substructures (like 'swing' patterns)

long-tail items: Products with very few historical interactions (purchases/clicks), making them difficult for algorithms to recommend accurately

GMV: Gross Merchandise Value—total value of merchandise sold over a given period

SFT: Supervised Fine-Tuning—retraining a pre-trained Large Language Model on a specific labeled dataset to adapt it to a new task

Recall@K: A metric measuring the proportion of relevant items found in the top K recommendations

NDCG@K: Normalized Discounted Cumulative Gain—a metric measuring ranking quality, giving higher scores to relevant items appearing higher in the list

RN: Recall Number—the total number of unique items successfully retrieved/recommended by the system

BPR: Bayesian Personalized Ranking—a matrix factorization objective that optimizes the relative order of items (preferred > non-preferred) rather than predicting raw ratings
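For a single (user, preferred item, non-preferred item) triple, the BPR objective described above reduces to -log σ(score_pos - score_neg), where σ is the logistic sigmoid. The sketch below computes that term in a numerically stable way; it omits the regularization term, and the function name and score inputs are assumptions for illustration.

```python
import math


def bpr_loss(pos_score: float, neg_score: float) -> float:
    """-log(sigmoid(pos_score - neg_score)), computed without overflow."""
    diff = pos_score - neg_score
    if diff > 0:
        # log(1 + exp(-diff)) is safe when diff is positive.
        return math.log1p(math.exp(-diff))
    # Equivalent form for diff <= 0: -diff + log(1 + exp(diff)).
    return -diff + math.log1p(math.exp(diff))


loss_tied = bpr_loss(1.0, 1.0)     # model cannot tell the items apart
loss_correct = bpr_loss(3.0, 1.0)  # preferred item scored higher
loss_wrong = bpr_loss(1.0, 3.0)    # preferred item scored lower
```

The loss depends only on the score gap, so it pushes the preferred item above the non-preferred one without caring about the absolute score values.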