LLM-Enhanced Reranking for Complementary Product Recommendation

📝 Paper Summary

Complementary Product Recommendation, LLM-based Reranking

A model-agnostic framework that uses multi-agent LLM prompting to rerank complementary product candidates retrieved by graph neural networks, first enhancing diversity and then refining for accuracy.

Core Problem

Graph Neural Networks (GNNs) used for recommendation often favor popular, highly connected items, failing to capture the semantic nuance required for diverse, long-tail complementary products.

Why it matters:

Recommending complementary items (e.g., lens for a camera) drives significant e-commerce value but requires understanding functional relationships beyond simple co-occurrence.
Existing GNN approaches struggle with the accuracy-diversity tradeoff, often recommending repetitive or obvious items while missing novel but relevant complements.
Prior LLM integration methods typically require expensive retraining of the base recommender to incorporate augmented data.

Concrete Example: For the query 'iPhone', a standard model might surface popular 'iPhone' substitutes or generic accessories. The proposed system uses an LLM to reason explicitly about function, for instance that an 'iPhone Case' is an accessory for an 'iPhone', or that 'Speaker Cables' complement 'Speaker Stands', and ranks such semantically relevant but potentially less connected items higher.

Key Novelty

Two-Stage Multi-Agent LLM Reranking

Decomposes the reranking task into two sequential LLM agents: a 'Diversity Agent' that prioritizes different product genres from the candidate list, and an 'Accuracy Agent' that filters the result for strict relevance.
Utilizes a model-agnostic 'retrieve-then-rerank' pipeline where the LLM operates solely on textual metadata (titles) of candidates retrieved by any base GNN, avoiding model retraining.

Evaluation Highlights

Achieves a ~22% relative improvement in Hit@1 on the Cell Phones dataset using SComGNN as the base retriever (1.087 → 1.326).
Improves the vocabulary-size diversity metric by roughly 9% on Cell Phones with the GraphSAGE base (19.5 → 21.2) while simultaneously boosting accuracy.
Consistently improves NDCG@1 across all four datasets (Cell Phones, Electronics, Grocery, Home) when applied to GraphSAGE, GAT, and SComGNN baselines.

Breakthrough Assessment

6/10

Effective application of LLMs to the specific problem of complementary recommendation with a clean, model-agnostic design. While the architectural novelty is moderate (prompt-based reranking), the demonstrated balance of accuracy and diversity is valuable.

⚙️ Technical Details

Problem Definition

Setting: Link prediction in a complementary product graph G = {V, X, E}

Inputs: A query item v_i and a set of candidate items retrieved by a base model

Outputs: A reranked list of candidate items ordered by complementary relevance

Pipeline Flow

Functional Group: Retrieval -> Base Recommendation Model (GNN)
Functional Group: Reranking -> Diversity Agent (LLM)
Functional Group: Reranking -> Accuracy Agent (LLM)
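
The flow above can be sketched as a small function that chains a retriever and two rerankers. This is a minimal illustration, not the authors' implementation: the function name `retrieve_then_rerank`, the toy agents, and the final guard against titles the retriever never returned are all assumptions made for the example. The default cut-offs (50 and 25) are taken from the hyperparameters listed in this summary; passing the head of the diversified list to the Accuracy Agent is an assumed reading of those values.

```python
from typing import Callable, List, Sequence

# A reranker maps (query title, candidate titles) to a reordered list of titles.
Reranker = Callable[[str, Sequence[str]], List[str]]


def retrieve_then_rerank(
    query_title: str,
    retrieve: Callable[[str, int], List[str]],
    diversity_agent: Reranker,
    accuracy_agent: Reranker,
    diversity_k: int = 50,  # candidates shown to the Diversity Agent
    accuracy_k: int = 25,   # candidates shown to the Accuracy Agent
) -> List[str]:
    """Run base retrieval, then the two reranking stages in sequence."""
    candidates = retrieve(query_title, diversity_k)
    diversified = diversity_agent(query_title, candidates)
    # Assumption: the Accuracy Agent only sees the head of the diversified list.
    refined = accuracy_agent(query_title, diversified[:accuracy_k])
    # Guard (assumption): drop titles the retriever never returned, and duplicates.
    allowed, seen, result = set(candidates), set(), []
    for title in refined:
        if title in allowed and title not in seen:
            seen.add(title)
            result.append(title)
    return result


# --- Toy stand-ins for the GNN retriever and the two LLM agents (hypothetical) ---
def toy_retrieve(query: str, k: int) -> List[str]:
    return ["Phone Case A", "Phone Case B", "Screen Protector", "Charger"][:k]


def toy_diversity_agent(query: str, titles: Sequence[str]) -> List[str]:
    # Put the first title of each "type" (first word) ahead of repeats.
    seen_types, head, tail = set(), [], []
    for title in titles:
        kind = title.split()[0]
        if kind in seen_types:
            tail.append(title)
        else:
            seen_types.add(kind)
            head.append(title)
    return head + tail


def toy_accuracy_agent(query: str, titles: Sequence[str]) -> List[str]:
    # Simulates an LLM that invents an item; the guard above removes it.
    return ["Hallucinated Item"] + list(titles)


ranked = retrieve_then_rerank(
    "iPhone", toy_retrieve, toy_diversity_agent, toy_accuracy_agent,
    diversity_k=4, accuracy_k=3,
)
print(ranked)
```

Because both agents are plain callables, any base retriever or LLM client can be swapped in without touching the pipeline function, which mirrors the model-agnostic design described in this summary.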

System Modules

Base Recommendation Model

Retrieve initial candidate list of complementary products based on graph embeddings

Model or implementation: GraphSAGE / GAT / SComGNN

Diversity Agent (Reranking)

Rerank candidates to prioritize items with different 'genres' or types

Model or implementation: Llama3.3-70B (accessed via prompting)

Accuracy Agent (Reranking)

Refine the diversified list to ensure high precision and relevance

Model or implementation: Llama3.3-70B (accessed via prompting)
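
Since both agents are accessed purely through prompting on candidate titles, each call reduces to building a prompt and parsing the model's answer back into an ordering. The sketch below illustrates that round trip under stated assumptions: the prompt wording, the numbered-list layout, and the helper names `build_rerank_prompt` and `parse_ranking` are hypothetical and are not the paper's templates. Appending unmentioned candidates at the end is also a design choice made here, not something the source specifies.

```python
import re
from typing import List, Sequence


def build_rerank_prompt(
    query_title: str, candidate_titles: Sequence[str], instruction: str
) -> str:
    """Format the query and numbered candidate titles into a single prompt."""
    lines = [instruction, f"Query product: {query_title}", "Candidates:"]
    lines += [f"{i}. {title}" for i, title in enumerate(candidate_titles, start=1)]
    lines.append("Answer with candidate numbers only, best first, separated by commas.")
    return "\n".join(lines)


def parse_ranking(response: str, num_candidates: int) -> List[int]:
    """Turn a free-text answer into zero-based candidate indices.

    Out-of-range and repeated numbers are ignored; candidates the model
    did not mention are appended in their original order.
    """
    order: List[int] = []
    for token in re.findall(r"\d+", response):
        idx = int(token) - 1
        if 0 <= idx < num_candidates and idx not in order:
            order.append(idx)
    order += [i for i in range(num_candidates) if i not in order]
    return order


prompt = build_rerank_prompt(
    "iPhone",
    ["iPhone Case", "Speaker Cables", "Screen Protector", "Charger"],
    "Rank the candidates so that different product types appear near the top.",
)
order = parse_ranking("3, 1, 7, 3", num_candidates=4)
print(prompt)
print(order)
```

A filtering agent could instead drop unmentioned candidates; whichever policy is used, validating indices against the candidate count keeps malformed model output from corrupting the ranked list.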

Novel Architectural Elements

Sequential two-agent reranking pipeline (Diversity Agent -> Accuracy Agent) designed to explicitly manage the accuracy-diversity tradeoff without retraining the base model.

Modeling

Base Model: Llama3.3-70B (for agents) / GraphSAGE, GAT, SComGNN (for retrieval)

Training Method: Inference-only prompting (Zero-shot / Few-shot)

Key Hyperparameters:

diversity_agent_input_k: 50
accuracy_agent_input_k: 25

Compute: Not reported in the paper

Comparison to Prior Work

vs. Lyu et al. (2023) and Wei et al. (2024): The proposed method is model-agnostic and does not require retraining the GNN, whereas prior works require retraining with augmented data.
vs. Hou et al. (2024): Specifically adapts reranking for the complementary product domain using a dual-agent (Diversity + Accuracy) approach rather than a single generic reranker.
vs. Simple GNNs: Adds a semantic reasoning layer (LLM) that captures relationships (e.g., accessory vs. substitute) missed by structural embeddings.

Limitations

Relies on the quality of the initial candidate generation; if the GNN retriever misses relevant items entirely, the reranker cannot recover them.
Inference latency is likely higher than for pure GNN inference because of the added LLM calls, although the paper does not quantify latency.
The Accuracy Agent tends to give back part of the diversity gained in the previous step, so the pipeline manages a tradeoff rather than delivering a pure 'free lunch'.
No statistical significance tests reported.

Reproducibility

Code: https://anonymous.4open.science/r/llm_rerank-4B01/README.md

The code is publicly available at the anonymized repository linked above. The paper provides the full prompt templates for both the Diversity and Accuracy agents, and the Amazon product datasets are public.

📊 Experiments & Results

Evaluation Setup

Link prediction on Amazon product graphs (Electronics, Cell Phones, Grocery, Home).

Benchmarks:

Amazon Electronics (Complementary Product Recommendation)
Amazon Cell Phones (Complementary Product Recommendation)
Amazon Grocery (Complementary Product Recommendation)
Amazon Home (Complementary Product Recommendation)

Metrics:

Hit@K (Accuracy)
NDCG@K (Ranking Quality)
Entropy (Diversity of token distribution)
Vocabulary Size (Diversity)
Statistical methodology: Not explicitly reported in the paper
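
For reference, the four metrics can be computed as in the sketch below. It assumes binary relevance for Hit@K and NDCG@K, and lowercase whitespace tokenization of the recommended titles for the two diversity metrics; the paper's exact tokenization and averaging scheme are not given in this summary, so treat those details as assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def title_tokens(titles: Sequence[str]) -> List[str]:
    # Assumed tokenization: lowercase, split on whitespace.
    return [tok for title in titles for tok in title.lower().split()]


def vocabulary_size(titles: Sequence[str]) -> int:
    """Number of distinct tokens across the recommended titles."""
    return len(set(title_tokens(titles)))


def token_entropy(titles: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the token distribution over the titles."""
    counts = Counter(title_tokens(titles))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


ranked_items = ["a", "b", "c"]
relevant_items = {"b"}
print(hit_at_k(ranked_items, relevant_items, 2))
print(ndcg_at_k(ranked_items, relevant_items, 2))
print(vocabulary_size(["Phone Case", "phone charger"]))
print(token_entropy(["a b", "c d"]))
```

Under this binary-relevance definition, NDCG@1 coincides with Hit@1 for any query, since both equal 1 exactly when the top-ranked item is relevant.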

Key Results

Results on the Amazon Cell Phones dataset, showing improvement from the proposed Div.+Acc. pipeline over baselines:

Benchmark	Metric	Baseline	This Paper	Δ
Amazon Cell Phones	Hit@1	1.154	1.351	+0.197
Amazon Cell Phones	Hit@1	1.087	1.326	+0.239
Amazon Cell Phones	NDCG@1	1.087	1.326	+0.239
Amazon Cell Phones	Vocabulary Size	19.5	20.8	+1.3

Results on the Amazon Home dataset, demonstrating gains in ranking quality:

Benchmark	Metric	Baseline	This Paper	Δ
Amazon Home	Hit@1	3.383	3.704	+0.321
Amazon Home	NDCG@1	3.354	3.564	+0.210
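
The Δ column reports absolute differences, while the headline figures in this summary are relative lifts. The short sketch below converts between the two using the rows listed above; the helper name `relative_lift` is introduced here purely for illustration.

```python
def relative_lift(baseline: float, new: float) -> float:
    """Relative change from baseline to new, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# (benchmark, metric, baseline, this paper), copied from the results rows above.
rows = [
    ("Amazon Cell Phones", "Hit@1", 1.154, 1.351),
    ("Amazon Cell Phones", "Hit@1", 1.087, 1.326),
    ("Amazon Cell Phones", "NDCG@1", 1.087, 1.326),
    ("Amazon Cell Phones", "Vocabulary Size", 19.5, 20.8),
    ("Amazon Home", "Hit@1", 3.383, 3.704),
    ("Amazon Home", "NDCG@1", 3.354, 3.564),
]

for benchmark, metric, baseline, new in rows:
    print(
        f"{benchmark} {metric}: {new - baseline:+.3f} absolute, "
        f"{relative_lift(baseline, new):+.1f}% relative"
    )
```

For example, the 1.087 → 1.326 row works out to roughly a 22% relative lift, matching the Hit@1 highlight reported for SComGNN.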

Experiment Figures

Lift percentage in accuracy and diversity metrics by dataset and method.

Main Takeaways

The Diversity Agent alone improves both accuracy and diversity metrics at lower K values (K=1), suggesting that diversifying top recommendations also helps retrieve relevant items missed by the GNN.
The Accuracy Agent further boosts precision metrics (Hit Rate, NDCG) but consistently reduces diversity metrics (Entropy, Vocab) compared to the Diversity Agent's output, confirming the accuracy-diversity tradeoff.
The method is robust across different underlying GNN architectures (GraphSAGE, GAT, SComGNN), delivering consistent gains without retraining the base models.

📚 Prerequisite Knowledge

Prerequisites

Graph Neural Networks (GNNs) for recommendation
Information Retrieval metrics (NDCG, Hit Rate)
Large Language Models (LLMs) and prompting strategies

Key Terms

GNN: Graph Neural Network—a deep learning model that processes data represented as graphs

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items

Complementary Product: An item used together with another to enhance value (e.g., camera and lens), distinct from a substitute

Zero-shot/Few-shot prompting: Providing an LLM with zero or a few examples of a task to guide its performance without updating its weights

GraphSAGE: A GNN framework that generates embeddings by sampling and aggregating features from a node's local neighborhood

GAT: Graph Attention Network—a GNN that uses attention mechanisms to weigh the importance of neighboring nodes

SComGNN: Spectral-based Complementary Graph Neural Network—a specialized GNN for complementary product recommendation