VoteGCL: Enhancing Graph-based Recommendations with Majority-Voting LLM-Rerank Augmentation

📝 Paper Summary

Graph-based Recommendation Systems
LLM-based Data Augmentation

VoteGCL augments graph recommendation data by using Large Language Models to repeatedly rerank candidate items and aggregating results via majority voting to ensure high-confidence synthetic interactions.

Core Problem

Graph-based recommendation systems suffer from data sparsity and popularity bias, while existing LLM-based augmentation methods produce inconsistent results due to stochastic generation and misaligned embeddings.

Why it matters:

Data sparsity limits the effectiveness of collaborative filtering, leading to poor recommendations for users with few interactions (cold start)
Directly using LLM-generated embeddings often causes distributional shifts that degrade performance when integrated with collaborative signals
Existing LLM augmentation is unstable; single inference runs yield fluctuating results (e.g., varying NDCG scores on Netflix) due to the probabilistic nature of LLMs

Concrete Example: When prompting an LLM to predict user preferences, one run might rank 'Inception' high while another ranks it low due to randomness. Existing methods using a single pass introduce noise. VoteGCL runs the ranking N times; if 'Inception' appears at the top in most runs, it is reliably added as a synthetic edge, reducing noise.

Key Novelty

VoteGCL: Majority-Voting LLM-Rerank Augmentation

Reformulates augmentation as a reranking task where an LLM orders candidate items multiple times, aggregating results via a simplified Reciprocal Rank Fusion (RRF) to filter out stochastic noise
Integrates these high-confidence synthetic interactions into a Graph Contrastive Learning framework, aligning the original and augmented graph views to mitigate popularity bias without needing complex embedding alignment

Architecture

The overall VoteGCL framework, illustrating the two-stage process: Data Augmentation via Majority-Vote Reranking and Graph Contrastive Learning.

Evaluation Highlights

Outperforms state-of-the-art baselines (e.g., LightGCN, SimGCL) on the Netflix dataset, with a 5.79% relative improvement in NDCG@20 (0.0863 to 0.0913)
Reduces popularity bias significantly, lowering popularity consumption by ~40% on the Amazon Book dataset compared to LightGCN
Remains robust to LLM sampling noise: performance improves as the number of voting rounds (N) grows from 1 to 10 and then plateaus, consistent with the theoretical concentration-of-measure guarantees

Breakthrough Assessment

7/10

A solid methodological improvement that addresses the specific instability of LLM generation in recommender systems. The theoretical grounding via concentration of measure adds rigor, though the core components (LLM reranking and contrastive learning) are established techniques combined in a novel way.

⚙️ Technical Details

Problem Definition

Setting: Graph-based recommendation on a bipartite user-item graph G=(V, E)

Inputs: User-item interaction graph, item textual descriptions, user history

Outputs: Recommended items for users based on learned node embeddings

Pipeline Flow

Candidate Retrieval (LightGCN)
LLM Reranking (Multiple Independent Runs)
Majority Vote Aggregation (RRF)
Graph Augmentation
Contrastive Learning Training
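
The first four stages above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `noisy_rerank` is a hypothetical stand-in that shuffles the shortlist instead of calling an LLM, the candidate shortlists are hard-coded rather than retrieved by LightGCN, adding exactly one synthetic edge per user is an assumption, and the contrastive training stage is omitted.

```python
import random
from collections import defaultdict


def noisy_rerank(candidates, rng):
    """Hypothetical stand-in for one LLM reranking call (random order)."""
    ranking = list(candidates)
    rng.shuffle(ranking)
    return ranking


def augment_graph(edges, candidates_per_user, n_rounds=10, seed=0):
    """Add one voted synthetic edge per user that has a candidate shortlist."""
    rng = random.Random(seed)
    augmented = set(edges)
    for user, candidates in candidates_per_user.items():
        scores = defaultdict(float)
        for _ in range(n_rounds):  # N independent reranking runs
            ranking = noisy_rerank(candidates, rng)
            for rank, item in enumerate(ranking, start=1):
                scores[item] += 1.0 / rank  # reciprocal-rank vote
        # Highest fused score wins; sorting first makes ties deterministic.
        best = max(sorted(scores), key=scores.get)
        augmented.add((user, best))
    return augmented


original_edges = {("u1", "i1"), ("u2", "i2"), ("u2", "i3")}
# Shortlists would come from the candidate retriever; hard-coded here.
shortlists = {"u1": ["i4", "i5", "i6"]}
augmented_edges = augment_graph(original_edges, shortlists, n_rounds=10)
```

The augmented edge set always contains the original edges, so the original and augmented graphs can serve as the two views for contrastive training.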

System Modules

Candidate Retriever (Retrieval & Selection)

Generate initial shortlist of K candidate items for low-degree users

Model or implementation: LightGCN

LLM Reranker (Retrieval & Selection)

Rerank candidate items based on user history and item metadata using few-shot prompting

Model or implementation: gpt-3.5-turbo

Vote Aggregator (Retrieval & Selection)

Aggregate N ranked lists into a final score using Reciprocal Rank Fusion

Model or implementation: Deterministic Algorithm (RRF)
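
The aggregation step can be illustrated with a short sketch. It assumes the "simplified" RRF scores each item by the plain sum of reciprocal ranks, with no smoothing constant; the function name `aggregate_rankings` and the `top_m` parameter are illustrative choices, not names from the paper.

```python
from collections import defaultdict


def aggregate_rankings(ranked_lists, top_m=1):
    """Fuse N ranked lists with a simplified Reciprocal Rank Fusion.

    Each item scores sum(1 / rank) over the lists in which it appears,
    with ranks starting at 1. Assumption: no smoothing constant is used.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / rank
    # Sort by descending score; break ties by item id for determinism.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ordered[:top_m]], dict(scores)


# Three reranking runs over the same shortlist (toy data).
runs = [
    ["inception", "memento", "tenet"],
    ["inception", "tenet", "memento"],
    ["memento", "inception", "tenet"],
]
top_items, fused = aggregate_rankings(runs, top_m=1)
# "inception" scores 1 + 1 + 1/2 = 2.5 and is selected.
```

Because the function is deterministic given its inputs, all randomness is confined to the reranking runs themselves.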

Graph Encoder

Learn node embeddings using contrastive learning between original and augmented graphs

Model or implementation: LightGCN (shared weights)

Novel Architectural Elements

Integration of majority-vote LLM reranking directly into the graph structure construction phase rather than embedding space
Use of concentration of measure theory to define the aggregation mechanism for synthetic edge generation

Modeling

Base Model: LightGCN (Graph Encoder), gpt-3.5-turbo (Augmentor)

Training Method: Graph Contrastive Learning with BPR Loss

Objective Functions:

Purpose: Optimize recommendation accuracy by ensuring observed interactions score higher than unobserved ones.

Formally: BPR Loss (Bayesian Personalized Ranking)

Purpose: Mitigate distributional shift and ensure robustness by maximizing similarity between original and augmented graph views.

Formally: InfoNCE Contrastive Loss
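
For reference, the two objectives can be written in their standard textbook forms. The notation here is an assumption and may differ from the paper's: $\hat{y}_{ui}$ is the predicted score for user $u$ and item $i$, $j$ is a sampled unobserved item, $z_v$ and $\tilde{z}_v$ are the embeddings of node $v$ from the original and augmented graph views, $\mathcal{B}$ is a training batch, and regularization terms are omitted. Combining the two terms with the weight $\lambda$ (the contrastive loss weight listed under the hyperparameters) is the usual joint-training setup.

```latex
\mathcal{L}_{\mathrm{BPR}}
  = -\sum_{(u,i,j)} \log \sigma\!\left(\hat{y}_{ui} - \hat{y}_{uj}\right)

\mathcal{L}_{\mathrm{CL}}
  = -\sum_{v \in \mathcal{B}}
    \log \frac{\exp\!\left(\mathrm{sim}(z_v, \tilde{z}_v)/\tau\right)}
              {\sum_{w \in \mathcal{B}} \exp\!\left(\mathrm{sim}(z_v, \tilde{z}_w)/\tau\right)}

\mathcal{L} = \mathcal{L}_{\mathrm{BPR}} + \lambda\,\mathcal{L}_{\mathrm{CL}}
```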

Training Data:

Augmentation applied only to low-degree users (bottom 25th percentile of interactions)
Datasets: Amazon Book, Amazon Scientific, MovieLens-1M, Netflix, Yelp2018
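
Selecting the users to augment can be sketched as follows. This assumes a nearest-rank percentile over user degrees and includes users tied at the threshold; the summary does not say how the paper computes the cutoff or handles ties, and `low_degree_users` is an illustrative name.

```python
import math
from collections import Counter


def low_degree_users(edges, percentile=25):
    """Return users whose degree is at or below the given percentile.

    Assumptions: nearest-rank percentile; ties at the threshold are kept.
    """
    degree = Counter(user for user, _ in edges)
    ordered = sorted(degree.values())
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    threshold = ordered[rank - 1]
    selected = {user for user, deg in degree.items() if deg <= threshold}
    return selected, threshold


# Toy graph: user "a" has 1 interaction, "b" has 2, "c" has 3, "d" has 4.
toy_edges = [
    ("a", 1),
    ("b", 1), ("b", 2),
    ("c", 1), ("c", 2), ("c", 3),
    ("d", 1), ("d", 2), ("d", 3), ("d", 4),
]
selected_users, cutoff = low_degree_users(toy_edges, percentile=25)
# With four users, the 25th-percentile cutoff is degree 1, so only "a" is selected.
```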

Key Hyperparameters:

contrastive_loss_weight_lambda: 0.2 (Netflix), 0.1 (others)
temperature_tau: 0.2
learning_rate: 0.001
batch_size: 2048
embedding_dimension: 64
voting_rounds_N: 10 (optimal)
candidates_K: 20

Compute: Experiments run on NVIDIA A100 GPU

Comparison to Prior Work

vs. LLMRec: VoteGCL uses majority voting to filter noise instead of single-pass augmentation, and augments graph structure rather than just embeddings
vs. SimGCL: Uses semantically meaningful augmentations (LLM-derived) instead of random noise perturbation
vs. TIGER [not cited in paper]: TIGER generates item IDs using quantization; VoteGCL keeps original IDs and augments edges via reranking

Limitations

Relies on external commercial LLM APIs (GPT-3.5), incurring cost and latency
Augmentation is limited to low-degree users (bottom 25%), potentially missing gains for medium-degree users
Inference time for majority voting scales linearly with N (number of voting rounds), increasing computational overhead during data preparation

Reproducibility

Prompt templates and theoretical proofs are provided in the appendix. No code URL is provided in the paper text.

📊 Experiments & Results

Evaluation Setup

Top-K Recommendation on sparse datasets

Benchmarks:

Netflix (Movie Recommendation)
Yelp2018 (Business Recommendation)
Amazon Book (Product Recommendation)
MovieLens-1M (Movie Recommendation)

Metrics:

Recall@20
NDCG@20
APLT@20 (Average Percentage of Long Tail items)
Statistical methodology: Not explicitly reported in the paper
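
As a rough illustration of the long-tail metric, APLT@K can be computed as the per-user share of top-K recommendations that fall in a long-tail item set, averaged over users. How the long-tail set is defined (for example, by a popularity cutoff) is not stated in this summary, so it is passed in as an assumed input.

```python
def aplt_at_k(recommendations, long_tail_items, k=20):
    """Average Percentage of Long Tail items in each user's top-k list.

    `long_tail_items` is an assumed, externally defined set of item ids.
    """
    per_user = []
    for items in recommendations.values():
        top_k = items[:k]
        if not top_k:
            continue  # skip users with no recommendations
        hits = sum(1 for item in top_k if item in long_tail_items)
        per_user.append(hits / len(top_k))
    return sum(per_user) / len(per_user) if per_user else 0.0


recs = {
    "u1": ["a", "b", "c", "d"],
    "u2": ["a", "e", "f", "g"],
}
tail = {"c", "d", "e"}
score = aplt_at_k(recs, tail, k=4)
# u1 has 2 of 4 long-tail items, u2 has 1 of 4, so the mean is 0.375.
```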

Key Results

VoteGCL consistently outperforms baselines on Netflix across accuracy metrics.

Benchmark	Metric	Baseline	This Paper	Δ
Netflix	Recall@20	0.1554	0.1652	+0.0098
Netflix	NDCG@20	0.0863	0.0913	+0.0050

Popularity bias mitigation results on Amazon Book.

Benchmark	Metric	Baseline	This Paper	Δ
Amazon Book	Popularity Consumption (Pop)	450	270	-180
Amazon Book	APLT@20	0.1764	0.2250	+0.0486
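
The absolute deltas in the table can be converted to relative changes with a few lines of arithmetic, which is how the headline percentages (5.79% NDCG@20 gain, roughly 40% lower popularity consumption) follow from the reported numbers. The helper name `relative_change` is illustrative.

```python
def relative_change(baseline, value):
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# (benchmark, metric, baseline, this paper), copied from the table.
results = [
    ("Netflix", "Recall@20", 0.1554, 0.1652),
    ("Netflix", "NDCG@20", 0.0863, 0.0913),
    ("Amazon Book", "Pop", 450, 270),
    ("Amazon Book", "APLT@20", 0.1764, 0.2250),
]

relative = {
    (bench, metric): relative_change(base, new)
    for bench, metric, base, new in results
}

for (bench, metric), pct in relative.items():
    print(f"{bench:12s} {metric:10s} {pct:+.2f}%")
```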

Experiment Figures

Fluctuation of NDCG@10 scores for LLMRec and Llama4Rec across 5 independent runs on Netflix.

Comparison of Average Popularity Consumption (Pop) and APLT@20 across different user groups (split by degree) on Amazon Book.

Main Takeaways

Majority voting significantly stabilizes LLM augmentation; performance improves as voting rounds (N) increase from 1 to 10, then plateaus.
The method is particularly effective for sparse datasets (like Netflix and Yelp), showing larger relative gains compared to dense ones.
VoteGCL successfully mitigates popularity bias, recommending more long-tail items compared to LightGCN and SimGCL, as evidenced by higher APLT scores.
The contrastive learning component is crucial; removing it (using only augmented graph for training) leads to worse performance due to distributional shift.

📚 Prerequisite Knowledge

Prerequisites

Graph Convolutional Networks (GCNs)
Contrastive Learning
Large Language Model (LLM) prompting
Collaborative Filtering

Key Terms

LightGCN: A simplified Graph Convolutional Network for recommendation that removes non-linearities and feature transformations to focus on neighborhood aggregation
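
The propagation rule that this definition describes can be written compactly. In the standard LightGCN formulation, $A$ is the adjacency matrix of the bipartite user-item graph, $D$ its diagonal degree matrix, $E^{(0)}$ the trainable ID embeddings, and $L$ the number of layers (the symbol $L$ is chosen here to avoid clashing with the candidate count $K$ used elsewhere in this summary). Uniform layer weights are the default choice in LightGCN; whether VoteGCL changes them is not stated here.

```latex
E^{(l+1)} = D^{-1/2} A \, D^{-1/2} \, E^{(l)},
\qquad
E = \frac{1}{L+1} \sum_{l=0}^{L} E^{(l)},
\qquad
\hat{y}_{ui} = e_u^{\top} e_i
```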

Graph Contrastive Learning (GCL): A self-supervised learning approach that maximizes agreement between differently augmented views of the same graph to learn robust representations

RRF: Reciprocal Rank Fusion—a method to combine multiple ranked lists by summing the reciprocal of the rank of each item

Concentration of Measure: A theoretical principle stating that aggregates of many independent random variables (such as scores averaged over repeated LLM rankings) cluster tightly around their expected value as the number of independent trials increases
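
One standard instance of this principle is Hoeffding's inequality, shown here only as an illustration; the paper's own bound is not reproduced in this summary and may take a different form. Assume an item's reciprocal-rank scores $s_1, \dots, s_N$ from $N$ independent reranking runs over $K$ candidates are i.i.d. with mean $\mu$, so each lies in $[1/K, 1]$. Then for any $t > 0$:

```latex
\Pr\!\left( \left| \frac{1}{N} \sum_{n=1}^{N} s_n - \mu \right| \ge t \right)
  \;\le\; 2 \exp\!\left( - \frac{2 N t^{2}}{\left(1 - 1/K\right)^{2}} \right)
```

The bound decays exponentially in $N$, which matches the intuition that a few extra voting rounds sharply reduce the chance of a misleading aggregate score.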

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items in the recommendation list
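
A small sketch of NDCG@K with binary relevance follows, since that is the common setup for top-K recommendation; whether the paper's evaluation uses exactly this variant is an assumption. The discount is $1/\log_2(\text{position} + 1)$ and the ideal ranking places all relevant items first.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(
        gain / math.log2(position + 1)
        for position, gain in enumerate(gains[:k], start=1)
    )


def ndcg_at_k(ranked_items, relevant_items, k=20):
    """NDCG@k with binary relevance (1 if the item is relevant, else 0)."""
    gains = [1.0 if item in relevant_items else 0.0 for item in ranked_items]
    ideal = [1.0] * min(len(relevant_items), k)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Relevant items at positions 1 and 3: DCG = 1 + 1/2, IDCG = 1 + 1/log2(3).
example = ndcg_at_k(["a", "b", "c"], {"a", "c"}, k=3)
```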

Popularity Bias: The tendency of recommendation models to favor popular items over less popular ones, often ignoring niche user interests

Data Sparsity: The condition where the user-item interaction matrix has very few observed entries relative to the total possible interactions