Evaluation on Entity Matching in Recommender Systems

📝 Paper Summary

Recommender Systems, Entity Matching, Conversational Recommender Systems (CRS)

Reddit-Amazon-EM is a manually annotated benchmark linking unstructured movie mentions from Reddit conversations to structured Amazon catalog entries, demonstrating that graph-based and LLM-driven methods significantly outperform traditional lexical matching.

Core Problem

Conversational Recommender Systems (CRS) struggle to link ambiguous, informal item mentions in user queries (like Reddit posts) to structured catalog entities (like Amazon products) due to a lack of rigorous cross-dataset evaluation benchmarks.

Why it matters:

In-the-wild user queries lack structured metadata, hindering the development of knowledge-aware recommender systems that need accurate grounding
Current CRS studies use ad-hoc matching (Fuzzy, BM25) without rigorous evaluation, leading to unreliable recommendations
There is no consensus on which Entity Matching (EM) methods work best for linking diverse data formats like social media discussions to product catalogs

Concrete Example: A Reddit user mentions 'Prisoners (2013)'. A simple text matcher might incorrectly link this to 'PRISONER' or 'Prison (Collector’s Edition)', whereas the correct Amazon entries are specific formats like 'Prisoners [DVD] (2013)' or 'Prisoners (Blu-ray+DVD)'.

Key Novelty

Reddit-Amazon-EM Benchmark

Constructs a gold-standard dataset by manually annotating matches between informal Reddit movie titles and structured Amazon product entries, filtering out metadata mismatches like wrong years
Systematically evaluates five classes of Entity Matching methods (lexical, vector, hybrid, graph-based, LLM-based) on this specific cross-platform linking task

Architecture

The data construction pipeline for Reddit-Amazon-EM, illustrating the flow from raw data to annotated gold set

Evaluation Highlights

Graph-based GNEM achieves state-of-the-art performance with 96.29% F1, significantly outperforming traditional BM25 (78.43% F1) and Faiss (60.51% Precision at best F1 threshold)
LLM-based ComEM follows closely with 94.02% F1, showing that semantic understanding beats lexical matching but still lags behind graph-based structural matching in precision
In downstream CRS tasks, GNEM maintains the lead with 7.84% Recall@5 when used with GPT-3.5, validating the benchmark's relevance to real-world recommendation scenarios

Breakthrough Assessment

8/10

Provides a much-needed rigorous benchmark for a critical but neglected component of CRS. The manual annotation of >4k items is a significant resource contribution.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of candidate mention-entry pairs

Inputs: A set of entity mentions M from Reddit and a structured knowledge base K (Amazon products)

Outputs: A binary decision for each pair (m_i, e_j) indicating if they refer to the same real-world entity
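
One compact way to write this setting, assuming a pair-scoring function and a decision threshold. The symbols s and \hat{y} are notation introduced here for illustration, not taken from the paper:

```latex
\hat{y}_{ij} = \mathbb{1}\!\left[\, s(m_i, e_j) \ge \tau \,\right],
\qquad m_i \in M,\; e_j \in K
```

Here \hat{y}_{ij} = 1 means the mention m_i and the catalog entry e_j are predicted to refer to the same real-world entity.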

Pipeline Flow

Candidate Retrieval (Blocking)
Pair Scoring
Classification (Thresholding)
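
The three stages above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: token-overlap blocking and a difflib character-similarity scorer are stand-ins chosen here, and the function names, the catalog, and the default tau are all hypothetical.

```python
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def retrieve_candidates(mention, catalog, k=3):
    """Blocking: keep up to k catalog entries sharing the most tokens with the mention."""
    mention_tokens = tokenize(mention)
    overlaps = [(len(mention_tokens & tokenize(entry)), entry) for entry in catalog]
    overlaps = [pair for pair in overlaps if pair[0] > 0]
    overlaps.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps catalog order on ties
    return [entry for _, entry in overlaps[:k]]


def score_pair(mention, entry):
    """Pair scoring: character-level similarity in [0, 1] (a stand-in scorer)."""
    return SequenceMatcher(None, mention.lower(), entry.lower()).ratio()


def match(mention, catalog, tau=0.5, k=3):
    """Classification: return (entry, score, is_match) for each retrieved candidate."""
    results = []
    for entry in retrieve_candidates(mention, catalog, k):
        score = score_pair(mention, entry)
        results.append((entry, score, score >= tau))
    return results


# Toy catalog built from the titles in the concrete example (plus one unrelated entry).
catalog = [
    "Prisoners [DVD] (2013)",
    "Prisoners (Blu-ray+DVD)",
    "PRISONER",
    "Prison (Collector's Edition)",
    "Inception (2010)",
]
for entry, score, is_match in match("Prisoners (2013)", catalog):
    print(f"{entry!r}: score={score:.2f}, match={is_match}")
```

Blocking keeps the scorer from being run against the whole catalog; each baseline in the benchmark replaces the stand-in retriever and scorer with its own components.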

System Modules

Candidate Retrieval

Select a subset of potential Amazon matches for each Reddit mention to reduce computational cost

Model or implementation: Varies by baseline (e.g., BM25, Faiss, or LLM-based retrieval)

Scoring Function

Compute a similarity score between a mention and a candidate entity

Model or implementation: Varies (e.g., GCN for GNEM, BERT+Fuzzy for Hybrid, GPT for ComEM)

Thresholding

Convert similarity scores into binary match/no-match decisions

Model or implementation: Threshold τ
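
One common way to choose τ is to sweep candidate values on held-out pairs and keep the one with the highest F1. The sketch below assumes that procedure; the summary does not say how the paper tunes its threshold, and the function names and toy scores are hypothetical.

```python
def precision_recall_f1(labels, predictions):
    """Compute precision, recall, and F1 for binary labels (1 = match)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_f1_threshold(scores, labels):
    """Try each distinct score as tau (predict match when score >= tau); keep the best F1."""
    best_tau, best_f1 = None, -1.0
    for tau in sorted(set(scores)):
        predictions = [1 if s >= tau else 0 for s in scores]
        _, _, f1 = precision_recall_f1(labels, predictions)
        if f1 > best_f1:  # strict comparison: the lowest tau wins ties
            best_tau, best_f1 = tau, f1
    return best_tau, best_f1


# Toy validation scores and gold labels (illustrative values only).
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
tau, f1 = best_f1_threshold(scores, labels)
print(f"tau={tau}, F1={f1:.3f}")
```

On this toy input the sweep selects τ = 0.4: it accepts one false positive in exchange for recovering every true match.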

Modeling

Base Model: Varies by baseline (BERT for embeddings, GPT-3.5/4 for ComEM/Annotation)

Training Method: Supervised learning for GNEM and Hybrid methods using the annotated dataset

Objective Functions:

Purpose: Binary classification of entity pairs.

Formally: Standard binary cross-entropy or margin-based ranking losses depending on the specific baseline method (details implicit in baseline references).
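
For reference, the standard binary cross-entropy over N labelled pairs is shown below. The notation is introduced here (y_n is the gold label and p_n the predicted match probability for pair n); which baselines use this loss rather than a ranking loss is not specified in this summary.

```latex
\mathcal{L}_{\mathrm{BCE}}
  = -\frac{1}{N} \sum_{n=1}^{N}
    \left[\, y_n \log p_n + (1 - y_n) \log\left(1 - p_n\right) \right]
```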

Adaptation: Fine-tuning of specific EM models (GNEM, Hybrid) on the Reddit-Amazon-EM training split

Trainable Parameters: Varies (GNN weights for GNEM, Linear layers for Hybrid)

Training Data:

Training: 30,124 pairs
Validation: 7,532 pairs
Test: 9,414 pairs
Negatives: 1:10 positive-to-negative ratio using hard negatives (rejected candidates) and random negatives

Key Hyperparameters:

embedding_model: all-MiniLM-L6-v2 (for Faiss baseline)
negative_multiplier: 9 (for training set construction)
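
The training-set construction can be sketched as follows. This assumes that hard negatives (rejected candidates) are used first and random catalog entries fill the remainder; the actual mixing policy is not stated in this summary, and the function name, argument layout, and toy data are hypothetical.

```python
import random


def build_training_pairs(positives, rejected, catalog, negative_multiplier=9, seed=0):
    """Return (mention, entry, label) triples with sampled negatives.

    positives: dict mapping a mention to the set of matching catalog entries
    rejected:  dict mapping a mention to candidates rejected in annotation (hard negatives)
    catalog:   list of all catalog entries (pool for random negatives)
    """
    rng = random.Random(seed)
    pairs = []
    for mention, gold_entries in positives.items():
        for entry in sorted(gold_entries):
            pairs.append((mention, entry, 1))
        wanted = negative_multiplier * len(gold_entries)
        # Hard negatives first (assumed policy), never including a gold entry.
        negatives = [e for e in rejected.get(mention, []) if e not in gold_entries][:wanted]
        # Top up with random entries that are neither gold nor already chosen.
        pool = [e for e in catalog if e not in gold_entries and e not in negatives]
        rng.shuffle(pool)
        negatives.extend(pool[: wanted - len(negatives)])
        pairs.extend((mention, entry, 0) for entry in negatives)
    return pairs


# Toy data: one gold match, two rejected candidates, twenty unrelated titles.
positives = {"Prisoners (2013)": {"Prisoners [DVD] (2013)"}}
rejected = {"Prisoners (2013)": ["PRISONER", "Prison (Collector's Edition)"]}
catalog = ["Prisoners [DVD] (2013)", "PRISONER", "Prison (Collector's Edition)"]
catalog += [f"Movie {i}" for i in range(20)]

pairs = build_training_pairs(positives, rejected, catalog)
print(len(pairs), "pairs,", sum(label for _, _, label in pairs), "positive")
```

Hard negatives force the model to separate near-duplicate titles, while random negatives keep it calibrated on clearly unrelated entries.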

Compute: Training and inference were performed on A6000 GPUs and EPYC 7702 CPUs. GNEM requires substantial training time but offers efficient inference.

Comparison to Prior Work

vs. Traditional EM (BM25/Fuzzy): Reddit-Amazon-EM evaluates on cross-platform informal-to-formal text, showing traditional methods fail at semantic nuances
vs. Existing CRS Datasets (ReDial, etc.): Existing datasets rely on crowd-workers role-playing; this work links naturally occurring in-the-wild Reddit conversations to a catalog
vs. Magellan [not cited in paper]: Focuses specifically on the recommender system domain with movie titles, rather than general tabular data matching

Limitations

Annotation process is resource-intensive and may not scale easily to other domains without further human effort
Evaluation focuses primarily on movie domain (Reddit Movies -> Amazon Movies)
Smaller LLMs (e.g., Qwen3-4b) degrade significantly in downstream CRS tasks, so the findings may apply only to high-end models

Reproducibility

Code: https://github.com/huang-zihan/Reddit-Amazon-Entity-Matching

Publicly available: Dataset (Reddit-Amazon-EM), annotated gold set, and evaluation code.
Not provided: Weights for trained baseline models (though code to retrain is implied).
Dependencies: Requires OpenAI API for reproducing LLM-based baselines (ComEM, CRS case study).

📊 Experiments & Results

Evaluation Setup

Cross-dataset entity matching between Reddit conversation mentions and Amazon product catalog

Benchmarks:

Reddit-Amazon-EM (Entity Matching / Linkage) [New]
Downstream CRS with LLMs (Conversational Recommendation)

Metrics:

F1
Accuracy
Recall@k
Precision@k
Statistical methodology: Not explicitly reported in the paper
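
The ranking metrics can be computed as in the sketch below, which uses the usual definitions (hits in the top k divided by the number of relevant items for Recall@k, and by k for Precision@k). It assumes the ranked list contains no duplicates; the function names and toy data are hypothetical.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the ranked list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k positions occupied by relevant items."""
    if k <= 0:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


# Toy ranking: "z" is relevant but never retrieved.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "z"}
print(recall_at_k(ranked, relevant, 5), precision_at_k(ranked, relevant, 5))
```

With three relevant items and two of them in the top five, Recall@5 is 2/3 and Precision@5 is 2/5.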

Key Results

Benchmark results on the Reddit-Amazon-EM dataset showing the superiority of graph-based and LLM-based methods over traditional baselines.
Benchmark	Metric	Baseline	Best (GNEM)	Δ
Reddit-Amazon-EM	F1	78.43	96.29	+17.86
Reddit-Amazon-EM	F1	72.29	96.29	+24.00
Reddit-Amazon-EM	F1	86.68	96.29	+9.61
Reddit-Amazon-EM	F1	94.02	96.29	+2.27
Downstream CRS evaluation measuring how well different EM methods retrieve ground-truth movies mentioned in LLM-generated recommendations.
Benchmark	Metric	Baseline	Best (GNEM)	Δ
LLM-based CRS (GPT-3.5 responses)	Recall@5	7.22	7.84	+0.62

Main Takeaways

Methods that exploit structural or semantic signals (GNEM, ComEM) consistently outperform standalone lexical or vector retrieval (BM25, Faiss)
Graph-based GNEM achieves the best balance of precision and recall, effectively distinguishing between similar product variations (e.g., DVD vs Blu-ray)
Faiss (vector retrieval) achieves high recall but low precision (60.51%), often retrieving semantically related but incorrect items
Performance gaps narrow in the downstream conversational task, suggesting that conversational noise and LLM hallucinations act as a bottleneck, dampening the benefits of superior EM methods

📚 Prerequisite Knowledge

Prerequisites

Entity Matching / Record Linkage concepts
Basics of Conversational Recommender Systems
Familiarity with Retrieval metrics (Recall, Precision, F1)

Key Terms

CRS: Conversational Recommender Systems—systems that recommend items through interactive dialogue rather than static lists

Entity Matching (EM): The task of identifying records that refer to the same real-world entity across different data sources (also known as Record Linkage)

Blocking: A preliminary step in entity matching to reduce the search space by selecting a candidate set of potential matches for detailed scoring

GNEM: Graph Neural Entity Matching—a method that uses graph neural networks to capture structural and semantic relationships between records

ComEM: An LLM-based entity matching framework that uses Large Language Models for candidate retrieval and selection

Recall@k: The proportion of relevant items found in the top-k retrieved results

BM25: A probabilistic retrieval function used to rank documents based on the query terms appearing in each document
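
To make the BM25 definition concrete, here is a minimal from-scratch scorer. It is a sketch assuming the common Okapi form with the non-negative idf variant (+1 inside the logarithm) and conventional defaults k1 = 1.5 and b = 0.75; the paper's BM25 configuration is not given in this summary, and the tokenizer and toy catalog are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    docs = [tokenize(d) for d in documents]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = tokenize(query)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


catalog = [
    "Prisoners [DVD] (2013)",
    "Prisoners (Blu-ray+DVD)",
    "PRISONER",
    "Prison (Collector's Edition)",
]
scores = bm25_scores("Prisoners (2013)", catalog)
for title, score in zip(catalog, scores):
    print(f"{score:.3f}  {title}")
```

Because BM25 matches exact tokens, "PRISONER" scores zero against the query token "prisoners", while both true formats score above zero; purely lexical scoring can therefore miss near-variants as well as confuse similar titles.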