Evaluation Setup
Leave-one-dataset-out cross-validation across 11 datasets: fine-tune on 10 datasets, test on the 1 held-out, unseen target.
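The leave-one-dataset-out protocol above can be sketched as follows (the dataset names and the split helper are illustrative placeholders, not the paper's code):

```python
# Leave-one-dataset-out cross-validation sketch.
# Dataset names are placeholders for the 11 entity-matching benchmarks.
datasets = [f"dataset_{i}" for i in range(11)]

def leave_one_out_splits(datasets):
    """Yield (train_datasets, held_out_target) pairs, one per dataset."""
    for target in datasets:
        train = [d for d in datasets if d != target]
        yield train, target

splits = list(leave_one_out_splits(datasets))
assert len(splits) == 11                          # one fold per dataset
assert all(len(train) == 10 for train, _ in splits)  # fine-tune on the other 10
```

Each fold fine-tunes on the 10 training datasets and evaluates on the held-out target, so every reported score is on a dataset the model never saw during fine-tuning.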
Benchmarks:
- Magellan / WDC / DeepMatcher datasets (entity matching on structured data)
Metrics:
- F1 Score (Macro-averaged)
- Throughput (tokens/second)
- Cost ($ per 1K tokens)
- Statistical methodology: mean and standard deviation reported over 5 random seeds
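Macro-averaged F1 treats the match and non-match classes equally by averaging the per-class F1 scores. A minimal pure-Python sketch (equivalent to scikit-learn's `f1_score(..., average="macro")`):

```python
def f1_per_class(y_true, y_pred, cls):
    """F1 score for a single class, computed from TP/FP/FN counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, c) for c in classes) / len(classes)

# Toy example: 1 = match, 0 = non-match.
print(macro_f1([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]))  # -> 0.8
```

Macro averaging matters for entity matching because non-matches typically far outnumber matches; a plain accuracy or micro-averaged score would be dominated by the majority class.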
Key Results
*Comparative performance of fine-tuned SLMs vs. prompted LLMs in the cross-dataset setting.*

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average across 11 datasets | Mean F1 | 87.4 | 87.5 | +0.1 |
| Average across 11 datasets | Mean F1 | 66.0 | 72.9 | +6.9 |
| Inference on 4xA100 | Tokens/sec | 1079 | 862001 | +860922 |
| Deployment Cost | $ per 1K tokens | 0.015 | 0.0000031 | -0.0149969 |
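The Δ column for the throughput and cost rows is simple arithmetic on the reported figures; a quick sanity check:

```python
# Reported figures from the results table.
baseline_tps, slm_tps = 1079, 862001        # tokens/sec: prompted LLM vs. fine-tuned SLM
baseline_cost, slm_cost = 0.015, 0.0000031  # $ per 1K tokens

throughput_gain = slm_tps - baseline_tps    # +860922 tokens/sec
cost_saving = baseline_cost - slm_cost      # $0.0149969 saved per 1K tokens
speedup = slm_tps / baseline_tps            # roughly 800x higher throughput

print(throughput_gain, round(cost_saving, 7), round(speedup))
```

Framed as ratios rather than differences, the fine-tuned SLM is roughly 800x faster and nearly 5000x cheaper per token than the prompted-LLM baseline.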
Main Takeaways
- Fine-tuned small models (SLMs) like LLaMA-1B are highly competitive, matching GPT-4 performance in cross-dataset settings.
- Data-centric approaches (AnyMatch) outperform model-centric architectural modifications (Unicorn/Ditto).
- Prompting with demonstrations (few-shot) from other datasets often hurts performance for smaller LLMs compared to zero-shot, likely due to distribution shifts.
- Overlapping domains in transfer datasets (e.g., matching on Restaurants A after training on Restaurants B) did not yield a statistically significant improvement over non-overlapping domains.