Evaluation Setup
Leave-one-dataset-out cross-validation across 11 datasets: fine-tune on 10 datasets, test on the 1 held-out, unseen target.
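The leave-one-dataset-out protocol above can be sketched as follows (the dataset names and the split helper are illustrative placeholders, not the paper's code):

```python
# Leave-one-dataset-out cross-validation sketch.
# Dataset names are placeholders for the 11 entity-matching benchmarks.
datasets = [f"dataset_{i}" for i in range(11)]

def leave_one_out_splits(datasets):
    """Yield (train_datasets, held_out_target) pairs, one per dataset."""
    for target in datasets:
        train = [d for d in datasets if d != target]
        yield train, target

splits = list(leave_one_out_splits(datasets))
assert len(splits) == 11                          # one fold per dataset
assert all(len(train) == 10 for train, _ in splits)  # fine-tune on the other 10
```

Each fold fine-tunes on the 10 training datasets and evaluates on the held-out target, so every reported score is on a dataset the model never saw during fine-tuning.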
Benchmarks:
- Magellan / WDC / DeepMatcher datasets (entity matching on structured data)
Metrics:
- F1 Score (Macro-averaged)
- Throughput (tokens/second)
- Cost ($ per 1K tokens)
- Statistical methodology: mean and standard deviation reported over 5 random seeds
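Macro-averaged F1 treats the match and non-match classes equally by averaging the per-class F1 scores. A minimal pure-Python sketch (equivalent to scikit-learn's `f1_score(..., average="macro")`):

```python
def f1_per_class(y_true, y_pred, cls):
    """F1 score for a single class, computed from TP/FP/FN counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, c) for c in classes) / len(classes)

# Toy example: 1 = match, 0 = non-match.
print(macro_f1([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]))  # -> 0.8
```

Macro averaging matters for entity matching because non-matches typically far outnumber matches; a plain accuracy or micro-averaged score would be dominated by the majority class.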
Key Results
*Comparative performance of fine-tuned SLMs vs. prompted LLMs in the cross-dataset setting.*

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average across 11 datasets | Mean F1 | 87.4 | 87.5 | +0.1 |
| Average across 11 datasets | Mean F1 | 66.0 | 72.9 | +6.9 |
| Inference on 4xA100 | Tokens/sec | 1079 | 862001 | +860922 |
| Deployment Cost | $ per 1K tokens | 0.015 | 0.0000031 | -0.0149969 |
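The Δ column for the throughput and cost rows is simple arithmetic on the reported figures; a quick sanity check:

```python
# Reported figures from the results table.
baseline_tps, slm_tps = 1079, 862001        # tokens/sec: prompted LLM vs. fine-tuned SLM
baseline_cost, slm_cost = 0.015, 0.0000031  # $ per 1K tokens

throughput_gain = slm_tps - baseline_tps    # +860922 tokens/sec
cost_saving = baseline_cost - slm_cost      # $0.0149969 saved per 1K tokens
speedup = slm_tps / baseline_tps            # roughly 800x higher throughput

print(throughput_gain, round(cost_saving, 7), round(speedup))
```

Framed as ratios rather than differences, the fine-tuned SLM is roughly 800x faster and nearly 5000x cheaper per token than the prompted-LLM baseline.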
Main Takeaways
- Fine-tuned small models (SLMs) like LLaMA-1B are highly competitive, matching GPT-4 performance in cross-dataset settings.
- Data-centric approaches (AnyMatch) outperform model-centric architectural modifications (Unicorn/Ditto).
- Prompting with demonstrations (few-shot) from other datasets often hurts performance for smaller LLMs compared to zero-shot, likely due to distribution shifts.
- Overlapping domains in transfer datasets (e.g., matching on Restaurants A after training on Restaurants B) did not yield a statistically significant improvement over non-overlapping domains.