Evaluation Setup
The paper surveys methods from the existing literature across three primary tasks.
Benchmarks:
- Data Cleaning Benchmarks (Standardization, Error Repair)
- Entity Matching Benchmarks (Record Linkage)
- Schema Matching Benchmarks (Column Alignment)
Metrics:
- Accuracy (Precision/Recall/F1)
- Generalization capability
- Manual effort reduction
- Statistical methodology: Not explicitly reported in the paper
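As an illustration of the precision/recall/F1 metrics listed above, the following minimal sketch scores an entity-matching result against a gold standard. The function name `pair_metrics` and the record IDs are hypothetical and chosen only for this example; the paper does not prescribe a particular implementation.

```python
def pair_metrics(predicted, gold):
    """Return (precision, recall, f1) over sets of matched record-ID pairs."""
    # Treat pairs as unordered so (a, b) and (b, a) count as the same match.
    pred = {frozenset(p) for p in predicted}
    true = {frozenset(p) for p in gold}

    tp = len(pred & true)  # pairs that are both predicted and correct
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: 3 predicted matches, 4 true matches, 2 in common.
predicted = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]
gold = [("b1", "a1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")]
precision, recall, f1 = pair_metrics(predicted, gold)
```

Here precision is 2/3 (two of three predicted pairs are correct) and recall is 2/4 (two of four true pairs are found), so F1 is 4/7.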
Main Takeaways
- Shift toward cost-efficient hybrid methods: LLMs generate code or distill knowledge into small language models (SLMs) rather than processing every row directly.
- Reduced emphasis on fine-tuning: retrieval-augmented generation (RAG) and in-context learning are preferred over maintaining task-specific models.
- Cross-modal generalization: Modern methods use unified representations for text and tables, reducing modality-specific engineering.
- Agentic implementations are promising but remain at an early stage compared with prompt-based techniques.
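The first takeaway, where an LLM generates code once instead of processing every row, can be sketched as follows. This is a minimal illustration under stated assumptions, not a method from the paper: `fake_llm_generate_code` is a hypothetical stand-in that returns a hard-coded string where a real system would make a single LLM call, and the phone-number task is invented for the example.

```python
import re


def fake_llm_generate_code(task_description, sample_rows):
    """Hypothetical stand-in for ONE LLM call that returns Python source code."""
    # A real system would send the task and a few sample rows to a model here.
    return (
        "def standardize_phone(value):\n"
        "    digits = re.sub(r'\\D', '', value)\n"
        "    if len(digits) != 10:\n"
        "        return value\n"
        "    return f'{digits[:3]}-{digits[3:6]}-{digits[6:]}'\n"
    )


def build_cleaner(task_description, sample_rows):
    """Turn the generated source into a callable (assumed function name)."""
    source = fake_llm_generate_code(task_description, sample_rows)
    namespace = {"re": re}
    # Generated code should be reviewed or sandboxed before being executed.
    exec(source, namespace)
    return namespace["standardize_phone"]


rows = ["(555) 123-4567", "555.987.6543", "unknown"]
cleaner = build_cleaner("Normalize US phone numbers", rows[:2])

# The generated function runs locally on every row; no per-row model call.
cleaned = [cleaner(r) for r in rows]
```

The cost argument is that the model is invoked once per task rather than once per row, so the per-row work is ordinary code execution regardless of table size.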