Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs

Wei Zhou, Jun Zhou, Haoyu Wang, Zhenghao Li, Qikang He, Shaokun Han, Guoliang Li, Xuanhe Zhou, Yeye He, Chunwei Liu, Zirui Tang, Bin Wang, Shen Tang, Kai Zuo, Yuyu Luo, Zhenzhe Zheng, Conghui He, Jingren Zhou, Fan Wu
Shanghai Jiao Tong University, Tsinghua University, Microsoft Research, Shanghai AI Laboratory, Xiaohongshu Inc., The Hong Kong University of Science and Technology
arXiv (2026)
Agent RAG Benchmark

📝 Paper Summary

LLM-based Data Management Data Preparation Agents
This survey characterizes the paradigm shift in data preparation from static rule-based pipelines to flexible, LLM-enhanced agentic workflows that leverage semantic reasoning for cleaning, integration, and enrichment.
Core Problem
Traditional data preparation relies on rigid rules and domain-specific models that require high manual effort, lack semantic understanding for ambiguous cases, and fail to generalize across diverse modalities.
Why it matters:
  • Data inefficiencies (inconsistency, isolation, and semantic limitations) are estimated to cost enterprises 20–30% of their revenue.
  • Existing rule-based systems (e.g., regex) struggle with semantic ambiguities like synonyms or domain jargon.
  • Task-specific ML models require expensive labeled data and feature engineering, limiting scalability to new domains.
Concrete Example: In entity matching, traditional similarity metrics fail to link 'IBM' with 'International Business Machines' due to syntactic differences. An LLM-enhanced approach uses semantic knowledge to resolve this abbreviation without explicit hard-coded rules.
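The gap in this example can be illustrated with a minimal Python sketch. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for a traditional similarity metric. The `llm_match` helper, its prompt wording, and the `fake_llm` stub are hypothetical and only mark where a real model call would go.

```python
from difflib import SequenceMatcher
from typing import Callable


def syntactic_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], standing in for a traditional metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def llm_match(a: str, b: str, ask_llm: Callable[[str], str]) -> bool:
    """Ask a language model whether two names denote the same entity.

    `ask_llm` is a hypothetical callable (prompt in, text out); swap in a real client.
    """
    prompt = (
        "Do these two names refer to the same real-world entity? "
        "Answer yes or no.\n"
        f"A: {a}\nB: {b}"
    )
    return ask_llm(prompt).strip().lower().startswith("yes")


def fake_llm(prompt: str) -> str:
    """Stub for illustration only: hard-codes the answer a capable model would likely give."""
    both_present = "IBM" in prompt and "International Business Machines" in prompt
    return "yes" if both_present else "no"


score = syntactic_similarity("IBM", "International Business Machines")
is_match = llm_match("IBM", "International Business Machines", fake_llm)
print(f"syntactic similarity: {score:.2f}, LLM verdict: {is_match}")
```

The string metric scores the pair far below any plausible match threshold because the two names share almost no characters, while the decision that they match has to come from knowledge outside the strings themselves.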
Key Novelty
Taxonomy of LLM-Enhanced Data Preparation
  • Categorizes the field into three core tasks: Cleaning (removing errors), Integration (combining sources), and Enrichment (adding insights), highlighting the shift to agentic workflows.
  • Identifies a transition from manual 'detect-then-correct' rules to instruction-driven, context-aware agents that autonomously plan and reflect on data transformations.
  • Emphasizes hybrid methodologies where LLMs generate executable code (e.g., Python/SQL) or transfer reasoning to smaller models to balance cost and performance.
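The code-generation variant of this hybrid idea can be sketched as follows: the model is consulted once to write a transformation, the result is checked against a few hand-labeled examples, and the validated function is then applied to every row without further model calls. This is an illustrative sketch, not a method taken from the survey. The `GENERATED_CODE` string stands in for a model response, and the `normalize_phone` name and the sample data are hypothetical.

```python
import re  # exposed to the generated code through its namespace

# Stand-in for text returned by an LLM asked to "normalize US phone numbers".
GENERATED_CODE = r'''
def normalize_phone(raw):
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
'''


def load_generated(code: str, func_name: str):
    """Execute generated source and return the named function.

    Generated code is untrusted; a real system would run this in a sandbox.
    """
    namespace = {"re": re}
    exec(code, namespace)
    return namespace[func_name]


def passes_examples(func, examples) -> bool:
    """Check the generated function against hand-labeled input/output pairs."""
    return all(func(raw) == expected for raw, expected in examples)


examples = [
    ("(555) 123-4567", "555-123-4567"),
    ("+1 555 123 4567", "555-123-4567"),
    ("12345", None),
]

normalize = load_generated(GENERATED_CODE, "normalize_phone")
if not passes_examples(normalize, examples):
    raise ValueError("generated code failed validation; regenerate or fall back")

# One model call, many rows: the per-row cost is now plain Python.
rows = ["555.123.4567", "1-555-123-4567", "n/a"]
cleaned = [normalize(r) for r in rows]
print(cleaned)
```

The validation step matters because the cost saving only holds if the generated function is trusted for the whole dataset. A failed check is the natural point to regenerate the code or to route the affected rows back to the model.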
Evaluation Highlights
  • Reviews hundreds of recent works across data cleaning, integration, and enrichment tasks.
  • Identifies a major shift toward hybrid LLM-ML systems to reduce inference costs while maintaining semantic reasoning capabilities.
  • Highlights that while agentic workflows are emerging, few robust implementations currently exist in practice.
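One plausible shape for the hybrid LLM-ML systems mentioned in the list above is a confidence-based cascade: a cheap scorer decides the pairs it is sure about, and only the uncertain ones are escalated to an LLM. The sketch below is an assumption about how such a cascade might look, not a design described in the survey. The thresholds, the `cheap_score` function, and the `expensive_llm` stub are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, Tuple

llm_calls = 0  # counts escalations so the cost saving is visible


def cheap_score(a: str, b: str) -> float:
    """Inexpensive stand-in for a small ML matcher: string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def expensive_llm(a: str, b: str) -> bool:
    """Stub for an LLM judgment (illustration only; always answers 'match')."""
    global llm_calls
    llm_calls += 1
    return True


def cascade_match(
    a: str,
    b: str,
    ask_llm: Callable[[str, str], bool],
    low: float = 0.3,
    high: float = 0.9,
) -> Tuple[bool, str]:
    """Return (is_match, route), where route records which stage decided."""
    score = cheap_score(a, b)
    if score >= high:
        return True, "cheap"   # confident match, no model call
    if score <= low:
        return False, "cheap"  # confident non-match, no model call
    return ask_llm(a, b), "llm"  # uncertain band: escalate


pairs = [
    ("Apple Inc.", "Apple Inc"),
    ("Intl Business Machines", "International Business Machines"),
    ("Sony", "Dell"),
]
routes = [cascade_match(a, b, expensive_llm)[1] for a, b in pairs]
print(routes, "LLM calls:", llm_calls)
```

The lower threshold trades recall for cost. With plain string similarity as the cheap stage, an abbreviation such as 'IBM' would be rejected before the LLM ever sees it, so a real cascade would need a stronger cheap model or a more conservative rejection threshold.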
Breakthrough Assessment
7/10
Provides a comprehensive and structured taxonomy of a rapidly evolving field, clearly articulating the shift from rule-based to agentic systems, though it is a survey rather than a new method proposal.