Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs

📝 Paper Summary

LLM-based Data Management, Data Preparation Agents

This survey characterizes the paradigm shift in data preparation from static rule-based pipelines to flexible, LLM-enhanced agentic workflows that leverage semantic reasoning for cleaning, integration, and enrichment.

Core Problem

Traditional data preparation relies on rigid rules and domain-specific models that require high manual effort, lack semantic understanding for ambiguous cases, and fail to generalize across diverse modalities.

Why it matters:

Data inefficiencies (inconsistency, isolation, semantic limitations) cause an estimated 20–30% revenue loss in enterprises.
Existing rule-based systems (e.g., regex) struggle with semantic ambiguities like synonyms or domain jargon.
Task-specific ML models require expensive labeled data and feature engineering, limiting scalability to new domains.

Concrete Example: In entity matching, traditional similarity metrics fail to link 'IBM' with 'International Business Machines' due to syntactic differences. An LLM-enhanced approach uses semantic knowledge to resolve this abbreviation without explicit hard-coded rules.
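To make the example above concrete, here is a minimal sketch comparing a character-level similarity score with a semantic lookup. The `KNOWN_ALIASES` table and the `semantic_match` helper are hypothetical stand-ins for the knowledge an LLM would supply; they are not taken from the survey.

```python
from difflib import SequenceMatcher


def syntactic_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] using difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical alias table standing in for an LLM's world knowledge.
KNOWN_ALIASES = {"ibm": "international business machines"}


def canonical(name: str) -> str:
    """Map a name to its canonical form, falling back to the lowercased input."""
    key = name.strip().lower()
    return KNOWN_ALIASES.get(key, key)


def semantic_match(a: str, b: str) -> bool:
    """Two names match when they resolve to the same canonical form."""
    return canonical(a) == canonical(b)


score = syntactic_similarity("IBM", "International Business Machines")
print(f"syntactic similarity: {score:.2f}")  # low, since only 3 characters can align
print("semantic match:", semantic_match("IBM", "International Business Machines"))
```

The string metric stays low because at most three characters can align, while the lookup resolves the abbreviation directly. In an LLM-enhanced matcher, the alias table would be replaced by a model call.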

Key Novelty

Taxonomy of LLM-Enhanced Data Preparation

Categorizes the field into three core tasks: Cleaning (removing errors), Integration (combining sources), and Enrichment (adding insights), highlighting the shift to agentic workflows.
Identifies a transition from manual 'detect-then-correct' rules to instruction-driven, context-aware agents that autonomously plan and reflect on data transformations.
Emphasizes hybrid methodologies where LLMs generate executable code (e.g., Python/SQL) or transfer reasoning to smaller models to balance cost and performance.
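The code-generation variant of the hybrid approach can be sketched as follows: the model is called once to produce a transformation function, which then runs locally over every row. `fake_llm_generate_code` is a hypothetical stub returning fixed source text, and the phone-number task is an illustrative assumption, not an example from the survey.

```python
from typing import Callable, List, Optional


def fake_llm_generate_code(instruction: str) -> str:
    """Hypothetical stand-in for one LLM call; returns Python source as text."""
    return (
        "def normalize_phone(value):\n"
        "    digits = ''.join(ch for ch in value if ch.isdigit())\n"
        "    return digits[-10:] if len(digits) >= 10 else None\n"
    )


def compile_transform(source: str, func_name: str) -> Callable:
    """Turn generated source into a callable.

    exec() is only acceptable for trusted or sandboxed code; a real system
    would need review or isolation before running model output.
    """
    namespace: dict = {}
    exec(source, namespace)
    return namespace[func_name]


# One (stubbed) model call, then plain Python over all rows.
source = fake_llm_generate_code("Normalize US phone numbers to 10 digits")
normalize_phone = compile_transform(source, "normalize_phone")

rows: List[str] = ["(555) 123-4567", "+1 555 123 4567", "n/a"]
cleaned: List[Optional[str]] = [normalize_phone(r) for r in rows]
print(cleaned)
```

The cost argument is that the number of model calls no longer grows with the number of rows, at the price of having to validate the generated code.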

Evaluation Highlights

Reviews hundreds of recent works across data cleaning, integration, and enrichment tasks.
Identifies a major shift toward hybrid LLM-ML systems to reduce inference costs while maintaining semantic reasoning capabilities.
Highlights that while agentic workflows are emerging, few robust implementations currently exist in practice.

Breakthrough Assessment

7/10

Provides a comprehensive and structured taxonomy of a rapidly evolving field, clearly articulating the shift from rule-based to agentic systems, though it is a survey rather than a new method proposal.

⚙️ Technical Details

Problem Definition

Setting: Transforming a raw dataset D into a high-quality dataset D_out via Cleaning, Integration, and Enrichment functions.

Inputs: Raw datasets containing noise, inconsistencies, missing values, or isolated schemas.

Outputs: Cleaned, integrated, and enriched datasets suitable for downstream analytics or ML training.

Pipeline Flow

Data Cleaning (Standardization, Error Processing, Imputation)
Data Integration (Entity Matching, Schema Matching)
Data Enrichment (Annotation, Profiling)
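The three stages listed above can be read as functions applied to a dataset in turn. The sketch below assumes a strictly sequential order (cleaning, then integration, then enrichment), which is taken from the listing order rather than stated by the survey. The stage bodies are toy placeholders and every name is hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Record = Dict[str, object]
Dataset = List[Record]
Stage = Callable[[Dataset], Dataset]


def clean(data: Dataset) -> Dataset:
    """Toy cleaning: trim whitespace in 'name' and drop rows without a name."""
    out = []
    for row in data:
        name = str(row.get("name") or "").strip()
        if name:
            out.append({**row, "name": name})
    return out


def integrate(data: Dataset) -> Dataset:
    """Toy integration: keep the first record per case-insensitive name."""
    seen: Dict[str, Record] = {}
    for row in data:
        seen.setdefault(str(row["name"]).lower(), row)
    return list(seen.values())


def enrich(data: Dataset) -> Dataset:
    """Toy enrichment: annotate each record with its number of name tokens."""
    return [{**row, "name_tokens": len(str(row["name"]).split())} for row in data]


def prepare(data: Dataset, stages: Sequence[Stage] = (clean, integrate, enrich)) -> Dataset:
    """Apply each stage in order: D -> D_out."""
    for stage in stages:
        data = stage(data)
    return data


raw = [{"name": "  Acme Corp "}, {"name": "acme corp"}, {"name": None}]
d_out = prepare(raw)
print(d_out)
```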

System Modules

Data Cleaning Agent

Orchestrates cleaning workflows by identifying issues and invoking tools

Model or implementation: LLM (e.g., GPT-4) or Hybrid LLM-ML

Entity Matcher (Data Integration)

Decides if record pairs represent the same entity using semantic reasoning

Model or implementation: LLM with structured prompting or Code-based reasoning
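One possible shape for structured prompting in entity matching is shown below: records are serialized into a fixed template, and the reply is parsed as JSON. The prompt wording, the `match` field, and both function names are assumptions made for illustration; the model call itself is left out.

```python
import json


def build_prompt(left: dict, right: dict) -> str:
    """Serialize two records into a fixed, machine-parseable prompt."""
    return (
        "Do these two records refer to the same real-world entity?\n"
        f"Record A: {json.dumps(left, sort_keys=True)}\n"
        f"Record B: {json.dumps(right, sort_keys=True)}\n"
        'Answer with JSON: {"match": true or false, "reason": "..."}'
    )


def parse_decision(response: str) -> bool:
    """Read the model's reply; treat anything malformed as a non-match."""
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("match") is True


prompt = build_prompt(
    {"name": "IBM", "city": "Armonk"},
    {"name": "International Business Machines", "city": "Armonk"},
)
print(prompt)
# A reply such as the following would be parsed as a match:
print(parse_decision('{"match": true, "reason": "IBM is the abbreviation"}'))
```

Defaulting to a non-match on malformed output is a conservative design choice; a pipeline that values recall more highly might retry or escalate instead.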

Schema Matcher (Data Integration)

Maps attributes between source and target schemas

Model or implementation: RAG-enhanced LLM or Agent-based planner

Data Profiler

Generates semantic metadata and summaries

Model or implementation: LLM with task-specific prompts

Novel Architectural Elements

Unified taxonomy organizing methods into Cleaning, Integration, and Enrichment
Conceptualization of the shift from 'Model-Specific Pipelines' to 'Agentic Workflows' where LLMs plan execution
Integration of code-generation and retrieval (RAG) as standard components in data preparation architectures

Comparison to Prior Work

vs. Rule-based: LLMs handle semantic ambiguity and unstructured text without manual regex crafting.
vs. Deep Learning: LLMs offer zero-shot/few-shot generalization across domains without extensive retraining.
vs. Crowd-sourcing: LLMs provide lower latency and cost compared to human annotation [not cited in paper]
Shift: Moves from 'detect-then-correct' fixed pipelines to autonomous 'plan-execute-reflect' agent loops.
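The plan-execute-reflect loop named above can be sketched in a few lines. Here the planner is a hypothetical rule-based stub standing in for an LLM, the tools are two trivial string operations, and reflection simply re-runs the planner to check whether any issue remains. None of these names come from the survey.

```python
from typing import Callable, Dict, List, Tuple

Column = List[str]


def plan(values: Column) -> List[str]:
    """Hypothetical planner; an LLM would pick steps from a data profile."""
    steps = []
    if any(v != v.strip() for v in values):
        steps.append("strip")
    if any(v != v.lower() for v in values):
        steps.append("lowercase")
    return steps


# Tool registry the agent is allowed to invoke.
TOOLS: Dict[str, Callable[[Column], Column]] = {
    "strip": lambda vs: [v.strip() for v in vs],
    "lowercase": lambda vs: [v.lower() for v in vs],
}


def reflect(values: Column) -> bool:
    """The result is accepted when the planner finds nothing left to fix."""
    return not plan(values)


def run_agent(values: Column, max_rounds: int = 3) -> Tuple[Column, List[str]]:
    history: List[str] = []
    for _ in range(max_rounds):
        steps = plan(values)              # plan
        for step in steps:                # execute
            values = TOOLS[step](values)
            history.append(step)
        if reflect(values):               # reflect
            break
    return values, history


result, history = run_agent([" NYC", "Boston "])
print(result, history)
```

The `max_rounds` bound is a safeguard against a planner that never converges, which matters more once the planner is a model rather than a fixed rule.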

Limitations

High inference costs limit scalability for large datasets.
Hallucinations remain a risk, leading to incorrect data repairs or matches.
Reliable and robust agentic deployment remains under-explored compared to simple prompting.
A mismatch exists between advanced LLM capabilities and the weak evaluation protocols currently in use.

Reproducibility

Survey paper; it discusses various existing datasets and methods but does not release a single unified codebase. It references specific systems such as 'TableGPT2', 'Clean Agent', and 'AutoDCWorkflow'.

📊 Experiments & Results

Evaluation Setup

Surveys methods from the existing literature across the three primary tasks.

Benchmarks:

Data Cleaning Benchmarks (Standardization, Error Repair)
Entity Matching Benchmarks (Record Linkage)
Schema Matching Benchmarks (Column Alignment)

Metrics:

Accuracy (Precision/Recall/F1)
Generalization capability
Manual effort reduction
Statistical methodology: Not explicitly reported in the paper
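For reference, the precision, recall, and F1 metrics listed above are typically computed over predicted versus gold matching pairs. The sketch below uses made-up record IDs purely as toy data; they are not results from the survey.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def precision_recall_f1(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float, float]:
    """Standard pairwise metrics; each is 0.0 when its denominator is empty."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: four true matches, three predictions, two of them correct.
gold = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With two of three predictions correct and two of four gold pairs found, precision is 2/3, recall is 1/2, and F1 is 4/7.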

Main Takeaways

Shift toward cost-efficient hybrid methods: LLMs generate code or distill knowledge to SLMs rather than processing every row directly.
Reduced emphasis on fine-tuning: RAG and in-context learning are preferred over maintaining task-specific models.
Cross-modal generalization: Modern methods use unified representations for text and tables, reducing modality-specific engineering.
Agentic implementations are promising but still in early stages compared to prompt-based techniques.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of data management (cleaning, integration, schemas)
Familiarity with Large Language Models (LLMs) and prompting strategies
Knowledge of RAG (Retrieval-Augmented Generation) concepts

Key Terms

Data Cleaning: The process of detecting and repairing errors (e.g., typos, missing values) to ensure data consistency.

Entity Matching: Identifying records that refer to the same real-world entity across different datasets (e.g., linking 'Amazon' and 'Amazon.com').

Schema Matching: Identifying semantic correspondences between columns or attributes across different database schemas.

RAG: Retrieval-Augmented Generation—AI systems that enhance generation by retrieving relevant external context.

SLM: Small Language Model—models with fewer parameters often used for cost-efficient inference after distilling knowledge from larger LLMs.

Agentic Workflow: A system where LLMs autonomously plan, execute, and reflect on multi-step tasks rather than following a static script.