A Survey of LLM × DATA

📝 Paper Summary

Data Management for LLMs
LLMs for Data Management

This survey provides a comprehensive taxonomy of the bidirectional relationship between data management and LLMs, defining the 'IaaS' (Inclusiveness, Abundance, Articulation, Sanitization) criteria for LLM data.

Core Problem

The integration of LLMs and data management is fragmented; LLMs require massive, high-quality data handling that challenges traditional systems, while traditional data tasks (cleaning, integration) remain manual and brittle.

Why it matters:

Scalable development of LLMs depends on handling petabytes of multi-modal data efficiently, which current ad-hoc pipelines struggle to manage
Traditional data manipulation relies on rigid rules that fail on complex or noisy data, whereas LLMs offer semantic flexibility but lack systematic integration
RAG accuracy degrades by up to 12% as document volume grows, so retrieval at scale requires specialized data serving techniques

Concrete Example: On the DATA4LLM side, a RAG system that relies solely on vector similarity loses up to 12% accuracy when the corpus grows from 10,000 to 100,000 pages. On the LLM4DATA side, standardizing date formats like 'Fri Jan 1st' vs '1996.07.10' traditionally requires complex regex scripts, whereas LLMs can resolve these semantic inconsistencies automatically.
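To make the brittleness of rule-based date standardization concrete, here is a minimal sketch of a regex-driven normalizer. The function name `normalize_date`, the two patterns, and the `default_year` parameter are illustrative assumptions and do not come from the survey. The sketch does not validate calendar ranges; it only shows that every format needs its own hand-written rule.

```python
import re
from typing import Optional

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# Hypothetical rule set: one hand-written pattern per known format.
NUMERIC = re.compile(r"^(\d{4})[./-](\d{1,2})[./-](\d{1,2})$")
TEXTUAL = re.compile(
    r"^(?:[A-Za-z]{3}\s+)?([A-Za-z]{3})\s+(\d{1,2})(?:st|nd|rd|th)?$")


def normalize_date(raw: str, default_year: Optional[int] = None) -> Optional[str]:
    """Return a YYYY-MM-DD string, or None when no rule applies."""
    text = raw.strip()

    match = NUMERIC.match(text)
    if match:
        year, month, day = (int(group) for group in match.groups())
        return f"{year:04d}-{month:02d}-{day:02d}"

    match = TEXTUAL.match(text)
    if match and match.group(1).lower() in MONTHS and default_year is not None:
        # The input carries no year, so the caller has to supply one.
        month = MONTHS[match.group(1).lower()]
        return f"{default_year:04d}-{month:02d}-{int(match.group(2)):02d}"

    return None  # Unknown format: a new rule would have to be written.
```

Under these rules, '1996.07.10' normalizes directly, 'Fri Jan 1st' is only resolvable if the caller guesses the year, and a variant such as 'July 10, 1996' falls through entirely. This gap is what the LLM-based approach is meant to close.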

Key Novelty

Bidirectional Data×LLM Taxonomy & IaaS Framework

Proposes 'IaaS' (Inclusiveness, Abundance, Articulation, Sanitization) as a principled framework to evaluate dataset quality across the LLM lifecycle
Systematically categorizes DATA4LLM (processing, storage, serving for models) and LLM4DATA (using models for cleaning, integration, system tuning), unlike prior surveys focusing only on pre-training

Architecture

The bidirectional framework of Data×LLM. The left side shows DATA4LLM (Processing, Storage, Serving feeding into LLM lifecycle). The right side shows LLM4DATA (LLM enabling Manipulation, Analysis, Optimization).

Evaluation Highlights

Highlights that RAG accuracy drops by up to 12% when scaling from 10,000 to 100,000 pages without advanced data serving techniques (Source: EyeLevel.ai)
Notes that learning-based system tuning (e.g., RL/BO) requires >20 hours of workload replays for a single TPC-H workload, which LLM-based tuning can shorten by drawing on context such as manuals and logs
Cites DeepSeek-R1's 671B parameters and Qwen2.5-VL's ~4TB tokens as indicators of the scale of data storage and processing required

Breakthrough Assessment

9/10

A foundational survey that defines the scope of a new interdisciplinary field. It unifies scattered techniques into a coherent taxonomy, essential for future research in both database and AI communities.

⚙️ Technical Details

Problem Definition

Setting: Survey and taxonomy construction for the intersection of Large Language Models and Data Management systems

Inputs: Literature across database systems, machine learning, and natural language processing

Outputs: Taxonomy of techniques, challenges, and the 'IaaS' data quality framework

Pipeline Flow

DATA4LLM: Processing (Acquisition → Deduplication → Filtering → Selection → Mixing → Synthesis)
DATA4LLM: Storage (Formats → Distribution → Organization → Movement → Fault Tolerance)
DATA4LLM: Serving (Shuffling → Compression → Packing → Provenance)
LLM4DATA: Manipulation (Cleaning → Integration → Discovery)
LLM4DATA: Analysis (Structured → Semi-structured → Unstructured)
LLM4DATA: Optimization (Configuration Tuning → Query Optimization → Anomaly Diagnosis)
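As an illustration of the Packing step in the Serving stage above, the following is a minimal sketch that groups tokenized sequences into fixed-size context windows using first-fit decreasing bin packing. This is one common heuristic chosen for illustration, not necessarily the method the survey describes; `pack_sequences` and `max_len` are hypothetical names.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sequence indices into bins whose total length is at most max_len.

    First-fit decreasing: visit sequences longest first and place each one
    in the first bin that still has enough free space.
    """
    bins: List[List[int]] = []   # sequence indices per bin
    remaining: List[int] = []    # free token slots per bin

    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sequence {i} has {n} tokens, exceeds max_len={max_len}")
        for b, free in enumerate(remaining):
            if n <= free:
                bins[b].append(i)
                remaining[b] -= n
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([i])
            remaining.append(max_len - n)
    return bins
```

For example, `pack_sequences([5, 3, 4, 2, 6], max_len=8)` fills two windows completely and leaves a third half empty, compared with five padded windows when each sequence gets its own.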

System Modules

Data Processing (DATA4LLM)

Prepare high-quality data for training

Model or implementation: Various heuristic and model-based tools (e.g., deduplication hashes, quality filters)
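The hash-based deduplication and heuristic quality filtering mentioned above can be sketched as follows. This is a minimal example covering exact duplicates only (near-duplicate methods are a separate technique); the function names and both thresholds are illustrative assumptions rather than values from the survey.

```python
import hashlib
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def passes_quality(text: str, min_words: int = 5,
                   max_symbol_ratio: float = 0.3) -> bool:
    """Heuristic filter; both thresholds are assumed for illustration."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def dedup_and_filter(docs: Iterable[str]) -> Iterator[str]:
    """Yield the first copy of each document that passes the quality filter."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        if passes_quality(doc):
            yield doc
```

Hashing the normalized text keeps memory use proportional to the number of distinct documents rather than their total size, which matters at the corpus scales the survey discusses.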

Data Serving (DATA4LLM)

Optimize data delivery during inference/RAG

Model or implementation: Vector databases, RAG re-rankers, prompt compressors
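A two-stage retrieve-then-re-rank flow of the kind listed above can be sketched with the standard library alone. The embeddings here are toy vectors, and the re-ranker is a simple term-overlap score standing in for a learned re-ranking model; all names are hypothetical and not taken from the survey or any specific vector database.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             index: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Stage 1: top-k document ids by cosine similarity to the query vector."""
    ranked = sorted(index, key=lambda d: cosine(query_vec, index[d]), reverse=True)
    return ranked[:k]


def rerank(query: str, candidates: List[str], texts: Dict[str, str]) -> List[str]:
    """Stage 2: reorder candidates by the fraction of query terms they contain.

    A stand-in for a learned re-ranker, assumed for illustration only.
    """
    terms = set(query.lower().split())

    def overlap(doc_id: str) -> float:
        if not terms:
            return 0.0
        return len(terms & set(texts[doc_id].lower().split())) / len(terms)

    return sorted(candidates, key=overlap, reverse=True)
```

The design point is that the cheap first stage narrows a large corpus to a short candidate list, and the more expensive second stage only scores those candidates.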

System Optimization (LLM4DATA)

Tune database parameters and diagnose errors

Model or implementation: LLM agents with RAG (retrieving manuals/logs)

Novel Architectural Elements

Unified 'IaaS' framework for data quality evaluation
Bidirectional taxonomy integrating database system principles (storage, I/O) with AI lifecycle needs

Modeling

Base Model: N/A (Survey paper discussing various models including DeepSeek-R1, Qwen2.5-VL)

Comparison to Prior Work

vs. Pre-training Surveys: Covers full lifecycle including RAG, SFT, and Agents [not cited in paper]
vs. Traditional Data Management: Incorporates semantic reasoning of LLMs for tasks like schema matching and cleaning
vs. ML4DB Surveys: Focuses specifically on unique capabilities of Large Language Models (reasoning, code generation) rather than general ML optimization

Limitations

Handling semi-structured data (JSON, spreadsheets) with LLMs remains brittle compared to structured data
Privacy and copyright concerns in large-scale data processing are identified but solutions are still maturing
High computational cost of using LLMs for routine data tasks (e.g., cleaning millions of rows) is a bottleneck

Reproducibility

Not provided (Survey paper). Code availability for specific referenced methods varies by citation.

📊 Experiments & Results

Evaluation Setup

Review and synthesis of experimental results from referenced literature

Benchmarks:

TPC-H (Database query performance)
Biomedical/Legal domains (Domain-specific QA)

Metrics:

RAG Accuracy
Tuning time (hours)
Data processing scale (TB)

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
RAG Performance (EyeLevel.ai)	Accuracy degradation (%), 10,000 → 100,000 pages	0	up to 12	up to 12
System Tuning (RL/BO, single TPC-H workload)	Tuning time (hours)	>20	Not reported in the paper	Not reported in the paper

Experiment Figures

The data flow across the 7 stages of the LLM lifecycle: Pre-training, Continual Pre-training, SFT, RL, RAG, Agents, and Evaluation.

Main Takeaways

Data quality (IaaS) is the primary bottleneck for advancing LLM capabilities beyond architectural tweaks
The relationship is bidirectional: LLMs need DB techniques for scale (storage/serving), and DBs need LLMs for semantic automation (cleaning/tuning)
RAG systems face non-linear performance degradation with scale, necessitating database-inspired serving layers (filtering/reranking)
LLMs are replacing rigid rule-based systems for data integration and cleaning due to superior semantic understanding

📚 Prerequisite Knowledge

Prerequisites

Understanding of the LLM lifecycle (Pre-training, SFT, RAG, Inference)
Basic knowledge of database systems (Storage, Indexing, Query Optimization)
Familiarity with data engineering pipelines (ETL, Cleaning, Integration)

Key Terms

DATA4LLM: The domain of using data management techniques (processing, storage, serving) to support the lifecycle of Large Language Models

LLM4DATA: The domain of using Large Language Models to enhance data management tasks (cleaning, integration, system optimization)

IaaS: Inclusiveness, Abundance, Articulation, Sanitization—the four essential dimensions proposed for assessing LLM dataset quality

RAG: Retrieval-Augmented Generation—systems that retrieve external documents to ground LLM responses

KV-cache: Key-Value cache—storing intermediate attention calculations during LLM inference to avoid re-computation

SFT: Supervised Fine-Tuning—training an LLM on labeled examples to follow instructions

CoT: Chain-of-Thought—prompting technique where the model generates intermediate reasoning steps

BO: Bayesian Optimization—a strategy for global optimization of black-box functions, often used in system tuning

RL: Reinforcement Learning—training agents to take actions in an environment to maximize cumulative reward

PII: Personally Identifiable Information—sensitive data that must be filtered from training sets