A Note on Statistically Accurate Tabular Data Generation Using Large Language Models

📝 Paper Summary

Synthetic Tabular Data Generation
LLM Prompting Strategies

Instead of generating data row-by-row or token-by-token, this method prompts LLMs to output probability distributions for categorical features and then samples from those distributions to create synthetic datasets.

Core Problem

LLMs struggle to generate realistic tabular data because their auto-regressive nature biases them towards linguistic coherence rather than statistical accuracy, often failing to capture complex feature dependencies.

Why it matters:

Standard cell-by-cell generation is computationally expensive (one query per cell) and prone to error propagation
Table-wide generation ignores fine-grained feature correlations and often produces 'average' or over-smoothed distributions
Existing methods often reflect the LLM's language priors rather than the actual statistical properties of the target domain

Concrete Example: In a California demographic dataset, standard cell-by-cell generation fails to preserve the inverse correlation between age and 'Latino' ethnicity, instead producing uniform ethnicity distributions across all age groups.

Key Novelty

Probability-Driven Prompting

Ask the LLM to output a probability distribution (e.g., a JSON of categories and their likelihoods) given a specific context, rather than generating a specific data value directly
Decompose generation into a hierarchy: generate a parent distribution (e.g., Age), then query for conditional distributions (e.g., Ethnicity given Age), and finally sample locally
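
The two ideas above can be sketched as a prompt builder and a reply parser. This is a minimal illustration, not the paper's actual prompt: the prompt wording, the function names `build_distribution_prompt` and `parse_distribution`, the age-group labels, and the example reply are all assumptions made for this sketch. No LLM is called; the reply string is a made-up stand-in.

```python
import json


def build_distribution_prompt(context, feature, categories, given=None):
    """Build a prompt that asks for a probability distribution, not a single value.

    `given` is an optional dict of already-fixed parent features,
    e.g. {"Age": "65+"}, used for conditional queries.
    """
    condition = ""
    if given:
        parts = ", ".join(f"{k} = {v}" for k, v in given.items())
        condition = f" for records where {parts}"
    return (
        f"Context: {context}\n"
        f"Estimate the probability distribution of '{feature}'{condition}.\n"
        f"Categories: {', '.join(categories)}\n"
        "Answer only with a JSON object mapping each category to a probability."
    )


def parse_distribution(reply, categories):
    """Parse the JSON reply and renormalize so the probabilities sum to 1."""
    raw = json.loads(reply)
    # Missing categories count as 0; negative values are clipped to 0.
    weights = {c: max(float(raw.get(c, 0.0)), 0.0) for c in categories}
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("reply contains no positive probabilities")
    return {c: w / total for c, w in weights.items()}


# Hypothetical reply text, invented for illustration (not real model output).
example_reply = '{"0-17": 0.2, "18-64": 0.6, "65+": 0.2}'
age_groups = ["0-17", "18-64", "65+"]

prompt = build_distribution_prompt("Population of California", "Age", age_groups)
age_dist = parse_distribution(example_reply, age_groups)
```

Renormalizing in the parser is a defensive choice for this sketch: a model's reported probabilities may not sum exactly to 1, and sampling code is simpler when it can rely on a proper distribution.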

Evaluation Highlights

Successfully reproduces the age-dependent ethnicity shift in California demographics (decreasing Latino population with age), whereas cell-by-cell approaches failed completely
Reduces computational overhead from O(Rows × Columns) queries to O(Columns × Categories) queries; for the tested dataset, one initial prompt plus 5-6 conditional prompts suffice to generate an arbitrary number of rows
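
The query-count argument can be made concrete with a small calculation. This is a sketch under stated assumptions: a two-level hierarchy (one parent feature, one child feature), and hypothetical helper names `cell_by_cell_queries` and `probability_driven_queries`. The row and column counts are illustrative, not figures from the paper.

```python
def cell_by_cell_queries(n_rows, n_columns):
    """One LLM query per generated cell."""
    return n_rows * n_columns


def probability_driven_queries(n_parent_categories):
    """One query for the parent distribution, plus one conditional
    query per parent category (assumes a single parent/child pair)."""
    return 1 + n_parent_categories


# Illustrative sizes only: 2 columns, 6 parent categories.
for n_rows in (100, 1_000_000):
    print(
        n_rows,
        cell_by_cell_queries(n_rows, n_columns=2),
        probability_driven_queries(n_parent_categories=6),
    )
```

The cell-by-cell count grows linearly with the number of rows, while the probability-driven count depends only on how many categories the parent feature has.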

Breakthrough Assessment

7/10

Simple but highly effective shift in perspective: moving from generating samples to generating distributions. Solves a major scalability bottleneck in LLM-based tabular data generation while improving statistical fidelity.

⚙️ Technical Details

Problem Definition

Setting: Generating synthetic tabular data that preserves marginal distributions and conditional correlations of categorical variables without fine-tuning

Inputs: Context description (domain knowledge) and list of target categorical features

Outputs: A synthetic dataset (rows and columns) following the estimated statistical properties

Pipeline Flow

Context Setup → Distribution Query (LLM) → Conditional Distribution Query (LLM) → Local Sampling

System Modules

Context Definer

Define the dataset context, variables, and categories in a structured prompt

Model or implementation: Prompt Template

Distribution Estimator

Estimate probability distributions for categories given context or prior features

Model or implementation: gpt-4o

Sampler

Generate actual data rows by sampling from the estimated distributions

Model or implementation: Statistical Sampling Algorithm (Python script)
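
A minimal sketch of what the Sampler stage might look like, assuming the distributions have already been obtained and parsed into dictionaries. The function name `sample_rows` and all probabilities below are placeholders invented for this example; they are neither Census figures nor model output, and the repository's actual script may differ.

```python
import random


def sample_rows(parent_dist, conditional_dists, n_rows, seed=None):
    """Sample rows locally: draw Age from its distribution, then draw
    Ethnicity from the conditional distribution for that Age value."""
    rng = random.Random(seed)
    ages = list(parent_dist)
    age_weights = [parent_dist[a] for a in ages]
    rows = []
    for _ in range(n_rows):
        age = rng.choices(ages, weights=age_weights, k=1)[0]
        child = conditional_dists[age]
        ethnicity = rng.choices(list(child), weights=list(child.values()), k=1)[0]
        rows.append({"Age": age, "Ethnicity": ethnicity})
    return rows


# Placeholder distributions (made up for illustration only).
parent_dist = {"0-17": 0.25, "18-64": 0.60, "65+": 0.15}
conditional_dists = {
    "0-17": {"Latino": 0.5, "Other": 0.5},
    "18-64": {"Latino": 0.4, "Other": 0.6},
    "65+": {"Latino": 0.2, "Other": 0.8},
}

rows = sample_rows(parent_dist, conditional_dists, n_rows=1000, seed=0)
```

No LLM call happens inside `sample_rows`, so the number of rows can be raised without adding queries.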

Novel Architectural Elements

Two-stage generation topology: first querying the LLM for abstract statistical distributions (metadata), then generating data via local sampling, effectively decoupling statistical inference from record generation

Modeling

Base Model: gpt-4o

Compute: Extremely low compared to baselines. For the California dataset (fixed State), only 1 initial prompt + 5-6 conditional prompts were required to obtain the distributions, from which any number of rows can then be sampled locally.

Comparison to Prior Work

vs. Cell-by-cell: Generates distributions rather than values, so the number of LLM calls drops from O(Rows × Columns) to a count that is independent of the number of rows
vs. Table-wide: Allows explicit modeling of conditional dependencies (e.g., P(Ethnicity|Age)) rather than hoping the LLM captures them in a single pass
vs. Tabula [not cited in paper]: Tabula also uses distributions but often relies on fine-tuning or in-context learning with real samples; this method is zero-shot prompting for distribution estimation
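
The explicit conditional modeling mentioned in the table-wide comparison rests on the standard chain rule of probability. Written for the Age/Ethnicity example (the symbols a and e are notation introduced here, not taken from the paper):

```latex
P(\text{Age} = a,\ \text{Ethnicity} = e)
  = P(\text{Age} = a)\; P(\text{Ethnicity} = e \mid \text{Age} = a)

P(\text{Ethnicity} = e)
  = \sum_{a} P(\text{Age} = a)\; P(\text{Ethnicity} = e \mid \text{Age} = a)
```

The first line is the joint distribution that hierarchical sampling draws from; the second shows that the marginal Ethnicity distribution follows from the same queried quantities.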

Limitations

Reliance on LLM's pre-trained knowledge limits applicability to domains where the LLM lacks prior knowledge or where specific private data distributions must be matched exactly
Complexity scales with the number of conditional dependencies (columns) and unique categories, potentially requiring many prompts for high-dimensional data
Experiments limited to a single, simple case study (California demographics) without extensive benchmarking on standard tabular datasets

Reproducibility

Code: https://github.com/mostly-ai/paper-DataLLM-materials

📊 Experiments & Results

Evaluation Setup

Generation of a synthetic dataset representing California population demographics (Age and Ethnicity distributions).

Benchmarks:

California Demographics Case Study (Synthetic Data Generation) [New]

Metrics:

Visual inspection of distribution plots (Qualitative fidelity)
Implicit: Computational cost (Number of LLM calls)
Statistical methodology: Not explicitly reported in the paper

Key Results

The method is evaluated primarily through qualitative comparison of generated distributions against ground truth Census data.

Benchmark	Metric	Baseline	This Paper	Δ
California Demographics	LLM Queries (for N rows)	30000	7	-29993

Experiment Figures

Comparison of Age vs. Ethnicity distributions for Ground Truth, Table-wide prompting, Cell-by-cell prompting, and Probability-driven prompting.

Main Takeaways

Probability-driven prompting successfully captures complex conditional dependencies (e.g., Ethnicity changes with Age), which cell-by-cell generation did not capture at all
The method decouples generation cost from dataset size; generating 1 million rows costs the same in LLM tokens as generating 100 rows, making it highly scalable
Table-wide prompting suffers from 'over-smoothing', producing variance that is too low compared to real data, while probability-driven prompting allows for controlled variance via sampling

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs) and auto-regressive generation
Familiarity with tabular data structure (rows, columns, categorical vs. continuous features)
Basic probability concepts (conditional probability, sampling from distributions)

Key Terms

auto-regressive: A generation process where the model predicts the next element in a sequence based on all previous elements

table-wide prompting: Prompting an LLM to generate an entire table or batch of rows in a single text output

cell-by-cell generation: Prompting an LLM to generate one specific cell value at a time, conditioned on previously generated cells in the row

conditional distribution: The probability distribution of a variable (e.g., Ethnicity) given the value of another variable (e.g., Age Group)

token bias: The tendency of LLMs to assign probabilities based on frequency in their training text corpus rather than the specific statistical context of the tabular task

statistical fidelity: How accurately the synthetic data reproduces the statistical properties (correlations, distributions) of the real-world data

GReaT: Generation of Realistic Tabular data, a framework that fine-tunes LLMs to generate tabular data by treating rows as text sequences