
A Note on Statistically Accurate Tabular Data Generation Using Large Language Models

A Sidorenko
MOSTLY AI
arXiv, 5/2025 (2025)
Reasoning, Factuality

Paper Summary

Synthetic Tabular Data Generation, LLM Prompting Strategies
Instead of generating data row-by-row or token-by-token, this method prompts LLMs to output probability distributions for categorical features and then samples from those distributions to create synthetic datasets.
Core Problem
LLMs struggle to generate realistic tabular data because their auto-regressive nature biases them towards linguistic coherence rather than statistical accuracy, often failing to capture complex feature dependencies.
Why it matters:
  • Standard cell-by-cell generation is computationally expensive (one query per cell) and prone to error propagation
  • Table-wide generation ignores fine-grained feature correlations and often produces 'average' or over-smoothed distributions
  • Existing methods often reflect the LLM's language priors rather than the actual statistical properties of the target domain
Concrete Example: In a California demographic dataset, standard cell-by-cell generation fails to preserve the inverse correlation between age and 'Latino' ethnicity, instead producing uniform ethnicity distributions across all age groups.
Key Novelty
Probability-Driven Prompting
  • Ask the LLM to output a probability distribution (e.g., a JSON of categories and their likelihoods) given a specific context, rather than generating a specific data value directly
  • Decompose generation into a hierarchy: generate a parent distribution (e.g., Age), then query for conditional distributions (e.g., Ethnicity given Age), and finally sample locally
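The two steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `mock_llm` is a hypothetical stand-in for a real LLM call, the prompts are invented for the example, and every probability it returns is a placeholder rather than a real demographic figure.

```python
import json
import random

# Hypothetical prompts; the paper's actual wording is not reproduced here.
AGE_PROMPT = "Return a JSON object of age-group probabilities for California."
ETH_PROMPT = ("Return a JSON object of ethnicity probabilities for "
              "California residents in age group {age}.")


def mock_llm(prompt: str) -> str:
    """Stand-in for an LLM call. All numbers are illustrative placeholders."""
    if "ethnicity" in prompt:
        if "65+" in prompt:
            return json.dumps({"Latino": 0.2, "Other": 0.8})
        return json.dumps({"Latino": 0.5, "Other": 0.5})
    return json.dumps({"0-17": 0.25, "18-64": 0.6, "65+": 0.15})


def parse_distribution(raw: str) -> dict:
    """Parse a JSON distribution and renormalize it so it sums to 1."""
    dist = json.loads(raw)
    total = sum(dist.values())
    if total <= 0 or any(p < 0 for p in dist.values()):
        raise ValueError("invalid probability distribution")
    return {category: p / total for category, p in dist.items()}


def sample(dist: dict, rng: random.Random) -> str:
    """Draw one category according to its probability."""
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


def generate_rows(n_rows: int, llm=mock_llm, seed: int = 0) -> list:
    rng = random.Random(seed)
    # One query for the parent distribution (age group).
    age_dist = parse_distribution(llm(AGE_PROMPT))
    # One query per parent category for the conditional distribution.
    cond = {age: parse_distribution(llm(ETH_PROMPT.format(age=age)))
            for age in age_dist}
    # Sampling is purely local: no further LLM calls, whatever n_rows is.
    rows = []
    for _ in range(n_rows):
        age = sample(age_dist, rng)
        rows.append({"age_group": age, "ethnicity": sample(cond[age], rng)})
    return rows


rows = generate_rows(1000)
```

Because the conditional distributions are cached per parent category, the number of LLM calls in this sketch depends only on the number of age groups, not on how many rows are sampled.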
Evaluation Highlights
  • Successfully reproduces the age-dependent ethnicity shift in California demographics (decreasing Latino population with age), whereas cell-by-cell approaches failed completely
  • Reduces computational overhead from O(Rows × Columns) queries to O(Columns × Categories) queries, requiring only 5-6 LLM calls to generate an arbitrary number of rows for the tested dataset
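To make the query-count comparison concrete, here is a small arithmetic sketch. It assumes a simple chain hierarchy in which each child column is conditioned only on its direct parent, so the distribution-based approach needs one query for the root column plus one per parent category for each child. The function names and the example sizes are hypothetical; the summary does not state how the reported 5-6 calls break down.

```python
def cell_by_cell_queries(n_rows: int, n_columns: int) -> int:
    """One LLM query per generated cell."""
    return n_rows * n_columns


def distribution_queries(parent_category_counts: list) -> int:
    """One query for the root column, plus one query per parent category
    for every child column (chain-hierarchy assumption)."""
    return 1 + sum(parent_category_counts)


# Hypothetical sizes: 10,000 rows, 2 columns, 5 categories in the parent column.
print(cell_by_cell_queries(10_000, 2))  # 20000
print(distribution_queries([5]))        # 6
```

Under these assumptions, the second count stays fixed as the number of rows grows, which is the scalability point made above.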
Breakthrough Assessment
7/10
Simple but highly effective shift in perspective: moving from generating samples to generating distributions. Solves a major scalability bottleneck in LLM-based tabular data generation while improving statistical fidelity.