LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation

📝 Paper Summary

Synthetic Tabular Data Generation
Logical Consistency in Data Synthesis

LLM-TabLogic uses Large Language Models to extract and compress logical rules from tabular data, then conditions a score-based diffusion model to generate synthetic data that strictly adheres to these rules.

Core Problem

Existing generative models for tabular data focus on global statistical properties but fail to maintain domain-specific logical consistency between columns.

Why it matters:

Industrial applications (e.g., supply chains) require data to follow strict rules, such as delivery dates occurring after order dates.
Synthetically generated data without logical consistency lacks real-world utility for simulation and decision-making.
GANs and diffusion models struggle with discrete logical constraints, while LLM-based generators are too slow for efficient large-scale generation.

Concrete Example: In a logistics dataset, a standard diffusion model might generate a row where 'Delivery Date' is before 'Order Date', or a retail record where 'Total Price' does not equal 'Unit Price' × 'Quantity'. LLM-TabLogic enforces these constraints explicitly.
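
The two violations in the example can be checked mechanically. The following is a minimal sketch, not code from the paper: the column names (`order_date`, `delivery_date`, `unit_price`, `quantity`, `total_price`) and the helper `find_violations` are hypothetical and chosen only to mirror the example.

```python
from datetime import date

def find_violations(rows):
    """Return (row_index, rule_name) pairs for rows that break a rule."""
    violations = []
    for i, row in enumerate(rows):
        # Temporal rule: delivery must not precede the order.
        if row["delivery_date"] < row["order_date"]:
            violations.append((i, "temporal"))
        # Mathematical rule: total = unit price * quantity (small tolerance).
        expected = row["unit_price"] * row["quantity"]
        if abs(row["total_price"] - expected) > 1e-6:
            violations.append((i, "mathematical"))
    return violations

# Hypothetical rows: the first is consistent, the second breaks both rules.
rows = [
    {"order_date": date(2024, 3, 1), "delivery_date": date(2024, 3, 5),
     "unit_price": 2.5, "quantity": 4, "total_price": 10.0},
    {"order_date": date(2024, 3, 9), "delivery_date": date(2024, 3, 2),
     "unit_price": 3.0, "quantity": 2, "total_price": 7.0},
]
print(find_violations(rows))
```

A check like this can be run over any generator's output to count rule violations, independent of how the data was produced.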

Key Novelty

LLM-Reasoning Guided Diffusion

Uses an LLM as a 'relational conditioner' to infer, compress, and structurally encode logical dependencies (hierarchical, mathematical, temporal) from table metadata.
Decouples deterministic logic from probabilistic generation: rigid rules are handled by the LLM-derived compressor, while the diffusion model learns the remaining latent distribution.
Introduces a decompression phase that reconstructs the full data using the preserved logic, ensuring 100% consistency for deterministic relationships.

Architecture

The workflow of LLM-TabLogic, showing the serialization of data, LLM reasoning to extract logic, compression into latent space, diffusion training, and final reconstruction.

Evaluation Highlights

Achieves over 90% accuracy in logical inference on unseen tables, demonstrating strong generalization in capturing inter-column rules.
Outperforms state-of-the-art baselines (including TabSyn and GReaT) across data fidelity, utility, and privacy metrics on real-world industrial datasets.
Specifically preserves complex logical constraints that other methods violate, such as mathematical formulas and temporal sequences.

Breakthrough Assessment

8/10

Significant step forward in making synthetic tabular data usable for operations research by explicitly solving the logical inconsistency problem, which is often ignored by purely statistical generative models.

⚙️ Technical Details

Problem Definition

Setting: Generating synthetic tabular data X' that mimics the distribution of real data X while satisfying a set of logical constraints R derived from X.

Inputs: Tabular dataset X (columns C, values), column descriptions D, and target variable Y.

Outputs: Synthetic dataset X' that preserves statistical properties and logical inter-column relationships.

Pipeline Flow

Serialization: Convert table schema/metadata to text
LLM Reasoning: Extract logical rules (Hierarchical, Mathematical, Temporal)
Compression: Remove columns that are deterministically derivable from others, then encode the remaining data into a latent space via VAE
Diffusion Training: Score-based modeling in latent space
Generation & Decompression: Sampling latent vectors and reconstructing full data using preserved logic

System Modules

Serializer

Converts tabular schema and descriptions into natural language prompts

Model or implementation: Text-based formatting function
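
A text-based formatting function of this kind might look like the sketch below. The prompt wording, the function name `serialize_schema`, and the example columns are assumptions for illustration; the paper's actual prompt template is not given in this summary.

```python
def serialize_schema(table_name, columns, descriptions, sample_row=None):
    """Render column names, types, and descriptions as a text prompt."""
    lines = [f"Table: {table_name}", "Columns:"]
    for name, dtype in columns:
        # Fall back to a neutral note when a column has no description.
        desc = descriptions.get(name, "no description provided")
        lines.append(f"- {name} ({dtype}): {desc}")
    if sample_row is not None:
        example = ", ".join(f"{k}={v}" for k, v in sample_row.items())
        lines.append(f"Example row: {example}")
    lines.append(
        "List the hierarchical, mathematical, and temporal "
        "relationships between these columns."
    )
    return "\n".join(lines)

# Hypothetical schema used only to show the output shape.
prompt = serialize_schema(
    "orders",
    [("order_date", "date"), ("delivery_date", "date"), ("city", "category")],
    {"order_date": "Date the order was placed",
     "delivery_date": "Date the order arrived"},
)
print(prompt)
```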

Relational Conditioner (LLM) (Reasoning & Compression)

Infers logical relationships and groups columns

Model or implementation: LLM (specific model not identified in the source text; a GPT-4-class model is assumed from the reasoning capabilities described)
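
Because the inferred rules drive later deterministic steps, it helps to validate the LLM's answer before using it. The sketch below assumes the LLM is asked to reply in JSON with `kind`, `columns`, and `expression` fields; that response format, the `Rule` class, and `parse_rules` are hypothetical and not taken from the paper.

```python
import json
from dataclasses import dataclass

ALLOWED_KINDS = {"hierarchical", "mathematical", "temporal"}

@dataclass(frozen=True)
class Rule:
    kind: str
    columns: tuple
    expression: str

def parse_rules(llm_response, known_columns):
    """Keep only well-formed rules that reference columns in the table."""
    rules = []
    for item in json.loads(llm_response):
        kind = item.get("kind")
        cols = tuple(item.get("columns", []))
        if kind not in ALLOWED_KINDS:
            continue  # unknown relationship type
        if not cols or any(c not in known_columns for c in cols):
            continue  # the LLM referenced a column that does not exist
        rules.append(Rule(kind, cols, item.get("expression", "")))
    return rules

# Hypothetical LLM reply; the last rule names a column that is not in the table.
response = json.dumps([
    {"kind": "temporal", "columns": ["order_date", "delivery_date"],
     "expression": "order_date <= delivery_date"},
    {"kind": "mathematical", "columns": ["total", "tax"],
     "expression": "total = tax * 5"},
])
rules = parse_rules(response, {"order_date", "delivery_date", "total"})
print(rules)
```

Dropping rules that mention unknown columns is one cheap guard against hallucinated relationships.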

Compressor (VAE Encoder) (Reasoning & Compression)

Encodes data into latent space while factoring out deterministic logic

Model or implementation: Transformer-based VAE

Score-based Diffusion

Learns the distribution of latent embeddings

Model or implementation: Score-based generative model (SDE)

Decompressor

Reconstructs full tabular data from synthetic latents and logic rules

Model or implementation: Deterministic mapping function
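
The compress/decompress idea can be illustrated with a toy round trip. This is a sketch under stated assumptions: the columns (`city`, `country`, `unit_price`, `quantity`, `total_price`), the two rules, and the helper names are hypothetical, and the real pipeline would feed synthetic reduced rows (decoded from diffusion samples) into the decompressor rather than the original rows.

```python
# Columns that the rules can recompute, so they need not be modeled.
DERIVED = ("country", "total_price")

def learn_lookup(rows, child, parent):
    """Build a child -> parent mapping (hierarchical rule) from real data."""
    lookup = {}
    for row in rows:
        lookup.setdefault(row[child], row[parent])
    return lookup

def compress(rows):
    """Drop derived columns; only the rest goes to the generative model."""
    return [{k: v for k, v in row.items() if k not in DERIVED} for row in rows]

def decompress(reduced_rows, city_to_country):
    """Rebuild derived columns from the preserved rules."""
    full = []
    for row in reduced_rows:
        restored = dict(row)
        restored["country"] = city_to_country[row["city"]]            # hierarchical
        restored["total_price"] = row["unit_price"] * row["quantity"]  # mathematical
        full.append(restored)
    return full

real = [
    {"city": "Lyon", "country": "France", "unit_price": 2.5, "quantity": 4, "total_price": 10.0},
    {"city": "Turin", "country": "Italy", "unit_price": 3.0, "quantity": 2, "total_price": 6.0},
]
lookup = learn_lookup(real, "city", "country")
reduced = compress(real)
restored = decompress(reduced, lookup)
print(restored == real)
```

Since the derived columns are recomputed rather than sampled, any row passed through `decompress` satisfies both rules by construction, provided the rules themselves are correct.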

Novel Architectural Elements

Integration of an LLM as a 'relational conditioner' to explicitly guide the preprocessing and compression of tabular data before generative modeling.
Hybrid generation pipeline: Deterministic logic is handled by rule-based reconstruction (decompression) while stochastic patterns are handled by diffusion in latent space.

Modeling

Base Model: Score-based Diffusion Model (backbone) + Transformer VAE (encoder/decoder) + LLM (logic extractor)

Training Method: Denoising Score Matching for Diffusion; VAE training for compression

Objective Functions:

Purpose: Train the diffusion model to reverse the noise process.

Formally: Minimize the expected squared L2 distance between the predicted score and the true score (the gradient of the log density).

Purpose: Train the VAE to faithfully compress and reconstruct data.

Formally: Maximize the Evidence Lower Bound (ELBO), which combines a reconstruction term with a KL-divergence regularizer.
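
In standard notation, the two objectives can be written as below. This is the textbook form of denoising score matching and the negative ELBO, not a formula quoted from the paper; the symbols are assumptions: z_0 is a latent produced by the VAE encoder, z_t its noised version at time t, s_\theta the score network, \lambda(t) a positive weighting function, q_\phi the encoder, and p_\psi the decoder.

```latex
% Denoising score matching (diffusion in latent space)
\mathcal{L}_{\mathrm{DSM}}(\theta)
  = \mathbb{E}_{t}\,\lambda(t)\,
    \mathbb{E}_{z_0}\,\mathbb{E}_{z_t \mid z_0}
    \left[ \left\| s_\theta(z_t, t)
      - \nabla_{z_t} \log p_{0t}(z_t \mid z_0) \right\|_2^2 \right]

% Negative ELBO (VAE): reconstruction term plus KL regularizer
\mathcal{L}_{\mathrm{VAE}}(\phi, \psi)
  = \mathbb{E}_{q_\phi(z \mid x)}\left[ -\log p_\psi(x \mid z) \right]
  + D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```

Minimizing the second expression is equivalent to maximizing the ELBO.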

Adaptation: Logic extraction is prompt-based (inference only); Diffusion model is trained from scratch on compressed latents

Trainable Parameters: Weights of the score network (MLP/Transformer) and VAE encoder/decoder

Training Data:

Real-world industrial datasets (Supply Chain, etc.) used for training and evaluation

Key Hyperparameters:

diffusion_steps: Not explicitly reported in the paper
noise_schedule: Variance Preserving (VP) SDE (implied by score-based formulation reference)
learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper
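
For orientation, the forward perturbation of a VP SDE with a linear noise rate has a closed form. The sketch below is illustrative only: the linear schedule and the values `beta_min=0.1`, `beta_max=20.0` are common defaults from the score-based SDE literature and are assumptions here, since the paper's settings are not reported.

```python
import math
import random

def vp_marginal(t, beta_min=0.1, beta_max=20.0):
    """Mean scale and std of z_t given z_0 under a VP SDE with linear beta(t)."""
    # Integral of beta(s) = beta_min + s * (beta_max - beta_min) over [0, t].
    integral = beta_min * t + 0.5 * (beta_max - beta_min) * t * t
    mean_scale = math.exp(-0.5 * integral)
    std = math.sqrt(1.0 - math.exp(-integral))
    return mean_scale, std

def perturb(z0, t, rng):
    """Sample z_t = mean_scale * z0 + std * eps with eps ~ N(0, I)."""
    mean_scale, std = vp_marginal(t)
    return [mean_scale * z + std * rng.gauss(0.0, 1.0) for z in z0]

rng = random.Random(0)
z0 = [0.5, -1.2, 0.3]  # hypothetical latent vector
for t in (0.0, 0.5, 1.0):
    m, s = vp_marginal(t)
    # mean_scale^2 + std^2 stays at 1, which is what "variance preserving" refers to.
    print(t, round(m * m + s * s, 6), perturb(z0, t, rng))
```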

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. TabSyn: TabSyn uses VAE+Diffusion but ignores explicit logic rules; LLM-TabLogic compresses logic out before diffusion.
vs. GReaT: GReaT relies on LLM next-token prediction which can hallucinate values violating strict logic; LLM-TabLogic enforces logic deterministically.
vs. CTGAN: GANs struggle with discrete/mixed columns; LLM-TabLogic handles them via latent space diffusion and explicit rule preservation.

Limitations

Relies on the LLM's ability to correctly infer logic from metadata; poor metadata could lead to incorrect rules.
Deterministic reconstruction assumes logic rules are strict and universal (no noise/exceptions in the rules).
Two-stage process (Logic Extraction + Diffusion) is more complex than end-to-end models like CTGAN.

Reproducibility

Code: https://github.com/Van-Sot/LLM-TabLogic

Training hyperparameters (learning rate, batch size) and the exact LLM used (e.g., GPT-4 vs. Llama) are not detailed in the source text.

📊 Experiments & Results

Evaluation Setup

Benchmarking on real-world industrial datasets focusing on logic preservation, fidelity, and privacy.

Benchmarks:

Supply Chain Dataset (Industrial Tabular Data)
Another unnamed Industrial Dataset (Industrial Tabular Data)

Metrics:

Logical Inference Accuracy
Data Fidelity (Statistical similarity)
Data Utility (Machine Learning performance)
Privacy (Distance to nearest real neighbor)

Statistical methodology: Not explicitly reported in the paper
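
The privacy metric listed above, distance to the nearest real neighbor, can be sketched as follows. This assumes rows are already encoded as equal-length numeric vectors on comparable scales and uses Euclidean distance; the paper's exact distance definition and preprocessing are not stated in this summary, and `distance_to_closest_record` is a hypothetical helper name.

```python
import math

def distance_to_closest_record(synthetic, real):
    """For each synthetic row, the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]

# Hypothetical 2-D encoded rows.
real = [(0.0, 0.0), (3.0, 4.0)]
synthetic = [(0.0, 0.0), (3.0, 0.0)]
print(distance_to_closest_record(synthetic, real))
```

A distance of zero means a synthetic row exactly copies a real one, so very small distances across many rows suggest memorization rather than generation.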

Key Results

Logical inference accuracy results demonstrate the method's ability to correctly identify column relationships.

Benchmark	Metric	Baseline	This Paper	Δ
Unseen Tables	Accuracy (%)	Not applicable	90.0	Not applicable

Main Takeaways

LLM-TabLogic achieves >90% accuracy in identifying logical relationships, verifying the 'Relational Conditioner' component works.
Outperforms baselines (SMOTE, CTGAN, TabDDPM, TabSyn, GReaT) in preserving inter-column relationships while maintaining high data fidelity.
Offers a better trade-off between privacy and utility compared to raw data copying or naive generation.
Successfully handles complex dependencies (hierarchical, mathematical, temporal) that standard statistical models miss.

📚 Prerequisite Knowledge

Prerequisites

Score-based diffusion models (SDEs, forward/reverse processes)
Variational Autoencoders (VAEs) for latent space mapping
Large Language Models (LLMs) for reasoning/extraction

Key Terms

Inter-column relationships: Logical dependencies between columns, such as 'City' determining 'Country' (hierarchical) or 'Start Date' < 'End Date' (temporal).

Score-based diffusion: A generative modeling approach that learns the gradient of the data log-density (score) to iteratively denoise random noise into data.

Latent space: A compressed vector representation of data where the diffusion model operates, making generation more efficient than in raw data space.

Serialization: Converting tabular data (schema and descriptions) into a text string format that an LLM can process to infer relationships.

Hierarchical consistency: Dependencies where one column represents a subgroup of another (e.g., City belongs to Country).

Mathematical dependencies: Deterministic relationships defined by formulas (e.g., Total = Price * Quantity).

Temporal dependencies: Sequential constraints on time-based columns (e.g., Step 1 happens before Step 2).

SMOTE: Synthetic Minority Over-sampling TEchnique—a classic interpolation-based method for generating synthetic data.

CTGAN: Conditional Tabular GAN—a generative adversarial network specifically designed for tabular data.

VAE: Variational Autoencoder—a neural network that learns to compress data into a latent space and reconstruct it.