Tabby: Tabular Data Synthesis with Language Models

📝 Paper Summary

Synthetic Data Generation · Tabular Deep Learning

Tabby modifies transformer LLMs by replacing standard layers with column-specific Mixture-of-Experts blocks, enabling high-fidelity tabular data synthesis when combined with a simplified 'Plain' training method.

Core Problem

Existing LLM-based tabular synthesis methods require complex preprocessing or permutation schemes and struggle to model column interdependencies, while non-LLM methods (GANs, Diffusion) often fail on mixed-type data.

Why it matters:

Tabular data is ubiquitous (healthcare, finance, logs) but under-researched compared to text/images
Creating a 'Large Tabular Model' from scratch is resource-prohibitive due to data scarcity and compute costs
Prior adaptations of LLMs to tables treat columns as generic text, failing to capture specific statistical properties of distinct features

Concrete Example: In the Rainfall dataset, prior SOTA LLM methods (GReaT) fail to produce any valid samples in 3/3 runs, while diffusion models like Tab-DDPM cannot model non-integer regression targets, distorting the distribution.

Key Novelty

Column-Specific Mixture-of-Experts (Tabby) + Simplified Serialization (Plain)

Replaces specific LLM layers (e.g., the language head) with a Mixture-of-Experts block where each 'expert' is dedicated to a single column in the table
Uses a deterministic gating function that routes tokens to the expert corresponding to the current column being generated
Introduces 'Plain' training: a simple serialization format (Column is Value <EOC>) without the complex permutations or conditioning used in prior works like GReaT

Architecture

Comparison of Standard LLM vs. Tabby LLM architecture. Shows how specific blocks (like the LM Head) are replaced by a bank of experts.

Evaluation Highlights

Achieves parity with real data (Machine Learning Efficacy score) on 5 out of 8 datasets, effectively serving as a drop-in replacement
Outperforms the prior SOTA diffusion model (TabDiff) and LLM method (GTT) in aggregate Area Under Performance Profile (AUP) score
Small Tabby models (Distilled-GPT2, ~82M params) outperform much larger non-Tabby Llama-3-8B models on tabular synthesis tasks

Breakthrough Assessment

8/10

Significantly simplifies LLM tabular training while beating complex diffusion and GAN baselines. The architectural modification is elegant and enables small models to punch well above their weight.

⚙️ Technical Details

Problem Definition

Setting: Generative modeling of tabular datasets with V columns, potentially including mixed types (numerical, categorical, nested JSON)

Inputs: A dataset of rows, serialized into text sequences

Outputs: Synthetic rows that preserve the statistical properties and downstream utility (Machine Learning Efficacy) of the original data

Pipeline Flow

Serialization (Plain): Convert table row to text string
Tabby LLM (Modified Transformer): Process tokens with column-specific routing
Generation: Auto-regressive sampling of new rows

System Modules

Serializer

Converts tabular row to sequence of tokens

Model or implementation: Deterministic rule
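
A minimal sketch of what such a deterministic serialization rule could look like, assuming the 'Column is Value' format with an end-of-column marker after every column. The function names, the literal `<EOC>` spelling, and the example columns are illustrative assumptions, not taken from the authors' code.

```python
EOC = "<EOC>"  # assumed spelling of the end-of-column marker


def serialize_row(row, columns):
    """Serialize one row as 'Column is Value <EOC>' segments in a fixed column order."""
    return " ".join(f"{col} is {row[col]} {EOC}" for col in columns)


def deserialize_row(text, columns):
    """Parse a serialized row back into a dict; return None if it is malformed."""
    segments = [s.strip() for s in text.split(EOC) if s.strip()]
    if len(segments) != len(columns):
        return None
    row = {}
    for col, segment in zip(columns, segments):
        prefix = f"{col} is "
        if not segment.startswith(prefix):
            return None
        row[col] = segment[len(prefix):]  # values come back as strings
    return row


columns = ["age", "occupation", "income"]  # hypothetical example columns
row = {"age": 39, "occupation": "Sales", "income": ">50K"}
text = serialize_row(row, columns)
print(text)  # age is 39 <EOC> occupation is Sales <EOC> income is >50K <EOC>
print(deserialize_row(text, columns))
```

Because the column order is fixed, the position of each segment identifies its column, which is what lets a generated string be checked for validity with a simple parse like `deserialize_row`.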

Tabby Transformer

Language model predicting next token, with specific layers replaced by MoE

Model or implementation: Distilled-GPT2 (modified) or Llama (modified)

Novel Architectural Elements

Column-specific Mixture-of-Experts (MoE) layers replacing standard LM blocks (specifically the LM Head in 'Tabby MH')
Deterministic gating function $y_i = \sum_{j=1}^{V} \mathbb{1}[i = j]\, f_j(x)$, where $f_j$ is the expert for column $j$ and expert selection is hard-coded to the current column index $i$
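
The gating rule above can be illustrated with a small pure-Python sketch. Since the indicator is 1 only when j equals the current column index, the sum collapses to calling a single expert. The toy linear experts, their weights, and the function names are assumptions chosen for illustration; a real model would use learned LM-head weight matrices.

```python
def make_linear_expert(weights):
    """Return f_j: a toy linear map from a hidden vector to per-token scores."""
    def expert(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return expert


def column_moe(x, column_index, experts):
    """Compute y_i = sum_j 1[i == j] * f_j(x).

    Only the term with j == column_index is nonzero, so the sum reduces
    to evaluating that one expert. No learned router is involved.
    """
    return experts[column_index](x)


# Two columns, hidden size 2, toy vocabulary of 3 tokens (all values hypothetical).
experts = [
    make_linear_expert([[1, 0], [0, 1], [1, 1]]),    # expert for column 0
    make_linear_expert([[2, 0], [0, 2], [-1, -1]]),  # expert for column 1
]
x = [0.5, -1.0]  # hidden state for the token being predicted

print(column_moe(x, 0, experts))  # [0.5, -1.0, -0.5]
print(column_moe(x, 1, experts))  # [1.0, -2.0, 0.5]
```

The same hidden state yields different token scores depending on which column is being generated, which is the point of dedicating one expert to each column.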

Modeling

Base Model: Distilled-GPT2 (primary), Llama-3-8B, Llama-2-1.2B (for scaling comparisons)

Training Method: Supervised Fine-Tuning (Next Token Prediction) with specialized MoE architecture

Objective Functions:

Purpose: Minimize negative log-likelihood of the target tokens.

Formally: Standard Cross-Entropy Loss on next-token prediction.
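
Written out under the usual autoregressive factorization, for a serialized row tokenized as x_1, ..., x_T and model parameters θ, the loss would take the following form. The symbols are notation chosen here for illustration, not the paper's.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```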

Adaptation: Full fine-tuning (Distilled-GPT2) or LoRA (Llama models)

Trainable Parameters: Column-specific experts diverge from base initialization; parameter count scales with number of columns (one expert per column)

Training Data:

8 tabular datasets (Adult, Census, Diabetes, etc.)
Serialized using 'Plain' format: ordered list of 'Column is Value' strings separated by EOC tokens

Key Hyperparameters:

base_model: Distilled-GPT2
epochs: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. GReaT: Tabby uses fixed column ordering (Plain) and architectural experts, avoiding complex permutations and valid-sample filtering
vs. Tab-DDPM: Tabby handles mixed types (text, categorical) naturally via LM tokenization, whereas Tab-DDPM struggles with non-integer regression targets
vs. TabDiff: Tabby achieves higher aggregate performance (AUP) with a simpler generative approach

Limitations

Parameter count scales linearly with the number of columns (one expert per column), potentially requiring parameter sharing for very wide tables
Requires fine-tuning a separate model for each dataset (no single foundation model for all tables)
Privacy preservation is similar to prior works but lacks formal guarantees like Differential Privacy
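
The first limitation above, linear growth with column count, can be made concrete with a rough back-of-the-envelope sketch. It assumes each expert is a full, bias-free copy of the language-model head and uses GPT-2-small-style sizes (hidden size 768, vocabulary 50257), which Distilled-GPT2 shares. The helper name and the column counts are hypothetical.

```python
def head_expert_params(hidden_size, vocab_size, num_columns):
    """Weights in num_columns separate LM heads, each hidden_size x vocab_size (no bias)."""
    per_expert = hidden_size * vocab_size
    return per_expert, per_expert * num_columns


HIDDEN, VOCAB = 768, 50257  # GPT-2-small-style sizes

for num_columns in (8, 14, 100):  # hypothetical table widths
    per_expert, total = head_expert_params(HIDDEN, VOCAB, num_columns)
    print(f"{num_columns:>3} columns: {per_expert:,} per expert, {total:,} total head weights")
```

Under these assumptions each expert adds about 38.6M weights, so a 100-column table would carry roughly 3.9B head weights, which is why parameter sharing may be needed for very wide tables.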

Reproducibility

Code: https://github.com/soCromp/tabby

📊 Experiments & Results

Evaluation Setup

Train generative model on train split, generate synthetic data, train downstream classifier/regressor on synthetic data, evaluate on real test split.
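
The protocol above can be sketched in a few lines. This is an illustrative stand-in only: it uses a hand-written 1-nearest-neighbor classifier and made-up toy rows, whereas the downstream models and data used in the paper are not specified in this summary.

```python
def nearest_neighbor_predict(train_X, train_y, x):
    """Predict the label of x as the label of its closest training row (squared L2)."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]


def ml_efficacy(train_X, train_y, test_X, test_y):
    """Accuracy on the real test split of a model fit on the given training rows."""
    correct = sum(
        nearest_neighbor_predict(train_X, train_y, x) == y
        for x, y in zip(test_X, test_y)
    )
    return correct / len(test_y)


# Toy rows (hypothetical): two well-separated classes.
real_train_X = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
real_train_y = [0, 0, 1, 1]
synth_X = [[0.1, 0.0], [0.0, 0.2], [1.1, 1.0], [0.8, 0.9]]  # stands in for generated rows
synth_y = [0, 0, 1, 1]
real_test_X = [[0.1, 0.1], [1.0, 1.0], [0.3, 0.2], [0.7, 0.8]]
real_test_y = [0, 1, 0, 1]

score_real = ml_efficacy(real_train_X, real_train_y, real_test_X, real_test_y)
score_synth = ml_efficacy(synth_X, synth_y, real_test_X, real_test_y)
print(f"trained on real: {score_real:.3f}, trained on synthetic: {score_synth:.3f}")
```

Comparing the two scores is what "parity with real data" refers to: the synthetic-trained model is evaluated on the same real test split as the real-trained one.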

Benchmarks:

Adult (Classification)
Census (Classification)
Diabetes (Classification)
Travel (Classification)
Shoppers (Classification)
Magic (Classification)
House (Regression)
Abalone (Regression)
Rainfall (Regression)

Metrics:

Machine Learning Efficacy (MLE) - Accuracy or R2
Area Under Performance Profile (AUP)
Discrimination (Real vs Synthetic classifier accuracy)
Statistical methodology: Results averaged over 3 runs (Rainfall) or 2 runs (Scaling experiments). No formal significance tests reported.

Key Results

Machine Learning Efficacy (MLE) comparisons show Tabby (Plain) achieving top-tier performance across diverse datasets, often matching real data utility.

Benchmark	Metric	Baseline	This Paper	Δ
Adult	Accuracy	0.854	0.854	0.000
Census	Accuracy	0.956	0.954	-0.002
Diabetes	Accuracy	0.748	0.757	+0.009
House	R2	0.838	0.814	-0.024
Rainfall	R2	0.0	0.505	+0.505

Model scaling experiments on the Travel dataset show smaller Tabby models outperforming larger standard LLMs.

Benchmark	Metric	Baseline	This Paper	Δ
Travel	Accuracy	0.932	0.957	+0.025

Experiment Figures

Performance profiles (AUP curves) comparing 14 synthesis methods across 8 datasets.

Scatter plots of 'Median House Value' vs 'Median Income' for Real, Tabby, GTT, and Tab-DDPM data.

Main Takeaways

Plain-trained Tabby MH achieves the highest Area Under Performance Profile (AUP) across all 8 datasets, surpassing both LLM (GReaT, GTT) and Diffusion (TabDiff, TabSyn) baselines.
Tabby enables small models (Distilled-GPT2) to outperform much larger models (Llama-3-8B) on tabular tasks, suggesting architectural specialization beats parameter scaling for this modality.
The 'Plain' training method is surprisingly effective, outperforming the complex permutation-based GReaT method on all datasets when using the same base model.
Tabby generalizes to nested JSON data, achieving parity with real data where baseline discrimination accuracy is near 50% (indistinguishable).

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Attention, MLP blocks)
Mixture-of-Experts (MoE) layers
Language Model fine-tuning
Tabular data metrics (Machine Learning Efficacy, R2, Accuracy)

Key Terms

Tabby: The proposed architecture modification replacing standard transformer layers with column-specific Mixture-of-Experts

Plain: The proposed training technique using simple ordered serialization without permutations, in contrast to GReaT's permutation-based training

Mixture-of-Experts (MoE): A neural network architecture where different sub-networks (experts) are activated for different inputs; here, experts are assigned to specific table columns

Machine Learning Efficacy (MLE): A metric evaluating synthetic data quality by training a classifier/regressor on synthetic data and testing it on real test data

GReaT: Prior SOTA LLM tabular method that permutes column orders during training to learn conditional distributions

Tab-DDPM: A diffusion-based tabular synthesis model

Distilled-GPT2: A smaller, distilled version of the GPT-2 language model used as the base for most experiments

Llama-3-8B: A large open-weights language model used for comparison to show Tabby's efficiency

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique

GTT: GReaT combined with TapTap (pretraining) and Tabula (encoding), a strong LLM baseline

EOC: End-of-Column token introduced by the authors to delimit feature values in the serialization