Inter-column relationships: Logical dependencies between columns, such as 'City' determining 'Country' (hierarchical) or 'Start Date' < 'End Date' (temporal).
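Both example constraints can be checked mechanically. A minimal sketch (the lookup table, field names, and `check_row` helper are all hypothetical) that flags rows violating a hierarchical or temporal dependency:

```python
from datetime import date

# Hypothetical City -> Country lookup for the hierarchical check
city_to_country = {"Paris": "France", "Osaka": "Japan"}

rows = [
    {"City": "Paris", "Country": "France",
     "Start Date": date(2024, 1, 1), "End Date": date(2024, 2, 1)},
    {"City": "Osaka", "Country": "France",                           # hierarchical violation
     "Start Date": date(2024, 3, 1), "End Date": date(2024, 2, 1)},  # temporal violation
]

def check_row(row):
    """Return the list of inter-column constraints this row violates."""
    violations = []
    if city_to_country.get(row["City"]) != row["Country"]:
        violations.append("City->Country")
    if not row["Start Date"] < row["End Date"]:
        violations.append("Start<End")
    return violations

print([check_row(r) for r in rows])  # → [[], ['City->Country', 'Start<End']]
```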
Score-based diffusion: A generative modeling approach that learns the gradient of the data log-density (score) to iteratively denoise random noise into data.
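The idea can be illustrated with a toy case where the score is known in closed form: for a 1-D Gaussian N(mu, sigma²) the score is -(x - mu)/sigma², so Langevin dynamics can "denoise" a random starting point toward the data distribution. This is only a sketch of the sampling principle, not a full diffusion model; the step size and iteration count are arbitrary choices.

```python
import math
import random

mu, sigma = 3.0, 0.5
step = 0.01  # Langevin step size (hypothetical choice)

def score(x):
    # Analytic score of N(mu, sigma^2): gradient of the log-density
    return -(x - mu) / sigma**2

random.seed(0)
x = random.gauss(0.0, 1.0)  # start from pure noise
for _ in range(5000):
    noise = random.gauss(0.0, 1.0)
    # Langevin update: drift along the score plus injected noise
    x = x + step * score(x) + math.sqrt(2 * step) * noise

print(round(x, 2))  # a sample near mu
```

In a real score-based diffusion model, the analytic `score` above is replaced by a neural network trained to estimate the score at many noise levels.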
Latent space: A compressed vector representation of data where the diffusion model operates, making generation more efficient than in raw data space.
Serialization: Converting tabular data (schema and descriptions) into a text string format that an LLM can process to infer relationships.
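A minimal sketch of such serialization, assuming the schema is available as column-name/description pairs (the table name, descriptions, and prompt wording here are invented for illustration):

```python
# Hypothetical schema: column names mapped to human-readable descriptions
schema = {
    "Price":    "Unit price of the item in USD",
    "Quantity": "Number of items ordered",
    "Total":    "Order total in USD",
}

def serialize(table_name, columns):
    """Flatten a schema into a text prompt an LLM can reason over."""
    lines = [f"Table: {table_name}", "Columns:"]
    lines += [f"- {name}: {desc}" for name, desc in columns.items()]
    lines.append("List any logical dependencies between these columns.")
    return "\n".join(lines)

print(serialize("orders", schema))
```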
Hierarchical consistency: Dependencies where one column represents a subgroup of another (e.g., City belongs to Country).
Mathematical dependencies: Deterministic relationships defined by formulas (e.g., Total = Price * Quantity).
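Because such relationships are deterministic, a generated row that violates the formula can simply be repaired by recomputing the derived column. A minimal sketch (row values and the `enforce_total` helper are hypothetical):

```python
rows = [
    {"Price": 2.5, "Quantity": 4, "Total": 9.0},  # violates Total = Price * Quantity
    {"Price": 1.0, "Quantity": 3, "Total": 3.0},  # already consistent
]

def enforce_total(row):
    """Recompute the derived column so the formula holds exactly."""
    fixed = dict(row)
    fixed["Total"] = fixed["Price"] * fixed["Quantity"]
    return fixed

fixed_rows = [enforce_total(r) for r in rows]
print([r["Total"] for r in fixed_rows])  # → [10.0, 3.0]
```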
Temporal dependencies: Sequential constraints on time-based columns (e.g., Step 1 happens before Step 2).
SMOTE: Synthetic Minority Over-sampling Technique—a classic method that generates synthetic minority-class samples by interpolating between a real sample and one of its nearest neighbors.
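The core interpolation step can be sketched in a few lines. This is a simplified illustration, not the library implementation: the neighbor search is brute force, and `k` and the sample data are arbitrary.

```python
import random

def smote_sample(minority, k=2, rng=random.Random(0)):
    """Create one synthetic point on the segment between a minority
    sample and one of its k nearest neighbors."""
    x = rng.choice(minority)
    # Brute-force k nearest neighbors by squared Euclidean distance
    neighbors = sorted((p for p in minority if p is not x),
                       key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
    nn = rng.choice(neighbors)
    lam = rng.random()  # interpolation factor in [0, 1)
    return tuple(a + lam * (b - a) for a, b in zip(x, nn))

# Hypothetical 2-D minority-class points
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0)]
print(smote_sample(minority))
```

Because the synthetic point is a convex combination of two real points, it always lies on the line segment between them — the defining property (and the main limitation) of interpolation-based oversampling.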
CTGAN: Conditional Tabular GAN—a generative adversarial network specifically designed for tabular data.
VAE: Variational Autoencoder—a neural network that learns a probabilistic encoding of data into a latent space and a decoder that reconstructs data from it, so new samples can be drawn from the latent distribution.