SampleLLM: Optimizing Tabular Data Synthesis in Recommendations

📝 Paper Summary

Tabular Data Synthesis · Recommender Systems

SampleLLM improves LLM-based tabular data synthesis for recommender systems by aligning the synthetic data distribution with the original data through chain-of-thought prompting and feature attribution-based importance sampling.

Core Problem

Existing tabular synthesis methods struggle with sparse recommendation data, and LLM-based approaches often produce data with inconsistent distributions and limited diversity because the LLM's inherent knowledge is misaligned with the target dataset.

Why it matters:

Recommender systems suffer from data sparsity (cold start), severely limiting model performance.
Traditional statistical/deep learning synthesis methods fail to capture semantic feature relationships essential for modern recommendation models.
Directly prompting LLMs for data often results in biased distributions that degrade downstream model utility rather than enhancing it.

Concrete Example: When exemplars are limited, simple random selection for few-shot prompting can overlook important regions of the original dataset (e.g., rare user-item interactions). This reduces output diversity and causes a mismatch in feature distributions, as shown in the paper's motivational figure.

Key Novelty

Two-stage Distribution Alignment Framework (SampleLLM)

Stage 1: Uses Chain-of-Thought refined instructions and cluster-based exemplar sampling to generate diverse initial synthetic data that better captures semantic relationships.
Stage 2: Applies a novel feature attribution-based importance sampling method to re-weight synthetic samples, ensuring their statistical distribution matches the original dataset's key feature interactions.

Architecture

The two-stage framework of SampleLLM. Stage 1: LLM-based generation with refined instructions and cluster-sampled exemplars. Stage 2: Distribution alignment via feature attribution and importance sampling.

Evaluation Highlights

Outperforms state-of-the-art baselines (including Tabula and GReaT) on 3 recommendation datasets and 2 general tabular datasets.
Online deployment in a Huawei app scenario showed a +1.45% improvement in CTR (Click-Through Rate) and +1.18% in CVR (Conversion Rate).
Achieves superior machine learning efficacy (MLE): models trained on SampleLLM's synthetic data perform closer to models trained on real data than models trained on data from other synthesis methods.

Breakthrough Assessment

7/10

Significant practical contribution by addressing the specific distribution alignment issues of LLMs in tabular data. The two-stage approach is methodologically sound and validated in a real-world online setting.

⚙️ Technical Details

Problem Definition

Setting: Generating a synthetic dataset D_s that mimics the distribution and utility of an original tabular dataset D_o = {x_1, ..., x_N} with attributes and labels.

Inputs: Original tabular dataset D_o (sparse, limited samples), LLM for generation.

Outputs: Synthetic dataset D_s with aligned distribution P(D_s) ≈ P(D_o).

Pipeline Flow

Instruction Design (Manual selection + CoT refinement)
Exemplar Selection (Clustering-based sampling)
LLM Generation (Few-shot synthesis)
Distribution Alignment (Feature attribution + Importance sampling)

System Modules

Instruction Refiner

Optimizes the task description prompt using CoT with official documents and data samples to extract key information.

Model or implementation: LLM (e.g., GPT-3.5-turbo)

Exemplar Selector

Selects diverse examples from the original dataset to serve as few-shot prompts.

Model or implementation: K-Means Clustering
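
A minimal sketch of clustering-based exemplar selection, assuming rows are already encoded as numeric vectors. The summary only states that K-Means is used; picking the row nearest each centroid, and the names `kmeans` and `select_exemplars`, are illustrative assumptions rather than the paper's procedure.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def select_exemplars(points, k, seed=0):
    """Return one row index per non-empty cluster (assumed rule: nearest to centroid)."""
    centroids, labels = kmeans(points, k, seed=seed)
    chosen = []
    for c in range(k):
        idx = [i for i, l in enumerate(labels) if l == c]
        if idx:
            chosen.append(min(idx, key=lambda i: sq_dist(points[i], centroids[c])))
    return sorted(chosen)

rows = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(select_exemplars(rows, k=2))
```

Compared with uniform random selection, drawing from every cluster guarantees that each region of the feature space found by the clustering contributes at least one exemplar.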

Data Generator

Generates raw synthetic tabular samples using the refined instruction and diverse exemplars.

Model or implementation: LLM (e.g., GPT-3.5-turbo)
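
To illustrate how a refined instruction and the selected exemplars might be combined into a few-shot prompt, here is a hedged sketch. The "field is value" serialization, the prompt wording, and the helper names are hypothetical; the paper's actual prompt is not reproduced in this summary, and `batch_size` is assumed to mean the number of records requested per call.

```python
def serialize_row(row):
    """Render one tabular record as 'field is value' text (assumed format)."""
    return ", ".join(f"{name} is {value}" for name, value in row.items())

def build_prompt(instruction, exemplars, batch_size):
    """Assemble instruction, numbered exemplars, and the generation request."""
    lines = [instruction.strip(), "", "Examples:"]
    lines += [f"{i}. {serialize_row(r)}" for i, r in enumerate(exemplars, 1)]
    lines += ["", f"Generate {batch_size} new records in the same format, one per line."]
    return "\n".join(lines)

prompt = build_prompt(
    "Synthesize news click logs.",
    [{"user": "u1", "category": "sports", "clicked": 1}],
    batch_size=10,
)
print(prompt)
```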

Distribution Aligner

Re-weights synthetic samples to match the original distribution using importance sampling based on feature interactions.

Model or implementation: Feature Attribution (Taylor Expansion) + Importance Sampling

Novel Architectural Elements

Integration of feature attribution (via Taylor expansion) to identify non-independent feature groups for constructing a semi-independence probability model used in importance sampling.
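
The summary does not give the attribution formula, so the sketch below starts from pairwise interaction scores that are assumed to be already computed. It shows only one plausible grouping step: pairs whose score exceeds the threshold gamma are merged into the same group with union-find, and all remaining features stay as singletons. The function name and the merging rule are assumptions.

```python
def group_features(features, interaction, gamma=0.01):
    """Merge features linked by |score| > gamma into non-independent groups."""
    parent = {f: f for f in features}

    def find(f):
        # Follow parent links to the group representative (with path halving).
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    for (a, b), score in interaction.items():
        if abs(score) > gamma:
            parent[find(a)] = find(b)

    groups = {}
    for f in features:
        groups.setdefault(find(f), []).append(f)
    return sorted(sorted(g) for g in groups.values())

scores = {("user", "item"): 0.3, ("item", "hour"): 0.005, ("hour", "device"): 0.02}
print(group_features(["user", "item", "hour", "device"], scores))
```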

Modeling

Base Model: gpt-3.5-turbo-0613 (for data generation)

Training Method: Importance Sampling (Post-generation weighting)

Objective Functions:

Purpose: Calculate sample weights to align synthetic distribution to real distribution.

Formally: w(x) = P_Do(x) / P_Ds'(x), where D_s' denotes the raw synthetic data before alignment and both probabilities are estimated using semi-independent feature groups identified by interaction attribution.
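
The weight formula can be sketched as follows, assuming categorical features and feature groups supplied as lists of column names. Each dataset's probability is approximated as a product of per-group empirical frequencies. The additive smoothing (`alpha`), the weight normalization, and the resampling helper are assumptions added to keep the example well-defined, not details from the paper.

```python
import random
from collections import Counter

def fit_group_model(rows, groups, vocab_sizes, alpha=1.0):
    """Return prob(row): product over groups of a smoothed empirical frequency."""
    counts = [Counter(tuple(r[f] for f in g) for r in rows) for g in groups]
    n = len(rows)

    def prob(row):
        p = 1.0
        for g, c, v in zip(groups, counts, vocab_sizes):
            key = tuple(row[f] for f in g)
            p *= (c[key] + alpha) / (n + alpha * v)
        return p

    return prob

def importance_weights(original, synthetic, groups, alpha=1.0):
    """Normalized w(x) = P_Do(x) / P_Ds'(x) for every raw synthetic row."""
    vocab_sizes = [
        len({tuple(r[f] for f in g) for r in original + synthetic}) for g in groups
    ]
    p_o = fit_group_model(original, groups, vocab_sizes, alpha)
    p_s = fit_group_model(synthetic, groups, vocab_sizes, alpha)
    raw = [p_o(x) / p_s(x) for x in synthetic]
    total = sum(raw)
    return [w / total for w in raw]

def resample(synthetic, weights, n, seed=0):
    """Draw n rows with replacement, proportionally to the weights."""
    return random.Random(seed).choices(synthetic, weights=weights, k=n)

original = [{"genre": "a"}] * 3 + [{"genre": "b"}]
synthetic = [{"genre": "a"}] + [{"genre": "b"}] * 3
w = importance_weights(original, synthetic, [["genre"]])
print(w)
```

In this toy input the value "a" is common in the original data but rare in the synthetic data, so the single synthetic "a" row receives a larger weight than each "b" row.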

Key Hyperparameters:

number_of_exemplars_a: 10
batch_size_b: 10
interaction_threshold_gamma: 0.01
generation_rounds_Q: Not reported as a fixed number; varies with the required dataset size

Compute: Experiments run on NVIDIA Tesla V100 GPU.

Comparison to Prior Work

vs. Tabula/GReaT: SampleLLM avoids expensive fine-tuning and explicitly aligns distributions via post-hoc importance sampling rather than relying solely on the generative model's learned distribution.
vs. CTGAN/TVAE: SampleLLM leverages the semantic understanding of LLMs, which statistical deep learning models lack, enabling better handling of textual features in recommendations.

Limitations

Dependency on the quality of the underlying LLM (e.g., GPT-3.5).
Inference costs for LLM generation can be high compared to GANs/VAEs.
The semi-independence assumption may still miss some complex high-order feature interactions.

Reproducibility

The paper does not provide a code release. The paper uses gpt-3.5-turbo-0613. Hyperparameters for baselines are detailed in Appendix B.1.

📊 Experiments & Results

Evaluation Setup

Train downstream models on synthetic (or augmented) data and evaluate on real test data.

Benchmarks:

MIND: News Recommendation (Click Prediction)
Alibaba: E-commerce Recommendation (Click Prediction)
ML-100k: Movie Recommendation (Rating Prediction)
Adult: Income Classification (General Tabular)
Churn: Customer Churn Prediction (General Tabular)

Metrics:

AUC (Area Under Curve)
LogLoss
F1 Score
Statistical methodology: Not explicitly reported in the paper
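
For reference, the two ranking and calibration metrics above can be computed as in this dependency-free sketch. The pairwise AUC is quadratic in the number of samples and is meant only to make the definition concrete; it is not the evaluation code used in the paper.

```python
import math

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(log_loss([1, 0], [0.5, 0.5]))
```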

Key Results

Machine Learning Efficacy (MLE) results where models are trained purely on synthetic data and evaluated on real test data. SampleLLM generally outperforms baselines.
Benchmark	Metric	Baseline	This Paper	Δ
MIND	AUC	0.5824	0.5891	+0.0067
Alibaba	AUC	0.5731	0.5802	+0.0071
ML-100k	AUC	0.6401	0.6511	+0.0110

Data Augmentation results where synthetic data is added to the real training set.
Benchmark	Metric	Baseline	This Paper	Δ
MIND	AUC	0.6135	0.6272	+0.0137
Alibaba	AUC	0.5815	0.5983	+0.0168

Experiment Figures

Impact of synthetic data ratio on model performance (AUC) for different methods on MIND and Alibaba datasets.

Sensitivity analysis of hyper-parameters: number of exemplars (a) and interaction threshold (gamma).

Main Takeaways

SampleLLM consistently achieves the best performance in both MLE (synthetic training only) and augmentation settings across recommendation datasets.
The method generalizes well to general tabular tasks (Adult, Churn), showing it's not limited to recommendations.
Ablation studies confirm that both the clustering-based exemplar selection and the importance sampling alignment module are critical to the performance gains.

📚 Prerequisite Knowledge

Prerequisites

Tabular data structure (features, labels)
Large Language Models (few-shot prompting)
Importance Sampling
Recommender Systems basics (CTR/CVR prediction)

Key Terms

CoT: Chain-of-Thought—a prompting strategy where the model is encouraged to generate intermediate reasoning steps before the final answer.

Importance Sampling: A statistical technique used to estimate properties of a target distribution by sampling from a different proposal distribution and weighting the samples.

Feature Attribution: Methods to explain model predictions by assigning importance scores to input features (e.g., using gradients or perturbations).

MLE utility: Machine Learning Efficacy—a metric measuring how well a model trained on synthetic data performs on real test data compared to a model trained on real data.

Semi-independence assumption: A simplifying assumption where only strongly interacting feature fields are considered dependent, while others are treated as independent to reduce computational complexity in probability estimation.
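
The Importance Sampling and Semi-independence entries above can be written compactly as follows. The first line is the standard importance sampling identity for a target density p and proposal density q; with p = P_Do and q = P_Ds' it yields the weight given under Objective Functions. The second line is one way to write the semi-independence factorization. Its notation (a partition \mathcal{G} of the feature fields, with singleton groups for features that have no strong interactions) is an assumption, not taken from the paper.

```latex
% Importance sampling identity (requires q(x) > 0 wherever p(x) f(x) \neq 0)
\mathbb{E}_{x \sim p}[f(x)]
  = \int f(x)\, p(x)\, dx
  = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx
  = \mathbb{E}_{x \sim q}\!\left[ w(x)\, f(x) \right],
\qquad w(x) = \frac{p(x)}{q(x)}

% Semi-independence: joint probability factorizes over feature groups (assumed notation)
P(x) \approx \prod_{g \in \mathcal{G}} P(x_g)
```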