Robust Tabular Foundation Models

📝 Paper Summary

Tabular Foundation Models (TFMs) · Synthetic Data Generation · Adversarial Training

RTFM improves tabular foundation models by adversarially optimizing the synthetic data generator to create datasets where the model currently underperforms relative to strong tree-based baselines.

Core Problem

Tabular Foundation Models are pretrained on synthetic data from fixed prior distributions, which underrepresent challenging regions of the parameter space, causing them to lag behind tree-based methods on some real-world tasks.

Why it matters:

Fixed priors in data generation leave blind spots where TFMs fail to generalize to complex real-world structures.
Deep learning methods still struggle to consistently beat gradient boosted trees (XGBoost, CatBoost) on structured data benchmarks.

Concrete Example: A TFM trained on balanced synthetic data might fail on a real-world dataset with high class imbalance or specific categorical feature ratios because those conditions were rare in its training distribution, while XGBoost handles them robustly.

Key Novelty

Adversarial Training over the Data Generator Space (RTFM)

Treats the parameters of the synthetic data generator (e.g., feature correlations, sparsity) as the adversarial space.
Maximizes an 'optimality gap': the difference between the TFM's loss and the loss of strong baselines (like XGBoost) on generated data.
Updates the sampling distribution to focus training on these 'hard' regions where the TFM is currently worse than traditional methods.
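
The optimality gap described above can be sketched as follows. This is a minimal illustration that assumes per-dataset losses are already available as plain floats; the function name `optimality_gap` and the example numbers are hypothetical and not taken from the paper.

```python
from typing import Sequence


def optimality_gap(tfm_loss: float, baseline_losses: Sequence[float]) -> float:
    """Gap between the TFM loss and the best (lowest) baseline loss.

    A positive value means the TFM is worse than at least one baseline
    on this dataset, i.e. the generator parameters are 'hard' for it.
    """
    if not baseline_losses:
        raise ValueError("need at least one baseline loss")
    return tfm_loss - min(baseline_losses)


# Hypothetical losses on one generated dataset (illustrative values only).
gap = optimality_gap(tfm_loss=0.52, baseline_losses=[0.41, 0.47])
print(round(gap, 2))
```

A negative gap would indicate that the TFM already beats every baseline on that dataset, so such parameter regions would receive little weight in later training.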

Architecture

The two-stage RTFM optimization loop. Stage 1 (Maximization) searches for SCM parameters with high optimality gaps using baseline models. Stage 2 (Minimization) trains the TFM on datasets sampled from these hard parameter regions.
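
The alternation between the two stages can be sketched with toy stand-ins. This is a simplified illustration, not the paper's implementation: Stage 1 is reduced to a top-k selection over a fixed candidate list, and `rtfm_loop`, `estimate_gap`, `train_on`, and the `imbalance` parameter are all hypothetical names.

```python
from typing import Callable, Dict, List, Sequence


def rtfm_loop(
    candidates: Sequence[Dict[str, float]],
    estimate_gap: Callable[[Dict[str, float]], float],
    train_on: Callable[[List[Dict[str, float]]], None],
    n_rounds: int = 3,
    n_hard: int = 2,
) -> List[List[Dict[str, float]]]:
    """Alternate between finding hard generator parameters and training on them."""
    history = []
    for _ in range(n_rounds):
        # Stage 1 (maximization): rank candidates by their estimated gap.
        ranked = sorted(candidates, key=estimate_gap, reverse=True)
        hard = list(ranked[:n_hard])
        # Stage 2 (minimization): train the model on the hard regions.
        train_on(hard)
        history.append(hard)
    return history


# Toy stand-ins: the 'model' is a table of gaps that shrink when trained on.
gaps = {0.1: 0.30, 0.5: 0.10, 0.9: 0.20}
candidates = [{"imbalance": level} for level in gaps]


def estimate_gap(params: Dict[str, float]) -> float:
    return gaps[params["imbalance"]]


def train_on(hard: List[Dict[str, float]]) -> None:
    for params in hard:
        gaps[params["imbalance"]] *= 0.5  # pretend training halves the gap


history = rtfm_loop(candidates, estimate_gap, train_on, n_rounds=2, n_hard=1)
```

In this toy run the hardest region changes between rounds, because training on one setting shrinks its gap and lets another setting become the worst case.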

Evaluation Highlights

A 6% increase in mean normalized AUC over the original TabPFN V2 on the TabArena and TabPertNet benchmarks.
Achieves state-of-the-art ranking (Rank 1.9 on TabArena) compared to XGBoost (3.4) and CatBoost (2.2).
Requires fewer than 100k additional synthetic datasets for fine-tuning, roughly 1% of the original pretraining data.

Breakthrough Assessment

8/10

Offers a highly efficient, model-agnostic method to close the gap between deep learning and tree-based methods on tabular data using targeted synthetic data, demonstrating significant gains with minimal compute.

⚙️ Technical Details

Problem Definition

Setting: Classification on tabular datasets generated from parametrized Structural Causal Models (SCMs).

Inputs: A set of labeled support samples (training set) and unlabeled query samples (test set) from a generated dataset.

Outputs: Predicted labels for the query samples.
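
The input/output contract above can be written down as a small data structure. This is only a sketch of the interface: `TabularTask` and `majority_class_predict` are hypothetical names, and the majority-class rule is a placeholder for a real in-context predictor such as a TFM.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TabularTask:
    support_x: List[Sequence[float]]  # labeled training rows
    support_y: List[int]              # labels for the support rows
    query_x: List[Sequence[float]]    # unlabeled test rows


def majority_class_predict(task: TabularTask) -> List[int]:
    """Placeholder predictor: give every query row the most common support label."""
    most_common_label = Counter(task.support_y).most_common(1)[0][0]
    return [most_common_label for _ in task.query_x]


# Illustrative task with three support rows and two query rows.
task = TabularTask(
    support_x=[[0.1, 1.0], [0.4, 0.0], [0.9, 1.0]],
    support_y=[1, 0, 1],
    query_x=[[0.2, 1.0], [0.8, 0.0]],
)
predictions = majority_class_predict(task)
```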

Pipeline Flow

Maximization Stage: Search for generator parameters
Minimization Stage: Train TFM on hard distributions

System Modules

Parameter Search (Adversary) (Maximization Stage)

Identifies SCM parameters where the TFM performs poorly relative to baselines.

Model or implementation: Tree-structured Parzen Estimator (TPE) via Optuna
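
To illustrate the shape of this search step without depending on Optuna, the sketch below swaps in plain random search over a discretized space. It is not TPE and not the paper's code; the search space, parameter names, and `toy_gap` objective are hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical discretized search space; names and values are illustrative.
SEARCH_SPACE: Dict[str, List[float]] = {
    "mlp_depth": [1, 2, 4],
    "categorical_ratio": [0.0, 0.25, 0.5],
    "missing_rate": [0.0, 0.1, 0.3],
}


def random_search(
    estimate_gap: Callable[[Dict[str, float]], float],
    n_trials: int,
    seed: int = 0,
) -> List[Tuple[Dict[str, float], float]]:
    """Sample parameters uniformly; return (params, gap) pairs, largest gap first."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        trials.append((params, estimate_gap(params)))
    return sorted(trials, key=lambda trial: trial[1], reverse=True)


def toy_gap(params: Dict[str, float]) -> float:
    # Toy objective: pretend the TFM struggles with missing values and categoricals.
    return params["missing_rate"] + 0.5 * params["categorical_ratio"]


trials = random_search(toy_gap, n_trials=100)
best_params, best_gap = trials[0]
```

A model-based sampler such as TPE would replace the uniform `rng.choice` draws with proposals informed by earlier trials, which typically needs fewer evaluations to find high-gap regions.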

Distribution Generator (Maximization Stage)

Constructs a sampling distribution Q over parameters based on their optimality gaps.

Model or implementation: Softmax distribution
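
A softmax over estimated gaps can be sketched as below. The `temperature` parameter is an assumption introduced for illustration: raising it flattens the distribution, which is one plausible way to keep the entropy of Q above a floor, but the summary does not state how the paper enforces that constraint.

```python
import math
from typing import List, Sequence


def softmax_distribution(gaps: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Turn estimated optimality gaps into sampling probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [gap / temperature for gap in gaps]
    largest = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - largest) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Illustrative gaps for three candidate parameter settings.
gaps = [0.30, 0.10, 0.20]
sharp = softmax_distribution(gaps, temperature=0.1)
flat = softmax_distribution(gaps, temperature=1.0)
```

Here `sharp` concentrates most of its mass on the first setting, while `flat` stays close to uniform and therefore has higher entropy.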

Data Generator (Minimization Stage)

Creates synthetic datasets for training.

Model or implementation: Randomized Multi-Layer Perceptrons (MLPs) as SCMs
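
A heavily simplified version of this idea is sketched below: a single-hidden-layer MLP with random weights maps Gaussian inputs to a score, which is thresholded into a binary label. The function name, its parameters, and the median-threshold labeling are assumptions for illustration; the actual generator exposes many more parameters.

```python
import math
import random
from typing import List, Tuple


def sample_mlp_dataset(
    n_rows: int, n_features: int, hidden_width: int, seed: int = 0
) -> Tuple[List[List[float]], List[int]]:
    """Generate one synthetic binary classification dataset from a random MLP."""
    rng = random.Random(seed)
    # Random weights define the generating mechanism; sampled once per dataset.
    w_hidden = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(hidden_width)]
    w_out = [rng.gauss(0, 1) for _ in range(hidden_width)]
    rows = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_rows)]

    scores = []
    for row in rows:
        hidden = [math.tanh(sum(w * x for w, x in zip(ws, row))) for ws in w_hidden]
        scores.append(sum(w * h for w, h in zip(w_out, hidden)))

    # Threshold at the median score to obtain a roughly balanced binary label.
    threshold = sorted(scores)[n_rows // 2]
    labels = [int(score >= threshold) for score in scores]
    return rows, labels


features, labels = sample_mlp_dataset(n_rows=200, n_features=5, hidden_width=8)
```

Changing `hidden_width`, the number of layers, or the labeling threshold changes the kind of dataset produced, which is what makes such settings usable as a search space for the adversary.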

TFM Trainer (Minimization Stage)

Updates the TFM weights to minimize loss on the generated hard datasets.

Model or implementation: TabPFN V2 (Transformer-based)

Novel Architectural Elements

Adversarial loop over the data generation process itself, rather than input perturbations.
Use of 'Optimality Gap' (relative to baselines) as the adversarial objective instead of pure loss maximization.

Modeling

Base Model: TabPFN V2 (Transformer-based)

Training Method: Two-stage Adversarial Training (Min-Max Game)

Objective Functions:

Purpose: Maximize the gap between TFM loss and baseline loss to find hard distributions.

Formally: max_Q E_{phi ~ Q} [ L_{PFN}(W; phi) - min_k L_{PFN}(f_k; phi) ] subject to H(Q) >= H_min

Purpose: Minimize TFM loss on the identified hard distributions.

Formally: min_W E_{phi ~ Q*} [ L_{PFN}(W; phi) ]
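
The two objectives above can be read as alternating steps on a single min-max problem. The block below restates them in LaTeX using the same symbols and shows why the baseline term drops out of the minimization stage. The final line is an assumption about how a softmax form could arise: it is the standard maximizer of an entropy-regularized objective over a finite candidate set, with g_i denoting the gap at candidate phi_i and lambda a hypothetical regularization weight, and the summary does not confirm that the paper derives it this way.

```latex
% Joint objective: the two stages alternate on one min-max problem.
\[
\min_{W} \; \max_{Q \,:\, H(Q) \ge H_{\min}} \;
  \mathbb{E}_{\phi \sim Q}\!\left[ L_{PFN}(W;\phi) - \min_{k} L_{PFN}(f_k;\phi) \right]
\]

% For a fixed Q^*, the baseline term does not depend on W, so it can be dropped.
\[
\arg\min_{W} \mathbb{E}_{\phi \sim Q^*}\!\left[ L_{PFN}(W;\phi) - \min_{k} L_{PFN}(f_k;\phi) \right]
  = \arg\min_{W} \mathbb{E}_{\phi \sim Q^*}\!\left[ L_{PFN}(W;\phi) \right]
\]

% Assumed entropy-regularized variant over candidates \phi_1, \dots, \phi_n.
\[
\max_{Q} \; \sum_{i=1}^{n} Q(\phi_i)\, g_i + \lambda H(Q)
  \quad\Longrightarrow\quad
  Q^*(\phi_i) = \frac{\exp(g_i / \lambda)}{\sum_{j=1}^{n} \exp(g_j / \lambda)}
\]
```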

Training Data:

Purely synthetic data generated during training loop
Generators (SCMs) are MLPs with parameters for depth, width, categorical ratio, missingness, etc.

Key Hyperparameters:

learning_rate: 1e-5
batch_size: 64
training_steps_per_iteration: 3000
max_min_epochs: 30
n_trials_search: 100
n_datasets_estimation: 20
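
The listed values can be collected into a small configuration object. The class name and the two derived quantities are illustrative: they assume that `max_min_epochs` counts max-min alternations, that each alternation runs `training_steps_per_iteration` training steps, and that every search trial estimates its gap on `n_datasets_estimation` datasets. Those readings are inferred from the parameter names, not stated in the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RTFMConfig:
    learning_rate: float = 1e-5
    batch_size: int = 64
    training_steps_per_iteration: int = 3000
    max_min_epochs: int = 30
    n_trials_search: int = 100
    n_datasets_estimation: int = 20

    @property
    def total_training_steps(self) -> int:
        # Assumed reading: one block of training steps per max-min epoch.
        return self.max_min_epochs * self.training_steps_per_iteration

    @property
    def estimation_datasets_per_search(self) -> int:
        # Assumed reading: each trial scores its parameters on this many datasets.
        return self.n_trials_search * self.n_datasets_estimation


config = RTFMConfig()
print(config.total_training_steps)            # 30 * 3000
print(config.estimation_datasets_per_search)  # 100 * 20
```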

Compute: Single node with one A100 GPU and 256 CPU cores (for parallel baseline fitting)

Comparison to Prior Work

vs. TabPFN V2: RTFM dynamically adapts the training data distribution instead of using a fixed prior.
vs. Wu and Bergman (2025): RTFM uses a broad SCM parameter space and 'optimality gap' rather than just adjusting weights of specific SCM classes.
vs. Standard Adversarial Training: Optimizes the *data generating process* parameters (e.g., causal structure) rather than adding noise to input features.

Limitations

Currently restricted to MLP-based Structural Causal Models (SCMs).
Requires fitting multiple baseline models (XGBoost, etc.) during the maximization step, which is computationally intensive (requires many CPU cores).
Depends on the quality and diversity of the baseline estimators to approximate the true 'optimality gap'.
Parameter search space is discrete/discretized.

Reproducibility

The paper text does not state whether code is available. SCM parameter ranges and training hyperparameters are detailed in Appendix B. The method relies on standard libraries (Optuna, XGBoost, CatBoost).

📊 Experiments & Results

Evaluation Setup

Classification tasks on real-world tabular datasets.

Benchmarks:

TabPertNet (Tabular Classification)
TabArena (Tabular Classification)

Metrics:

Mean Rank AUC
Mean Normalized AUC
Rank-1 Wins
Statistical methodology: Friedman test for repeated samples on median normalized AUC; Wilcoxon signed-rank test for pair-wise comparisons.
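
The rank and normalized-AUC aggregates can be sketched as follows. The exact definitions used by the benchmarks are not given in this summary, so the sketch assumes the common conventions: per-dataset ranks with ties sharing the average rank, and per-dataset min-max normalization of AUC across methods. The method names and AUC values are hypothetical.

```python
from typing import Dict, List


def ranks_per_dataset(auc_by_method: Dict[str, float]) -> Dict[str, float]:
    """Rank methods on one dataset (1 = highest AUC); ties share the average rank."""
    ordered = sorted(auc_by_method.items(), key=lambda item: item[1], reverse=True)
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for name, _ in ordered[i : j + 1]:
            ranks[name] = average_rank
        i = j + 1
    return ranks


def normalize_per_dataset(auc_by_method: Dict[str, float]) -> Dict[str, float]:
    """Min-max normalize AUC across methods on one dataset (assumed convention)."""
    lowest, highest = min(auc_by_method.values()), max(auc_by_method.values())
    if highest == lowest:
        return {name: 1.0 for name in auc_by_method}
    return {name: (auc - lowest) / (highest - lowest) for name, auc in auc_by_method.items()}


def mean_over_datasets(per_dataset: List[Dict[str, float]]) -> Dict[str, float]:
    names = per_dataset[0].keys()
    return {name: sum(d[name] for d in per_dataset) / len(per_dataset) for name in names}


# Hypothetical AUC values for three methods on two datasets.
results = [
    {"model_a": 0.90, "model_b": 0.80, "model_c": 0.85},
    {"model_a": 0.70, "model_b": 0.75, "model_c": 0.70},
]
mean_rank = mean_over_datasets([ranks_per_dataset(r) for r in results])
mean_norm_auc = mean_over_datasets([normalize_per_dataset(r) for r in results])
```

Lower mean rank is better, while higher mean normalized AUC is better, which is why the two metrics move in opposite directions when a method improves.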

Key Results

RTFM significantly improves TabPFN performance on the TabPertNet benchmark, surpassing all baselines. For rank metrics, lower is better.

Benchmark	Metric	Baseline	This Paper	Δ
TabPertNet	Mean Rank AUC	3.2	2.7	-0.5
TabPertNet	Mean Norm. AUC	0.7483	0.8167	+0.0684
TabPertNet	Mean Norm. AUC	0.6481	0.8167	+0.1686

RTFM also dominates on the TabArena benchmark, achieving the best rank and AUC.

Benchmark	Metric	Baseline	This Paper	Δ
TabArena	Mean Rank AUC OVO	2.2	1.9	-0.3
TabArena	Mean Norm. AUC OVO	0.9031	0.9298	+0.0267
TabArena	Mean Norm. AUC OVO	0.7749	0.9298	+0.1549

Experiment Figures

Maximum estimated optimality gap found during each maximization stage over the course of training epochs.

Main Takeaways

RTFM outperforms both the original TabPFN and all tree-based baselines (XGBoost, CatBoost, Random Forest) on the reported aggregate metrics across the tested benchmarks.
The method is particularly effective on datasets where the original TFM performed poorly (Rank >> 1), often helping the model 'leap' to the top rank.
Improvements are achieved with minimal additional synthetic data (90k datasets), which points to the efficiency of targeted adversarial training.
The approach is model-agnostic and can be applied to other TFMs or extended to regression tasks.

📚 Prerequisite Knowledge

Prerequisites

Tabular Deep Learning
Prior-Fitted Networks (PFNs)
Structural Causal Models (SCMs)
Adversarial Training / Distributionally Robust Optimization

Key Terms

TFM: Tabular Foundation Model, a transformer-based model pretrained on vast amounts of synthetic data to perform in-context learning on new tabular tasks.

SCM: Structural Causal Model, a mathematical model used here to generate synthetic tabular datasets by defining causal relationships between features and targets.

Optimality Gap: The difference between the TFM's performance and the 'best achievable' performance (approximated by strong baselines like XGBoost) on a specific dataset.

DRO: Distributionally Robust Optimization, an optimization framework that seeks to minimize loss over the worst-case distribution within an uncertainty set.

TabPFN: A specific Tabular Foundation Model architecture that uses transformers to approximate Bayesian inference on tabular data.

PFN: Prior-Fitted Network—a neural network trained to approximate the posterior predictive distribution implied by a prior over datasets.