_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
LLM: Large Language Model—a deep learning model trained on large text corpora that can recognize, summarize, translate, predict, and generate text and other content
FT-Transformer: Feature Tokenizer Transformer—a deep learning architecture designed for tabular data; a feature tokenizer converts each numerical and categorical feature into an embedding, and Transformer layers process the resulting embeddings
BGE: BAAI General Embedding—a text embedding model from the Beijing Academy of Artificial Intelligence (BAAI) that maps sentences into dense vector representations
ResNet: Residual Network—a neural network architecture that uses skip connections to allow training of deeper networks
MLP: Multi-Layer Perceptron—a feedforward artificial neural network built from fully connected layers with nonlinear activations
XGBoost: eXtreme Gradient Boosting—a scalable implementation of the gradient boosting framework (sequentially trained ensembles of decision trees, each fitting the errors of its predecessors), often considered state-of-the-art for tabular data
MTEB: Massive Text Embedding Benchmark—a benchmark for evaluating the performance of text embedding models
PCA: Principal Component Analysis—a dimensionality reduction technique that projects data onto the orthogonal directions of greatest variance, commonly used to visualize high-dimensional data (see the PCA sketch after this list)
SELU: Scaled Exponential Linear Unit—an activation function that induces self-normalizing properties in neural networks, driving activations toward zero mean and unit variance across layers (see the SELU sketch after this list)
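As a minimal, illustrative sketch of the PCA projection defined above (not code from any summarized paper; the toy matrix `X` and the two-component choice are assumptions for demonstration):

```python
import numpy as np

def pca_project(X: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project rows of X onto the top principal axes (directions of
    greatest variance), e.g. down to 2-D for visualization."""
    X_centered = X - X.mean(axis=0)          # PCA assumes mean-centered data
    # Rows of Vt are the principal directions, ordered by singular value
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T  # coordinates in the PCA basis

# Toy usage (hypothetical data): 5 samples with 4 features -> 2-D points
X = np.random.default_rng(0).normal(size=(5, 4))
print(pca_project(X).shape)  # (5, 2)
```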
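And a minimal sketch of the SELU activation; the constants are the published values from Klambauer et al. (2017), while the NumPy implementation itself is illustrative:

```python
import numpy as np

# Constants from Klambauer et al. (2017), "Self-Normalizing Neural Networks"
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(x: np.ndarray) -> np.ndarray:
    """SELU(x) = SCALE * x                    if x > 0
       SELU(x) = SCALE * ALPHA * (exp(x) - 1) otherwise
    This scaling drives activations toward zero mean and unit variance,
    the 'self-normalizing' property named in the definition above."""
    return SCALE * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

print(selu(np.array([-1.0, 0.0, 1.0])))  # approx. [-1.1113, 0.0, 1.0507]
```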