LLM Meeting Decision Trees on Tabular Data

📝 Paper Summary

LLMs for Tabular Data, Tree-based Models, Neuro-symbolic Learning

DeLTa enhances decision tree ensembles by using an LLM to synthesize a refined logical rule from tree paths, which then guides a gradient-based error correction module to calibrate predictions.

Core Problem

Existing LLM-for-tabular methods rely on serializing data into text, which often fails for anonymized features, risks privacy exposure, and struggles with numerical values.

Why it matters:

Many real-world datasets (e.g., finance) use anonymized headers, breaking methods that rely on semantic column names
Directly feeding row data to LLMs exposes sensitive information in privacy-critical fields like healthcare
Fine-tuning LLMs on tabular data is computationally expensive and often limited to few-shot scenarios due to context length constraints

Concrete Example: A financial dataset might have column 'V1' instead of 'Income'. A serialization method like TabLLM producing 'The V1 is 5000' fails because the LLM lacks semantic context for 'V1', whereas a decision tree rule 'V1 > 3000' captures the structural logic directly.
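The contrast can be made concrete with a small sketch. The row values, the second column 'V2', and both helper functions are hypothetical; the sketch only illustrates that a threshold rule works the same whether or not the column name means anything.

```python
# Hypothetical anonymized row: the column names carry no meaning.
row = {"V1": 5000.0, "V2": 0.3}

def serialize(row):
    """Serialization-style input: turn a row into sentences for an LLM."""
    return " ".join(f"The {name} is {value:g}." for name, value in row.items())

def rule_v1_gt_3000(row):
    """Tree-style rule: a threshold test that needs no column semantics."""
    return row["V1"] > 3000

text = serialize(row)          # exposes raw values, and 'V1' tells the LLM nothing
fires = rule_v1_gt_3000(row)   # purely structural, no raw row leaves the machine
print(text)
print(fires)
```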

Key Novelty

Decision Tree Enhancer with LLM-derived Rule (DeLTa)

Instead of feeding data samples to the LLM, DeLTa feeds *decision tree rules* (logic) to the LLM.
The LLM acts as a meta-reasoner to summarize diverse, potentially conflicting rules from a Random Forest into a single, higher-quality refined rule.
This refined rule partitions the data into leaf nodes where a lightweight 'Gradient Net' learns to predict error correction vectors, calibrating the original ensemble's output.
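A minimal sketch of the rules-only interface follows. The rule encoding, the helper names, and the instruction wording are all assumptions made for illustration; the paper's actual prompt template is the one in its Appendix A.4. No LLM is called here; the point is that the prompt contains conditions and leaf predictions, never training rows.

```python
# Hypothetical rules, as might be read off different trees of a Random Forest.
# Each rule is a list of (feature, operator, threshold) conditions plus a leaf value.
rules = [
    {"conditions": [("V1", ">", 3000.0), ("V2", "<=", 0.5)], "prediction": 1},
    {"conditions": [("V1", ">", 2800.0)], "prediction": 1},
    {"conditions": [("V1", "<=", 3000.0), ("V3", ">", 12.0)], "prediction": 0},
]

def rule_to_text(rule):
    """Render one rule as an IF ... THEN ... line."""
    clauses = [f"{f} {op} {t:g}" for f, op, t in rule["conditions"]]
    return "IF " + " AND ".join(clauses) + f" THEN predict {rule['prediction']}"

def build_prompt(rules):
    """Assemble a prompt containing rules only; no training rows are included."""
    lines = [f"Rule {i}: {rule_to_text(r)}" for i, r in enumerate(rules, start=1)]
    instruction = (
        "The rules below come from different decision trees and may conflict. "
        "Summarize them into one refined decision rule."
    )
    return instruction + "\n" + "\n".join(lines)

prompt = build_prompt(rules)
print(prompt)
```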

Architecture

The overall framework of DeLTa.

Evaluation Highlights

Achieves state-of-the-art performance on 22 diverse tabular benchmarks compared to both traditional tree methods (XGBoost, CatBoost) and deep learning methods (FT-Transformer).
Demonstrates lower intra-node sample distance (better grouping) using LLM-refined rules compared to original decision tree rules.
Effective in full-data learning settings without requiring any LLM fine-tuning, unlike prior LLM-tabular works restricted to few-shot or classification-only tasks.

Breakthrough Assessment

8/10

Novel approach that sidesteps the serialization bottleneck by operating on logic (rules) rather than raw data. Solves key privacy and scalability issues in LLM-tabular integration.

⚙️ Technical Details

Problem Definition

Setting: Supervised tabular prediction (classification and regression) using a training set D_train to minimize expected loss L(G(x), y).

Inputs: Tabular features x (numerical and categorical) and an ensemble of decision tree rules R derived from a Random Forest.

Outputs: Final calibrated prediction F*(x).

Pipeline Flow

Rule Extraction (Random Forest -> Rule Set R)
Rule Refinement (LLM + Prompt -> Refined Rule r*)
Gradient Set Construction (Compute negative gradients for train data)
Gradient Net Training (Train leaf-specific models on r* partitions)
Inference (RF Prediction + Gradient Net Correction -> Final Prediction)
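The five steps can be traced on a toy regression problem. This is a sketch under heavy simplifications, not the authors' implementation: steps 1 and 2 are stubbed (a constant function stands in for the Random Forest, a hand-written split stands in for the LLM-refined rule r*), the loss is assumed to be squared error, and the Gradient Net is replaced by a per-leaf mean of the negative gradients. All data values and function names are hypothetical.

```python
from collections import defaultdict

# Toy training data: one feature, real-valued target (hypothetical values).
X_train = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
y_train = [1.0, 1.2, 0.8, 3.0, 3.4, 2.6]

def base_predict(x):
    """Stand-in for the Random Forest (step 1): a constant prediction."""
    return 2.0

def refined_rule_leaf(x):
    """Stand-in for the LLM-refined rule r* (step 2): maps a sample to a leaf id."""
    return "low" if x <= 4.0 else "high"

# Step 3: gradient set. For squared loss 0.5 * (F - y)^2 the negative gradient is y - F(x).
gradient_set = [(x, y - base_predict(x)) for x, y in zip(X_train, y_train)]

# Step 4: one corrector per leaf of r*. Simplification: the leaf mean of the negative gradients.
by_leaf = defaultdict(list)
for x, g in gradient_set:
    by_leaf[refined_rule_leaf(x)].append(g)
leaf_correction = {leaf: sum(gs) / len(gs) for leaf, gs in by_leaf.items()}

def final_predict(x, eta=1.0):
    """Step 5: base prediction plus the scaled leaf-specific correction."""
    return base_predict(x) + eta * leaf_correction[refined_rule_leaf(x)]

def mse(predict):
    """Mean squared error of a predictor on the toy training set."""
    return sum((predict(x) - y) ** 2 for x, y in zip(X_train, y_train)) / len(X_train)

print(mse(base_predict), mse(final_predict))
```

On this toy data the correction lowers training error because the refined rule separates the under-predicted samples from the over-predicted ones; a partition that mixed them would leave the leaf means near zero and the correction would do little.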

System Modules

Base Estimator

Train diverse decision tree experts to extract a set of logical rules

Model or implementation: Random Forest (ensemble of CART trees)
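Rule extraction amounts to enumerating root-to-leaf paths. The sketch below does this for a tiny hand-written tree stored as nested dictionaries; the tree, its thresholds, and the `extract_rules` helper are hypothetical. With a fitted scikit-learn forest, the same walk would presumably read each estimator's `tree_` arrays instead.

```python
# A tiny hand-written binary tree (hypothetical); internal nodes test feature <= threshold.
tree = {
    "feature": "V1", "threshold": 3000.0,
    "left": {"leaf": 0.2},
    "right": {
        "feature": "V2", "threshold": 0.5,
        "left": {"leaf": 0.9},
        "right": {"leaf": 0.6},
    },
}

def extract_rules(node, path=()):
    """Return one (conditions, leaf_value) pair per root-to-leaf path."""
    if "leaf" in node:
        return [(list(path), node["leaf"])]
    f, t = node["feature"], node["threshold"]
    left = extract_rules(node["left"], path + ((f, "<=", t),))
    right = extract_rules(node["right"], path + ((f, ">", t),))
    return left + right

rules = extract_rules(tree)
for conditions, value in rules:
    print(conditions, "->", value)
```

Running this over every tree in the ensemble yields the rule set R; different trees will generally produce overlapping or conflicting conditions on the same features.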

LLM Reasoner

Summarize and refine the set of extracted rules into a single improved rule

Model or implementation: LLM (e.g., GPT-4 or similar via API)

Gradient Net

Predict sample-specific error correction vectors within leaf nodes of r*

Model or implementation: CART (Decision Tree)
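To illustrate what a CART corrector inside one leaf of r* might look like, here is a depth-1 regression tree (a single CART split) fitted to negative gradients. It is a simplified sketch: a single scalar feature, a hypothetical `fit_stump` helper, and made-up gradient values; the tree depth and feature handling used in the paper are not specified here.

```python
def fit_stump(xs, gs):
    """Fit a depth-1 regression tree (one CART split) on scalar inputs xs to targets gs."""
    def mean(v):
        return sum(v) / len(v)

    def sse(v):
        m = mean(v)
        return sum((t - m) ** 2 for t in v)

    best = None
    order = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct inputs.
    for lo, hi in zip(order, order[1:]):
        thr = (lo + hi) / 2.0
        left = [g for x, g in zip(xs, gs) if x <= thr]
        right = [g for x, g in zip(xs, gs) if x > thr]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, thr, mean(left), mean(right))

    if best is None:  # all inputs identical: fall back to a constant correction
        m = mean(gs)
        return lambda x: m
    _, thr, left_value, right_value = best
    return lambda x: left_value if x <= thr else right_value

# Samples that fall into one leaf of r*, with their negative gradients (hypothetical).
xs = [1.0, 2.0, 3.0, 4.0]
gs = [-1.0, -1.0, 0.5, 0.5]
corrector = fit_stump(xs, gs)
print(corrector(1.5), corrector(3.5))
```

Unlike a single mean per leaf, the fitted split lets two samples in the same r* leaf receive different corrections, which is what "sample-specific" refers to.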

Novel Architectural Elements

Rule-based LLM interface: integrating LLMs via logical rule refinement instead of data serialization.
Refined rule-guided calibration: using the LLM-generated rule to partition space for a secondary Gradient Net that predicts residuals.

Modeling

Base Model: Random Forest (for initial rules) + LLM (for rule refinement) + CART (for Gradient Net)

Training Method: Gradient-based error correction (conceptually similar to a single boosting step)

Objective Functions:

Purpose: Train Gradient Net to predict negative gradients.

Formally: Minimize L2 distance between predicted correction and actual negative gradient -∇_F(x) L(F(x), y).
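One way to write this objective out, using notation that is assumed here rather than taken from the paper (training samples (x_i, y_i) for i = 1..N, initial ensemble prediction F, and a Gradient Net g_θ with parameters θ):

```latex
g_i = -\frac{\partial L\big(F(x_i), y_i\big)}{\partial F(x_i)},
\qquad
\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \big\lVert g_{\theta}(x_i) - g_i \big\rVert_2^2
```

As a worked case, for squared loss L(F(x), y) = ½ (F(x) − y)² the target reduces to g_i = y_i − F(x_i), the ordinary residual, which is why the method resembles a single boosting step.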

Training Data:

Gradient Set D_train^∇: Replaces the labels y in D_train with the negative gradients of the loss function w.r.t. the initial predictions.
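A sketch of how such a gradient set could be built for two common losses. The function names and data are hypothetical, and the binary case assumes the initial prediction is a logit; how the paper parameterizes classification outputs is not stated in this summary.

```python
import math

def neg_grad_squared(f, y):
    """Negative gradient of 0.5 * (f - y)**2 with respect to f."""
    return y - f

def neg_grad_logloss(f, y):
    """Negative gradient of binary log-loss w.r.t. the logit f, for y in {0, 1}."""
    p = 1.0 / (1.0 + math.exp(-f))
    return y - p

def build_gradient_set(X, y, initial_predictions, neg_grad):
    """Pair each feature vector with its negative gradient instead of its label."""
    return [(x, neg_grad(f, t)) for x, f, t in zip(X, initial_predictions, y)]

X = [[0.1, 5.0], [0.7, 2.0]]
y = [1, 0]
logits = [0.0, 0.0]  # hypothetical initial predictions from the base ensemble
grad_set = build_gradient_set(X, y, logits, neg_grad_logloss)
print(grad_set)
```

The features are untouched; only the target column changes, so any regressor that could have been trained on D_train can be trained on the gradient set.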

Key Hyperparameters:

learning_rate_eta: Controls step size of correction (similar to boosting learning rate)
LLM_temperature: Not reported in the paper
Random_Forest_estimators: Not explicitly reported in the paper (standard tuning implied)

Compute: Not reported in the paper

Comparison to Prior Work

vs. TabLLM/LIFT: DeLTa avoids data serialization and LLM fine-tuning entirely, operating on rules instead.
vs. XGBoost: DeLTa uses an LLM to globally refine the rule structure before a correction step, rather than relying purely on greedy iterative boosting.
vs. TP-BERTa: DeLTa does not require pre-training on large tabular corpora.
vs. Forest-based methods: DeLTa adds a calibration step guided by an LLM-synthesized rule, rather than just averaging ensemble predictions.

Limitations

Dependency on the quality of the initial Random Forest rules.
Added cost: one LLM API call per dataset at training time (to generate the rule r*, not one call per sample), plus additional tree evaluations at inference.
The specific LLM used for experiments is not explicitly named in the provided text (likely GPT-3.5/4 given the context of 'recent success').

Reproducibility

Code: https://github.com/HangtingYe/DeLTa

The code is publicly available. The main text does not name the exact LLM used (e.g., GPT-3.5/4) but implies a generic LLM accessed via API. The prompt template is provided in Appendix A.4.

📊 Experiments & Results

Evaluation Setup

Full-data supervised learning on diverse tabular datasets.

Benchmarks:

Various tabular benchmarks (Binary Classification, Multiclass Classification, Regression)

Metrics:

Accuracy (Classification)
RMSE (Regression)
ROC-AUC (Binary Classification)
Statistical methodology: Not explicitly reported in the paper

Key Results

The paper claims state-of-the-art performance but the provided text excerpt does not contain specific result tables with extractable numbers. It refers to 'extensive experiments' and 'Figure 2' for intra-node distance but lacks a results table in the excerpt.

Experiment Figures

Comparison of average intra-node distance between original decision tree rules and the LLM-refined rule.

Main Takeaways

DeLTa achieves state-of-the-art performance across diverse benchmarks (classification and regression).
Analysis of the partitions shows that LLM-refined rules produce leaf nodes with lower intra-node sample distance than the original RF rules, indicating better clustering of similar samples.
The method successfully avoids the privacy and modality gap issues of data serialization methods.
DeLTa is applicable to full-data settings without the need for computationally expensive LLM fine-tuning.

📚 Prerequisite Knowledge

Prerequisites

Decision Trees (CART, Random Forest)
Gradient Boosting concepts (residuals, negative gradients)
Large Language Models (In-context learning)

Key Terms

Serialization: Converting structured tabular rows into natural language strings (e.g., 'Age is 25') so LLMs can process them.

Decision Tree Rule: A logical path from root to leaf in a tree, expressed as a series of if-else conditions (e.g., If Age > 20 and Income < 50k).

Gradient Net: A learnable module in DeLTa that predicts the negative gradient (error direction) for samples within a specific leaf node defined by the refined rule.

Error Correction Vector: A vector added to the base model's prediction to shift it towards the ground truth, approximated using the negative gradient of the loss.

Intra-node distance: A measure of how similar data samples are within a single leaf node of a decision tree; lower distance implies better grouping.
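The exact formula behind the paper's intra-node distance is not given in this summary. The sketch below assumes one plausible definition: the mean pairwise Euclidean distance inside each leaf, averaged over leaves. The sample points and the two rules are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

def avg_intra_node_distance(samples, leaf_of):
    """Mean pairwise Euclidean distance within each leaf, averaged over leaves."""
    groups = defaultdict(list)
    for s in samples:
        groups[leaf_of(s)].append(s)
    per_leaf = []
    for members in groups.values():
        pairs = list(combinations(members, 2))
        if pairs:  # leaves with a single sample contribute no pairs
            per_leaf.append(sum(math.dist(a, b) for a, b in pairs) / len(pairs))
    return sum(per_leaf) / len(per_leaf) if per_leaf else 0.0

samples = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

def good_rule(s):
    """Splits on the first feature, grouping nearby points together."""
    return s[0] <= 5.0

def poor_rule(s):
    """Splits on the second feature, mixing distant points in each leaf."""
    return s[1] <= 0.5

d_good = avg_intra_node_distance(samples, good_rule)
d_poor = avg_intra_node_distance(samples, poor_rule)
print(d_good, d_poor)
```

Under this definition, a rule whose leaves hold similar samples scores lower, which matches the sense in which the summary reports the LLM-refined rule as grouping samples better than the original tree rules.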