Programmatic Representation Learning with Language Models

📝 Paper Summary

Interpretability, Program Synthesis, Decision Tree Learning

LeaPR models combine LLM-synthesized Python feature functions with classical decision trees to create efficient, fully interpretable predictors that compete with neural networks on complex tasks.

Core Problem

Classical interpretable models such as decision trees fail on high-dimensional data (images, text) because they lack useful high-level features, while deep learning delivers strong performance but is opaque and compute-intensive.

Why it matters:

High-stakes decision-making requires models that can explain *why* a prediction was made, which neural networks struggle to provide reliably
Classical models are compute-efficient but require extensive manual feature engineering to work on complex domains like chess or images
Neural networks are data-hungry and generalize poorly when in-domain data is scarce

Concrete Example: In chess, a decision tree looking at raw board squares cannot easily learn 'winning probability' because single squares carry little information. However, a simple programmatic feature like 'material difference' makes the task trivial for a shallow tree. Neural networks learn these features implicitly but opaquely; LeaPR explicitly generates the code for them.
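To make the 'material difference' idea concrete, here is a minimal sketch of what such a programmatic feature could look like. It parses a FEN string directly so that it runs without extra libraries; the function name and the 1/3/3/5/9 piece values are illustrative assumptions, not code from the paper.

```python
# Illustrative programmatic feature: material difference from a FEN string.
# The piece values are the conventional 1/3/3/5/9 heuristic (an assumption,
# not taken from the paper); the king is given value 0.
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9, "k": 0}


def material_difference(fen: str) -> float:
    """Return White's material minus Black's material for a FEN position."""
    placement = fen.split()[0]  # the first FEN field is the piece placement
    score = 0
    for ch in placement:
        if ch.lower() in PIECE_VALUES:  # skips digits and '/' separators
            value = PIECE_VALUES[ch.lower()]
            # Uppercase letters are White pieces, lowercase are Black pieces.
            score += value if ch.isupper() else -value
    return float(score)


START = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
# The same position with Black's queen removed.
NO_BLACK_QUEEN = "rnb1kbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
```

A shallow tree can then split on a single threshold of this scalar instead of reasoning over 64 individual squares.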

Key Novelty

Learned Programmatic Representations (LeaPR)

Uses LLMs as 'feature engineers' to write Python functions that map raw inputs (e.g., chess boards) to scalar values, leveraging the LLM's domain knowledge (e.g., familiarity with chess libraries such as python-chess)
Replaces the static feature set of classical decision trees with a dynamic set generated by LLMs during training to maximize predictive power
Introduces D-ID3, a 'white-box' algorithm where the LLM generates specific features on-demand to split difficult leaf nodes based on the examples failing at that node

Architecture

Illustration of the LeaPR hypothesis class: raw input data flows into LLM-generated Python functions (programmatic features), which output scalars that are fed into a decision tree predictor.

Evaluation Highlights

LeaPR (D-ID3 with GPT-5-mini) achieves 98.8 F1 on Ghostbuster text classification, effectively matching the neural Ghostbuster baseline (99.0 F1)
In chess position evaluation, LeaPR models trained on 200k samples outperform a Transformer baseline trained on 50M samples (250x more data) in RMSE (0.245 vs 0.252)
On MNIST, LeaPR features allow a Random Forest to achieve 98.6% accuracy, comparable to a ResNet-50 baseline (99.3%)

Breakthrough Assessment

8/10

Demonstrates that LLM-generated code can replace latent neural representations for complex tasks like image/text classification, offering a rare combination of high performance and native interpretability.

⚙️ Technical Details

Problem Definition

Setting: Supervised learning where the hypothesis class consists of programmatic features f: X -> R and decision tree predictors

Inputs: Raw high-dimensional data x (e.g., images, text, chess boards)

Outputs: Prediction y (e.g., class label or scalar value)

Pipeline Flow

Input Data x
Programmatic Feature Extraction (Run synthesized Python functions f1...fn on x)
Decision Tree / Random Forest Predictor (Uses feature values to predict y)
Output y
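The four pipeline steps can be sketched end to end as follows. This is a toy illustration under stated assumptions: the two text features, the hand-written tree, and all names (`FEATURES`, `TOY_TREE`, `leapr_predict`) are hypothetical stand-ins for LLM-synthesized functions and a learned predictor.

```python
from typing import Callable, Dict, List, Union

Feature = Callable[[str], float]


# Hypothetical stand-ins for LLM-synthesized features over text inputs.
def num_words(text: str) -> float:
    return float(len(text.split()))


def exclamation_ratio(text: str) -> float:
    return text.count("!") / max(len(text), 1)


FEATURES: List[Feature] = [num_words, exclamation_ratio]


def extract(x: str, features: List[Feature]) -> List[float]:
    """Run every feature function on a raw input to get a scalar vector."""
    return [f(x) for f in features]


# A tree is either a leaf label or an internal node with a feature index,
# a threshold, and two subtrees. This one is written by hand for illustration.
Tree = Union[str, Dict]

TOY_TREE: Tree = {
    "feature": 1, "threshold": 0.05,
    "left": {"feature": 0, "threshold": 3.0, "left": "short", "right": "calm"},
    "right": "excited",
}


def predict(tree: Tree, values: List[float]) -> str:
    """Follow threshold tests from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        branch = "left" if values[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


def leapr_predict(x: str) -> str:
    """Input x -> feature extraction -> tree predictor -> output y."""
    return predict(TOY_TREE, extract(x, FEATURES))
```

Every prediction can be traced to a short path of readable threshold tests over named functions, which is the interpretability argument in miniature.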

System Modules

LLM Feature Generator

Generate Python code for features based on dataset samples or leaf node context

Model or implementation: GPT-4o-mini or GPT-5-mini

Feature Executor

Execute generated Python code on input data to produce scalar feature values

Model or implementation: Python Interpreter

Tree Predictor

Map feature values to final prediction using learned splits

Model or implementation: Random Forest (500 trees, max depth 50) or Decision Tree

Novel Architectural Elements

Dynamic expansion of feature space during tree construction (D-ID3) rather than using fixed features
Coupling of LLM code generation with impurity-based decision tree splitting objectives
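The two elements above can be sketched together in a simplified, recursive form. This is not the paper's implementation: the `propose` callback is a hypothetical stand-in for the LLM call, the toy proposer returns fixed features instead of reading the leaf's examples, and the recursion replaces whatever leaf-selection schedule the actual D-ID3 uses.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(examples, features):
    """Return (weighted child entropy, feature, threshold), or None if no split exists."""
    best = None
    for f in features:
        values = sorted({f(x) for x, _ in examples})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between adjacent distinct values
            left = [y for x, y in examples if f(x) <= t]
            right = [y for x, y in examples if f(x) > t]
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(examples)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def d_id3(examples, features, propose, max_depth=3):
    """Grow a tree, asking `propose` for new features at every node that is split."""
    labels = [y for _, y in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority
    features.extend(propose(examples))  # stand-in for prompting the LLM
    split = best_split(examples, features)
    if split is None or split[0] >= entropy(labels):
        return majority  # no candidate feature reduces impurity
    _, f, t = split
    left = [(x, y) for x, y in examples if f(x) <= t]
    right = [(x, y) for x, y in examples if f(x) > t]
    return {"feature": f, "threshold": t,
            "left": d_id3(left, features, propose, max_depth - 1),
            "right": d_id3(right, features, propose, max_depth - 1)}


def predict(tree, x):
    while isinstance(tree, dict):
        tree = tree["left"] if tree["feature"](x) <= tree["threshold"] else tree["right"]
    return tree


# Hypothetical proposer: a real system would show the node's examples to an LLM.
def toy_propose(examples):
    return [lambda s: float(len(s)), lambda s: float(s.count("!"))]


DATA = [("hi!", 1), ("wow!!", 1), ("hello", 0), ("good morning", 0)]
TREE = d_id3(DATA, [], toy_propose)
```

On this toy data the length feature cannot separate the classes cleanly, while the exclamation-count feature yields a pure split, so the tree selects the latter. A fuller version would also deduplicate proposed features and guard against generated code that raises exceptions.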

Modeling

Base Model: GPT-4o-mini and GPT-5-mini (used as feature generators)

Training Method: Two proposed algorithms: F2 (Features FunSearch) and D-ID3 (Dynamic ID3)

Objective Functions:

Purpose: Select the best split feature in D-ID3.

Formally: Minimize impurity (variance for regression, entropy for classification) in child nodes.

Purpose: Guide the LLM in F2.

Formally: Maximize feature importance scores extracted from a Random Forest trained on the current feature set.
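For the first objective, a minimal sketch of the impurity computations may help. It uses the standard textbook definitions of entropy and population variance, weighted by child size; the function names are illustrative, and the Random Forest importance used by F2 is not reproduced here.

```python
import math
from collections import Counter
from statistics import pvariance


def entropy(labels):
    """Classification impurity: Shannon entropy of the labels, in bits."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def variance(targets):
    """Regression impurity: population variance of the targets."""
    return pvariance(targets) if len(targets) > 1 else 0.0


def weighted_child_impurity(left, right, impurity):
    """Size-weighted average impurity of the two children of a candidate split."""
    n = len(left) + len(right)
    return (len(left) * impurity(left) + len(right) * impurity(right)) / n
```

A candidate split is better the lower this weighted value is relative to the parent node's impurity; a value of zero means both children are pure.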

Training Data:

Chess: 200k positions for training, 1k for eval (Lichess)
Image: MNIST and Fashion-MNIST standard splits
Text: Ghostbuster dataset (AI vs Human text)

Key Hyperparameters:

max_features: 1000
rf_n_estimators: 500
rf_max_depth: 50
f2_iterations: 100
f2_batch_size: 10
d_id3_iterations: 1000
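One way to keep these settings together in code is a small frozen dataclass. The field names and values mirror the list above; the class name and the helper method are hypothetical, and the helper assumes that each F2 iteration proposes exactly one batch of features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeaPRConfig:
    """Hyperparameters as listed in the summary (class name is hypothetical)."""
    max_features: int = 1000
    rf_n_estimators: int = 500
    rf_max_depth: int = 50
    f2_iterations: int = 100
    f2_batch_size: int = 10
    d_id3_iterations: int = 1000

    def f2_max_proposals(self) -> int:
        # Assumption: one batch of features is proposed per F2 iteration.
        return self.f2_iterations * self.f2_batch_size


cfg = LeaPRConfig()
```

Under that assumption, 100 iterations of 10 features give at most 1000 proposals, which equals the listed `max_features` value.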

Compute: Not reported in the paper

Comparison to Prior Work

vs. FunSearch: F2 scores features as a set via Random Forest importance rather than individually
vs. Standard Decision Trees: Generates features on-demand during splitting rather than using fixed inputs
vs. Neural Networks (Transformers/ResNet): Interpretable by construction (code + tree) vs. black-box weights
vs. Ko et al. (2025) [not cited in paper]: LeaPR focuses on low-level complex inputs (images/chess) rather than feature engineering for tabular data

Limitations

Dependency on LLM API costs and latency during training phase
D-ID3 cannot show actual image examples to the LLM (text-only prompt), limiting vision capabilities
Move prediction accuracy in Chess lags behind state-of-the-art neural engines despite good state-value regression
Requires domain-specific libraries (e.g., python-chess) to be known by the LLM

Reproducibility

Prompt templates are provided in Appendix C. Datasets are public (Lichess, MNIST, Ghostbuster). Code is not explicitly linked, but the algorithms are described in detail.

📊 Experiments & Results

Evaluation Setup

Supervised learning on Chess (Regression), Image Classification, and Text Classification

Benchmarks:

Chess State-Value (regression: win probability)
MNIST / Fashion-MNIST (image classification)
Ghostbuster (text classification: AI vs. human)

Metrics:

RMSE (Root Mean Square Error)
Pearson Correlation (rho)
Top-1 Move Accuracy
Accuracy
F1 Score

Statistical methodology: Not explicitly reported in the paper
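For reference, the two regression metrics listed above can be computed as in the following sketch. These are the standard definitions, written with the standard library only; the paper's own evaluation code is not shown in this summary.

```python
import math


def rmse(pred, target):
    """Root mean square error between predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Lower RMSE is better, while a Pearson correlation closer to 1 indicates that predicted and true win probabilities move together.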

Key Results

Chess results show LeaPR models outperforming a data-rich Transformer baseline in regression (lower RMSE, higher Pearson correlation). The table also lists a gain in top-1 move accuracy over its baseline, although move prediction still lags state-of-the-art neural engines (see Limitations).

Benchmark	Metric	Baseline	This Paper	Δ
Chess State-Value	RMSE	0.252	0.245	-0.007
Chess State-Value	Pearson Correlation	0.893	0.908	+0.015
Chess State-Value	Top-1 Move Accuracy	30.3	33.5	+3.2

Image and text classification results show LeaPR approaching neural baselines with interpretable features: within 1 point on MNIST and Ghostbuster, and 3.4 points behind on Fashion-MNIST.

Benchmark	Metric	Baseline	This Paper	Δ
MNIST	Accuracy	99.3	98.6	-0.7
Fashion-MNIST	Accuracy	93.0	89.6	-3.4
Ghostbuster (Text)	F1	99.0	98.8	-0.2

Main Takeaways

LeaPR models achieve data efficiency comparable to or better than neural networks, beating a Transformer trained on 250x more data in Chess RMSE.
D-ID3 (Dynamic ID3) consistently produces better or equal features compared to F2 (Features FunSearch), likely due to the context-aware splitting strategy.
The learned features are human-readable Python code, enabling debugging and verification (e.g., discovering spurious correlations in text data), unlike opaque neural weights.
Performance scales with the strength of the underlying LLM (GPT-5-mini consistently outperforms GPT-4o-mini).

📚 Prerequisite Knowledge

Prerequisites

Decision Tree Learning (ID3 algorithm)
Random Forests
Large Language Models (LLMs) for code generation
Feature Engineering concepts

Key Terms

LeaPR: Learned Programmatic Representations—models combining LLM-written feature functions with decision tree predictors

D-ID3: Dynamic ID3—a modified decision tree algorithm where an LLM generates new features on-the-fly to split specific leaf nodes based on misclassified examples

F2: Features FunSearch—an evolutionary-style algorithm where an LLM iteratively proposes batches of features to maximize their importance scores in a Random Forest

FunSearch: A method to search for functions in code space using an LLM and an evaluator, originally used for solving mathematical problems

ID3: Iterative Dichotomiser 3—a classical algorithm used to generate a decision tree from a dataset

SHAP values: SHapley Additive exPlanations—a game theoretic approach to explain the output of any machine learning model by attributing importance to input features

RMSE: Root Mean Square Error—a standard metric for regression tasks measuring the average magnitude of the error

programmatic features: Input features defined as executable code (Python functions) rather than static values or learned neural weights

impurity: A measure (like entropy or variance) used in decision trees to quantify how 'mixed' the labels are at a specific node