
MachineLearningLM: Continued Pretraining Language Models on Millions of Synthetic Tabular Prediction Tasks Scales In-Context ML

H Dong, P Zhang, M Lu, Y Shen, G Ke
University of Chinese Academy of Sciences, South China University of Technology, Stanford University
arXiv preprint, September 2025
Tags: Pretraining · Reasoning · Benchmark

📝 Paper Summary

Tags: Tabular Machine Learning · In-Context Learning (ICL)
MACHINELEARNINGLM equips a general-purpose LLM with robust tabular prediction capabilities through continued pretraining on millions of synthetic tasks sampled from structural causal models, enabling scalable many-shot in-context learning without task-specific fine-tuning.
Core Problem
General LLMs struggle to learn from many-shot examples on tabular ML tasks, often relying on surface-level cues rather than uncovering the underlying causal mechanisms; specialized tabular models, conversely, lack general world knowledge.
Why it matters:
  • LLMs frequently fail to improve accuracy even when given many examples, plateauing quickly unlike traditional ML methods
  • Current tabular-specific ICL models cannot process multimodal inputs or leverage external knowledge because they abandon the LLM architecture
  • Existing methods for tabular LLMs (like TabLLM) require expensive task-specific fine-tuning rather than offering a portable, inference-only solution
Concrete Example: When given 64 examples of a complex tabular task, standard LLMs often perform no better than random guessing or simply predict the majority class because they cannot robustly model numerical relationships in context.
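The majority-class fallback described above is easy to quantify: on a skewed task, always predicting the most common training label already yields non-trivial accuracy, which is the plateau many LLMs fail to beat. A minimal sketch (toy data, not from the paper):

```python
from collections import Counter

def majority_class_accuracy(train_labels, test_labels):
    """Accuracy of the trivial baseline that always predicts the
    most frequent label seen in the in-context training shots."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(y == majority for y in test_labels) / len(test_labels)

# Hypothetical 64-shot task with a ~70/30 class skew
train = ["A"] * 45 + ["B"] * 19
test = ["A"] * 7 + ["B"] * 3
print(majority_class_accuracy(train, test))  # 0.7
```

Any in-context learner whose accuracy stays near this baseline as shots grow is, in effect, ignoring the feature columns.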
Key Novelty
Synthetic SCM-based Continued Pretraining + Token-Efficient Prompting
  • Pretrains the LLM on millions of synthetic tasks generated from Structural Causal Models (SCMs) to teach it 'how to do ML' generally, rather than memorizing specific datasets
  • Uses a 'warm-up' phase where the model mimics a Random Forest teacher to stabilize training and learn robust decision boundaries before switching to self-supervised prediction
  • Compresses tabular data using a compact integer encoding and processes batches of test queries in a single forward pass to handle up to 1,024 shots efficiently
Evaluation Highlights
  • Outperforms GPT-5-mini by ~12% and o3-mini by ~16% on average at high shot counts (128–1024) across 32 diverse tabular datasets
  • Achieves Random-Forest–level accuracy across 8 to 512 shots (within 2% relative accuracy) purely via in-context learning without gradient updates
  • Maintains general chat capabilities, achieving 75.4% on MMLU (50-shot), comparable to the base Qwen-2.5-7B-Instruct model
Breakthrough Assessment
8/10
Significantly advances tabular ICL by demonstrating monotonic many-shot scaling in LLMs, a capability previously absent. Effectively bridges the gap between generalist LLMs and specialized tabular models.