
On the Provable Advantage of Unsupervised Pretraining

Jiawei Ge, Shange Tang, Jianqing Fan, Chi Jin
Princeton University
arXiv (2023)
Pretraining · Reasoning

📝 Paper Summary

Unsupervised Learning Theory · Representation Learning · Sample Complexity Analysis
The paper proves that unsupervised pretraining via maximum likelihood estimation (MLE), followed by supervised fine-tuning via empirical risk minimization (ERM), reduces labeled-sample complexity relative to pure supervised learning by using unlabeled data to learn latent structure.
Core Problem
Despite its empirical success, unsupervised pretraining lacks a rigorous theoretical account of why it improves sample efficiency; most existing analyses are tied to specific methods or fail to show an advantage over purely supervised baselines.
Why it matters:
  • Current theories often rely on restrictive assumptions (e.g., specific to contrastive learning) that do not generalize to other pretraining paradigms
  • Practitioners need to know when and why pretraining helps—specifically, under what conditions unlabeled data can substitute for expensive labeled data
Concrete Example: In a factor model where high-dimensional data (dimension d) is driven by low-dimensional factors (dimension r), pure supervised learning requires O(d) labeled samples to learn the mapping. The proposed theory shows pretraining can learn the d-dimensional structure from unlabeled data, leaving only the r-dimensional task for labeled data.
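The factor-model intuition above can be simulated in a few lines. This is a toy sketch, not the paper's algorithm: PCA stands in for MLE in this linear-Gaussian setting, and all problem sizes and noise levels below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, m, n = 100, 3, 5000, 30   # assumed sizes: d >> r, m >> n

# Latent factor model: x = A z + noise, y = beta^T z + noise
A = rng.standard_normal((d, r))
beta = rng.standard_normal(r)

def sample(k):
    z = rng.standard_normal((k, r))
    x = z @ A.T + 0.1 * rng.standard_normal((k, d))
    y = z @ beta + 0.1 * rng.standard_normal(k)
    return x, y, z

x_unlab, _, _ = sample(m)       # m unlabeled points: learn the structure
x_lab, y_lab, _ = sample(n)     # n labeled points: learn the prediction

# Phase 1 (pretraining): estimate the r-dim subspace from unlabeled data.
# PCA recovers the column span of A only up to rotation -- exactly the kind
# of transformation the downstream linear regression can absorb.
_, _, vt = np.linalg.svd(x_unlab - x_unlab.mean(0), full_matrices=False)
proj = vt[:r]                   # top-r principal directions

# Phase 2 (fine-tuning): regress y on the r learned features, not d raw ones
feats = x_lab @ proj.T
w, *_ = np.linalg.lstsq(feats, y_lab, rcond=None)

# Baseline: pure supervised regression in d dimensions with only n samples
w_sup, *_ = np.linalg.lstsq(x_lab, y_lab, rcond=None)
```

With n < d the supervised baseline is underdetermined and overfits badly, while the pretrained pipeline only has to estimate r coefficients from the n labels.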
Key Novelty
Generic Two-Phase MLE+ERM Framework with Informative Condition
  • Decouples learning into finding the latent variable model (via MLE on unlabeled data) and the prediction function (via ERM on labeled data)
  • Introduces a 'transformation-invariant' informative condition: pretraining is useful as long as it recovers the latent structure up to transformations (e.g., rotations) that the downstream predictor class can absorb
  • Captures diverse methods (Factor Models, GMMs, Contrastive Learning) under a single theoretical umbrella
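The decoupling in the first bullet can be expressed as a small generic pipeline. This is our own sketch of the two-phase structure, with hypothetical interface names (`fit_mle`, `fit_erm`) that are not from the paper; the toy instantiation uses standardization as the "representation" and least squares as the ERM step.

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable

@dataclass
class TwoPhasePipeline:
    """Phase 1: fit a representation phi by MLE on unlabeled data.
    Phase 2: fit a predictor psi by ERM on labeled data."""
    fit_mle: Callable   # x_unlabeled -> phi (feature map)
    fit_erm: Callable   # (features, labels) -> psi (predictor)

    def train(self, x_unlabeled, x_labeled, y_labeled):
        phi = self.fit_mle(x_unlabeled)                # uses m unlabeled samples
        psi = self.fit_erm(phi(x_labeled), y_labeled)  # uses n labeled samples
        return lambda x: psi(phi(x))                   # composed predictor psi ∘ phi

# Toy instantiation: phi standardizes using statistics learned from the
# unlabeled pool; psi is ordinary least squares with an intercept.
def fit_standardizer(x):
    mu, sd = x.mean(0), x.std(0) + 1e-8
    return lambda z: (z - mu) / sd

def fit_ols(f, y):
    f1 = np.hstack([f, np.ones((len(f), 1))])  # include an intercept column
    w, *_ = np.linalg.lstsq(f1, y, rcond=None)
    return lambda g: np.hstack([g, np.ones((len(g), 1))]) @ w
```

Swapping in a different `fit_mle` (EM for a GMM, a contrastive objective) changes the representation but not the overall scheme, which is the point of the unified framework.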
Evaluation Highlights
  • Proves an excess risk bound of roughly O(sqrt(C_phi/m) + sqrt(C_psi/n)), where C_phi and C_psi measure the complexity of the latent-model class and the downstream predictor class; the bound shows a clear benefit when unlabeled data m >> labeled data n
  • Factor Models: Pretraining improves rate to O(d/m + r/n) compared to supervised baseline O(d/n), significant when dimension d is large
  • Gaussian Mixture Models: Pretraining improves rate to O(sqrt(dK/m) + sqrt(K/n)) compared to supervised baseline O(sqrt(dK/n))
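To get a feel for the size of the gap, one can plug numbers into the factor-model rates above. All problem sizes here are illustrative assumptions, not figures from the paper.

```python
# Assumed sizes: ambient dim d, latent dim r, m unlabeled, n labeled samples
d, r = 10_000, 10
m, n = 1_000_000, 500

pretrained_rate = d / m + r / n  # O(d/m + r/n): unlabeled data pays for d
supervised_rate = d / n          # O(d/n): labels must pay for all d dimensions

# pretrained_rate is roughly 0.03 while supervised_rate is 20.0 here,
# i.e. pretraining wins by about three orders of magnitude when d is
# large and m >> n.
```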
Breakthrough Assessment
8/10
Provides a unified, rigorous theoretical justification for pretraining across multiple domains (linear, clustering, contrastive) with explicit sample complexity improvements over supervised baselines.