
Fine-tuning with Very Large Dropout

Jianyu Zhang, Léon Bottou
arXiv (2024)
Pretraining Benchmark

📝 Paper Summary

Transfer Learning · Out-of-Distribution Generalization
Fine-tuning pre-trained networks with extreme dropout rates (~90%) forces the model to retain redundant features, significantly improving out-of-distribution robustness compared to ensembles and weight averaging.
Core Problem
Standard stochastic gradient descent (SGD) has an implicit sparsity bias that discards 'weakly relevant' features—redundant features that are unnecessary for the training distribution but crucial for robustness under distribution shifts.
Why it matters:
  • Modern machine learning relies on fine-tuning pre-trained models on small datasets, where overfitting to the specific training distribution (and losing versatile features) is a major risk.
  • Current methods like Deep Ensembles or Model Soups are computationally expensive or complex, yet they still struggle to fully capture the rich representations needed for OOD generalization.
Concrete Example: In a logistic regression task where multiple features perfectly predict the target, SGD will learn only one 'strongly relevant' feature and starve the gradients of the others. If the target distribution changes such that the learned feature is missing, the model fails. A model forced to learn all the redundant features would survive.
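The gradient-starvation mechanism in this example can be illustrated with a minimal numpy sketch (an illustrative toy, not code from the paper): with two perfectly redundant features, once one of them fits the data, the logistic-loss gradient for the other nearly vanishes, so SGD never learns it.

```python
import numpy as np

# Two redundant features that each perfectly predict the label.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000) * 2 - 1                      # labels in {-1, +1}
X = np.stack([y.astype(float), y.astype(float)], axis=1)  # redundant copies

def grad(w):
    """Mean logistic-loss gradient for weights w."""
    margins = y * (X @ w)
    residual = -y / (1.0 + np.exp(margins))   # d(logistic loss) / d(score)
    return (X * residual[:, None]).mean(axis=0)

g_start = grad(np.array([0.0, 0.0]))   # before training: both features pull equally
g_fitted = grad(np.array([6.0, 0.0]))  # after feature 0 alone fits the data

# The redundant feature's gradient collapses by orders of magnitude once
# feature 0 drives the margins up, so it stays near zero weight -- and if
# feature 0 disappears under distribution shift, the model has nothing left.
print(abs(g_start[1]), abs(g_fitted[1]))
```

The collapse happens because the per-example residual shrinks exponentially with the margin, regardless of which feature produced that margin.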
Key Novelty
Very-Large Dropout Fine-tuning
  • Apply an extremely high dropout rate (e.g., 90%) to the penultimate layer during fine-tuning, which is impossible when training from scratch but viable when starting from rich pre-trained representations.
  • This acts as a massive masking operation, forcing the final classifier to utilize every available subset of features in the representation layer, thereby preserving 'weakly relevant' (redundant) features that aid OOD robustness.
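The recipe above can be sketched in a few lines of numpy (a toy illustration under assumed shapes and hyperparameters, not the paper's training code): inverted dropout with p = 0.9 is applied to a frozen "pre-trained" representation while only a linear head is fine-tuned, so each SGD step sees only about 10% of the features.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.9  # very large dropout rate on the penultimate (representation) layer

def dropout(h, p, rng):
    """Inverted dropout: zero a fraction p of units, rescale the survivors."""
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)

# Toy stand-in for rich pre-trained features: 64-dim representations
# of 128 examples, with linearly realizable binary targets.
H = rng.normal(size=(128, 64))
y = np.sign(H @ rng.normal(size=64))      # synthetic labels in {-1, +1}
w = np.zeros(64)                          # linear head, fine-tuned from scratch

for step in range(500):
    Hd = dropout(H, p, rng)               # each step exposes ~10% of features
    residual = -y / (1.0 + np.exp(y * (Hd @ w)))
    w -= 0.05 * (Hd * residual[:, None]).mean(axis=0)

# Evaluation uses the full representation (no dropout).
accuracy = (np.sign(H @ w) == y).mean()
```

Because the surviving ~10% subset changes every step, the head cannot rely on any single strong feature and must spread predictive weight across many redundant ones, which is exactly the redundancy the paper credits for OOD robustness.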
Evaluation Highlights
  • +1.3% accuracy improvement on TerraIncognita (OOD) using ResNet50 compared to state-of-the-art Weight Averaging (Model Soups) baselines.
  • Consistent gains across 4 major OOD benchmarks (PACS, VLCS, OfficeHome, TerraIncognita) over both Deep Ensembles and Weight Averaging.
  • Achieves these gains using a simple training modification (dropout rate change) without complex ensemble engineering or auxiliary losses.
Breakthrough Assessment
7/10
A simple, counter-intuitive finding (using 90% dropout) that outperforms complex state-of-the-art methods like Model Soups on key OOD benchmarks. Highly practical for fine-tuning.