
Fine-tuning with Very Large Dropout

Jianyu Zhang, Léon Bottou
arXiv (2024)
Pretraining Benchmark

📝 Paper Summary

Transfer Learning · Out-of-Distribution Generalization
Fine-tuning pre-trained networks with extreme dropout rates (~90%) forces the model to retain redundant features, significantly improving out-of-distribution robustness compared to ensembles and weight averaging.
Core Problem
Standard stochastic gradient descent (SGD) has an implicit sparsity bias that discards 'weakly relevant' features—redundant features that are unnecessary for the training distribution but crucial for robustness under distribution shifts.
Why it matters:
  • Modern machine learning relies on fine-tuning pre-trained models on small datasets, where overfitting to the specific training distribution (and losing versatile features) is a major risk.
  • Current methods like Deep Ensembles or Model Soups are computationally expensive or complex, yet they still struggle to fully capture the rich representations needed for OOD generalization.
Concrete Example: In a logistic regression task where multiple features perfectly predict the target, SGD will learn only one 'strongly relevant' feature and starve the gradients of the others. If the target distribution changes such that the learned feature is missing, the model fails. A model forced to learn all the redundant features would survive.
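The gradient-starvation mechanism in this example can be illustrated with a minimal numpy sketch (an illustrative toy, not code from the paper): with two perfectly redundant features, once one of them fits the data, the logistic-loss gradient for the other nearly vanishes, so SGD never learns it.

```python
import numpy as np

# Two redundant features that each perfectly predict the label.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000) * 2 - 1                      # labels in {-1, +1}
X = np.stack([y.astype(float), y.astype(float)], axis=1)  # redundant copies

def grad(w):
    """Mean logistic-loss gradient for weights w."""
    margins = y * (X @ w)
    residual = -y / (1.0 + np.exp(margins))   # d(logistic loss) / d(score)
    return (X * residual[:, None]).mean(axis=0)

g_start = grad(np.array([0.0, 0.0]))   # before training: both features pull equally
g_fitted = grad(np.array([6.0, 0.0]))  # after feature 0 alone fits the data

# The redundant feature's gradient collapses by orders of magnitude once
# feature 0 drives the margins up, so it stays near zero weight -- and if
# feature 0 disappears under distribution shift, the model has nothing left.
print(abs(g_start[1]), abs(g_fitted[1]))
```

The collapse happens because the per-example residual shrinks exponentially with the margin, regardless of which feature produced that margin.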
Key Novelty
Very-Large Dropout Fine-tuning
  • Apply an extremely high dropout rate (e.g., 90%) to the penultimate layer during fine-tuning, which is impossible when training from scratch but viable when starting from rich pre-trained representations.
  • This acts as a massive masking operation, forcing the final classifier to utilize every available subset of features in the representation layer, thereby preserving 'weakly relevant' (redundant) features that aid OOD robustness.
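The recipe above can be sketched in a few lines of numpy (a toy illustration under assumed shapes and hyperparameters, not the paper's training code): inverted dropout with p = 0.9 is applied to a frozen "pre-trained" representation while only a linear head is fine-tuned, so each SGD step sees only about 10% of the features.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.9  # very large dropout rate on the penultimate (representation) layer

def dropout(h, p, rng):
    """Inverted dropout: zero a fraction p of units, rescale the survivors."""
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)

# Toy stand-in for rich pre-trained features: 64-dim representations
# of 128 examples, with linearly realizable binary targets.
H = rng.normal(size=(128, 64))
y = np.sign(H @ rng.normal(size=64))      # synthetic labels in {-1, +1}
w = np.zeros(64)                          # linear head, fine-tuned from scratch

for step in range(500):
    Hd = dropout(H, p, rng)               # each step exposes ~10% of features
    residual = -y / (1.0 + np.exp(y * (Hd @ w)))
    w -= 0.05 * (Hd * residual[:, None]).mean(axis=0)

# Evaluation uses the full representation (no dropout).
accuracy = (np.sign(H @ w) == y).mean()
```

Because the surviving ~10% subset changes every step, the head cannot rely on any single strong feature and must spread predictive weight across many redundant ones, which is exactly the redundancy the paper credits for OOD robustness.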
Evaluation Highlights
  • +1.3% accuracy improvement on TerraIncognita (OOD) using ResNet50 compared to state-of-the-art Weight Averaging (Model Soups) baselines.
  • Consistent gains across 4 major OOD benchmarks (PACS, VLCS, OfficeHome, TerraIncognita) over both Deep Ensembles and Weight Averaging.
  • Achieves these gains using a simple training modification (dropout rate change) without complex ensemble engineering or auxiliary losses.
Breakthrough Assessment
7/10
A simple, counter-intuitive finding (using 90% dropout) that outperforms complex state-of-the-art methods like Model Soups on key OOD benchmarks. Highly practical for fine-tuning.