Video Pretraining Advances 3D Deep Learning on Chest CT Tasks

📝 Paper Summary

Medical Imaging (CT) · Video Pretraining · 3D Deep Learning

Pretraining 3D models on large-scale natural videos (Kinetics-400) significantly improves performance on 3D chest CT tasks, enabling them to outperform 2D baselines even on small datasets.

Core Problem

3D medical imaging tasks lack a standard large-scale pretraining dataset (the role ImageNet plays for 2D), so practitioners rely on 2D models that ignore the volumetric context across slices.

Why it matters:

3D medical datasets are small and expensive to curate, making models prone to overfitting without effective pretraining
Current approaches adapting 2D ImageNet weights to 3D models fail to capture cross-sectional anatomical relationships
2D models remain the de-facto standard in medical imaging despite losing volumetric context, stifling the adoption of more capable 3D architectures

Concrete Example: A 2D ResNet processes a CT scan slice-by-slice, missing the continuity of a blood vessel across slices. A 3D model trained from scratch on small medical data overfits. This paper shows that a 3D model pretrained on YouTube action videos (e.g., 'playing darts') learns spatiotemporal features that transfer to 3D CT scans, where structures such as vessels must be followed from one slice to the next.

Key Novelty

Transfer Learning from Natural Video to 3D Medical Imaging

Leverage the structural similarity between the time dimension in videos and the depth (z-axis) dimension in CT scans to learn spatiotemporal features
Systematic evaluation across seven 3D architectures and two distinct clinical tasks indicates that the benefit of video pretraining holds across architectures rather than being specific to one model
Demonstrates that large-scale out-of-domain pretraining (YouTube videos) is more effective than smaller-scale in-domain pretraining (CT scans) for 3D models

Architecture

Conceptual workflow of the pretraining and finetuning strategies

Evaluation Highlights

+0.146 AUC improvement on average for 3D models on PE detection when pretrained on video vs. training from scratch
Video-pretrained Swin-T outperforms the best 2D baseline (LRCN) by +0.118 AUC on RSNA PE detection
Video pretraining on Kinetics-400 outperforms in-domain pretraining on Stanford CTs (RadFusion) by +0.182 mean AUC on PE detection

Breakthrough Assessment

8/10

Strongly challenges the dominance of 2D models in medical imaging by providing a proven recipe (video pretraining) to make 3D models effective, even with limited medical data.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of 3D volumetric windows from Chest CT scans

Inputs: A sliding window of 32 consecutive CT slices (256x256 pixels each)

Outputs: Binary prediction: Presence/Absence of Pulmonary Embolism (PE) or Lung Nodule
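The windowing of a scan into fixed-size inputs can be sketched as below. This is a minimal illustration, not the authors' code: the `stride` parameter and the choice to drop a trailing partial window are assumptions, since neither is stated here.

```python
from typing import List, Tuple


def window_ranges(num_slices: int, window: int = 32, stride: int = 32) -> List[Tuple[int, int]]:
    """Return [start, end) slice-index ranges for fixed-size windows along the z-axis."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # Trailing slices that cannot fill a whole window are dropped (assumption).
    return [(start, start + window) for start in range(0, num_slices - window + 1, stride)]


# A 100-slice scan yields three non-overlapping 32-slice windows.
ranges = window_ranges(100)
```

Each range selects the slices that form one model input; a smaller stride produces overlapping windows.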

Pipeline Flow

Pretraining (Kinetics-400 Video Classification)
Weight Transfer (Replace Classification Head)
Finetuning (CT Slice Window Classification)
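The weight-transfer step can be illustrated with plain dictionaries standing in for model state. This is a sketch under assumptions: the parameter names and the `head.` prefix are hypothetical, and real checkpoints name their classification layers differently per architecture.

```python
from typing import Any, Dict


def transfer_backbone(pretrained: Dict[str, Any], target: Dict[str, Any],
                      head_prefix: str = "head.") -> Dict[str, Any]:
    """Copy pretrained entries into a copy of `target`, skipping the classification head."""
    merged = dict(target)
    for name, value in pretrained.items():
        if name.startswith(head_prefix):
            continue  # the 400-way Kinetics head is discarded
        if name in merged:
            merged[name] = value  # backbone weights come from video pretraining
    return merged


# Hypothetical parameter names, for illustration only.
video_ckpt = {"backbone.conv1.weight": "kinetics", "head.fc.weight": "kinetics-400-way"}
ct_model = {"backbone.conv1.weight": "random-init", "head.fc.weight": "random-init-binary"}
merged = transfer_backbone(video_ckpt, ct_model)
```

In PyTorch, a similar effect is commonly obtained by removing the head entries from the checkpoint and calling `load_state_dict` with `strict=False`, so the new binary head keeps its fresh initialization.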

System Modules

3D Backbone

Extract spatiotemporal features from input volume

Model or implementation: Various 3D models (Swin-T, MViT, SlowFast, etc.)

Classification Head

Predict binary probability of pathology

Model or implementation: Linear layer with dropout
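A linear head with dropout can be sketched in NumPy as follows. The dropout rate, the feature dimension, and the use of a sigmoid to turn the logit into a probability are assumptions for illustration; none of them is specified here.

```python
import numpy as np


def head_forward(features, weight, bias, drop_p=0.5, training=False, rng=None):
    """Dropout -> linear -> sigmoid on pooled backbone features of shape (batch, dim)."""
    if not 0.0 <= drop_p < 1.0:
        raise ValueError("drop_p must be in [0, 1)")
    x = np.asarray(features, dtype=np.float64)
    if training and drop_p > 0.0:
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(x.shape) >= drop_p
        x = x * keep / (1.0 - drop_p)  # inverted dropout keeps the expected activation
    logits = x @ np.asarray(weight, dtype=np.float64) + bias  # weight: (dim,), bias: scalar
    return 1.0 / (1.0 + np.exp(-logits))


# At inference time dropout is inactive, so the output is deterministic.
probs = head_forward(np.zeros((2, 4)), np.ones(4), 0.0)
```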

Novel Architectural Elements

Application of video-specific architectures (SlowFast, CSN, R(2+1)D) to volumetric medical imaging data by treating the z-axis (depth) as the time dimension
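Treating depth as time amounts to laying out a CT window in the tensor shape a video model expects. The sketch below assumes the common (batch, channels, time, height, width) layout and assumes the grayscale slice is repeated across three channels to match RGB-pretrained weights; how the paper handles channels is not stated here.

```python
import numpy as np


def ct_window_to_video_tensor(window, channels=3):
    """Map a CT window (depth, height, width) to a video-style batch (1, C, T, H, W)."""
    vol = np.asarray(window, dtype=np.float32)
    if vol.ndim != 3:
        raise ValueError("expected an array of shape (depth, height, width)")
    # Depth plays the role of time; the single intensity channel is repeated (assumption).
    clip = np.repeat(vol[np.newaxis, ...], channels, axis=0)  # (C, T, H, W)
    return clip[np.newaxis, ...]                              # (1, C, T, H, W)


# Small spatial size used here only to keep the example light.
window = np.arange(32 * 8 * 8, dtype=np.float32).reshape(32, 8, 8)
clip = ct_window_to_video_tensor(window)
```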

Modeling

Base Model: 7 distinct 3D models: MViT, R(2+1)D R50, CSN R101, SlowFast R50, Slow R50, Swin-T, PENet

Training Method: Supervised Finetuning

Objective Functions:

Purpose: Minimize prediction error on binary classification tasks.

Formally: Binary Cross-Entropy Loss
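For reference, binary cross-entropy averages -[y log p + (1 - y) log(1 - p)] over examples, where p is the predicted probability and y the 0/1 label. A minimal standard-library sketch follows; the clamping constant `eps` is an assumption added for numerical safety.

```python
import math
from typing import Sequence


def binary_cross_entropy(probs: Sequence[float], labels: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# An uninformative prediction of 0.5 costs ln(2) per example.
loss = binary_cross_entropy([0.5, 0.5], [1, 0])
```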

Training Data:

RSNA PE: 5,095 training studies (upsampled positive studies)
LIDC Nodule: 714 training studies
Inputs clipped to Hounsfield Units [400, 1000] (PE) or [-600, 1500] (Nodule)
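Intensity clipping can be sketched as below. The bounds are passed in as parameters and read as lower and upper limits; rescaling the clipped values to [0, 1] is an assumption, since only the clipping itself is described here.

```python
import numpy as np


def clip_and_scale_hu(volume_hu, hu_min, hu_max):
    """Clip Hounsfield Units to [hu_min, hu_max], then rescale linearly to [0, 1]."""
    if hu_max <= hu_min:
        raise ValueError("hu_max must be greater than hu_min")
    vol = np.clip(np.asarray(volume_hu, dtype=np.float32), hu_min, hu_max)
    return (vol - hu_min) / (hu_max - hu_min)  # rescaling step is an assumption


# Example with the nodule range listed above.
scaled = clip_and_scale_hu([-1000, -600, 450, 1500, 3000], -600, 1500)
```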

Key Hyperparameters:

learning_rate: Search over 10^-1 to 10^-5
batch_size: 16 for 3D models, 32 for 2D models
optimizer: Adam (beta1=0.9, beta2=0.999)
epochs: Max 50 with early stopping
input_resolution: 224x224 (random crop from 256x256)
window_size: 32 slices (24 for PENet)
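The listed settings can be gathered into a small configuration object. This is a sketch with hypothetical names: the learning-rate grid assumes one candidate per power of ten, and the crop helper assumes a uniformly random top-left corner, neither of which is specified above.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FinetuneConfig:
    batch_size: int = 16            # 32 for the 2D models
    max_epochs: int = 50            # upper bound; early stopping may end sooner
    source_size: int = 256
    crop_size: int = 224
    window_size: int = 32           # 24 for PENet
    adam_betas: Tuple[float, float] = (0.9, 0.999)


def learning_rate_grid() -> List[float]:
    """Candidate learning rates from 1e-1 down to 1e-5 (powers of ten are an assumption)."""
    return [10.0 ** -k for k in range(1, 6)]


def random_crop_origin(cfg: FinetuneConfig, rng: random.Random) -> Tuple[int, int]:
    """Top-left corner of a crop_size x crop_size crop inside a source_size x source_size slice."""
    limit = cfg.source_size - cfg.crop_size
    return rng.randint(0, limit), rng.randint(0, limit)


cfg = FinetuneConfig()
top, left = random_crop_origin(cfg, random.Random(0))
```

The same crop origin would be applied to every slice in a window so that the volume stays spatially aligned.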

Compute: Used half precision (16-bit) to lower memory requirements. Specific GPU type/count not reported in the paper.
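The memory effect of half precision can be illustrated with simple arithmetic on one input batch. The three-channel layout is an assumption; the remaining sizes come from the hyperparameters above.

```python
def input_batch_bytes(batch_size: int = 16, channels: int = 3, depth: int = 32,
                      height: int = 224, width: int = 224, bytes_per_value: int = 4) -> int:
    """Bytes needed to store one input batch at the given numeric precision."""
    return batch_size * channels * depth * height * width * bytes_per_value


fp32_bytes = input_batch_bytes(bytes_per_value=4)  # 32-bit floats
fp16_bytes = input_batch_bytes(bytes_per_value=2)  # 16-bit floats
```

Intermediate activations usually dominate memory during training, and they shrink by roughly the same factor when stored in 16 bits.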

Comparison to Prior Work

vs. 2D ImageNet Transfer: Natively models 3D context; outperforms 2D baselines on small data
vs. Models Genesis: Uses supervised video pretraining (Kinetics) rather than in-domain self-supervised learning
vs. ACS: Flexible across architectures (supports transformers such as Swin-T and MViT), not limited to inflating 2D CNNs

Limitations

In-domain pretraining dataset (RadFusion) was relatively small (1,241 studies) compared to Kinetics-400
Comparison restricted to supervised pretraining; does not evaluate self-supervised learning (SSL) interactions
Limited to Chest CT modality; generalization to MRI or Ultrasound not tested
Statistical testing covers only aggregate performance; improvements for individual models were not assessed

Reproducibility

Code: https://github.com/rajpurkarlab/chest-ct-pretraining

The code is publicly available at the repository above. Pretrained weights for the 3D models were sourced from PyTorchVideo. The RSNA PE and LIDC-IDRI datasets are public. The RadFusion (Stanford CT) dataset used for in-domain pretraining is not public.

📊 Experiments & Results

Evaluation Setup

Binary classification on held-out test sets from RSNA (PE) and LIDC (Nodule) datasets

Benchmarks:

RSNA PE CT (Pulmonary Embolism Detection)
LIDC-IDRI (Lung Nodule Detection)

Metrics:

AUC (Area Under the ROC Curve)
Statistical methodology: Nonparametric bootstrap for 95% confidence intervals
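AUC with a nonparametric bootstrap interval can be sketched in pure Python as follows. The number of resamples, the percentile method, and resampling at the level of individual predictions are assumptions; the summary only states that a nonparametric bootstrap was used for 95% confidence intervals.

```python
import random
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels: Sequence[int], scores: Sequence[float], n_boot: int = 1000,
                     alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for AUC."""
    auc(labels, scores)  # fails early if only one class is present
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUC is undefined when a resample contains a single class
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


point = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
low, high = bootstrap_auc_ci([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8], n_boot=200)
```

The pairwise AUC is quadratic in the number of examples, which is fine for illustration but slow on large test sets; a rank-based implementation is the usual choice there.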

Key Results

Comparison	Benchmark	Metric	Baseline	This Paper	Δ
Video-pretrained 3D models vs. 3D models trained from scratch	RSNA PE (100% data)	Mean AUC	0.573	0.719	+0.146
Video-pretrained 3D models vs. 3D models trained from scratch	LIDC Nodule (100% data)	Mean AUC	0.642	0.766	+0.124
Best 3D model (video pretrained) vs. best 2D model (ImageNet pretrained)	RSNA PE (100% data)	AUC	0.770	0.888	+0.118
Best 3D model (video pretrained) vs. best 2D model (ImageNet pretrained)	LIDC Nodule (100% data)	AUC	0.833	0.885	+0.052
Out-of-domain (video) pretraining vs. in-domain (CT) pretraining	RSNA PE (100% data)	Mean AUC	0.537	0.719	+0.182
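The Δ values can be re-derived from the reported baseline and result AUCs. The short check below uses only the numbers in the table; the comparison labels are shorthand introduced for this example.

```python
from typing import List, Tuple

# (comparison, benchmark, baseline AUC, reported AUC)
RESULTS: List[Tuple[str, str, float, float]] = [
    ("video vs. scratch", "RSNA PE", 0.573, 0.719),
    ("video vs. scratch", "LIDC Nodule", 0.642, 0.766),
    ("best 3D vs. best 2D", "RSNA PE", 0.770, 0.888),
    ("best 3D vs. best 2D", "LIDC Nodule", 0.833, 0.885),
    ("video vs. in-domain CT", "RSNA PE", 0.537, 0.719),
]


def auc_deltas(rows: List[Tuple[str, str, float, float]]) -> List[float]:
    """Absolute AUC gain of the reported result over its baseline, rounded to 3 decimals."""
    return [round(result - baseline, 3) for _, _, baseline, result in rows]


deltas = auc_deltas(RESULTS)
```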

Experiment Figures

Bar charts comparing AUC of 10 models (three 2D, seven 3D) with and without video/image pretraining on RSNA and LIDC

Line plots comparing Max 3D AUC vs Max 2D AUC across dataset sizes (1%, 10%, 100%)
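The reduced-data settings can be reproduced in spirit by subsampling training studies. This sketch makes several assumptions that are not stated here: sampling is uniform at the study level, the subset size is rounded to the nearest integer, and a fixed seed is used for repeatability.

```python
import random
from typing import List, Sequence


def subsample_studies(study_ids: Sequence[str], fraction: float, seed: int = 0) -> List[str]:
    """Draw a reproducible random subset containing `fraction` of the training studies."""
    if not study_ids:
        raise ValueError("study_ids must not be empty")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(study_ids) * fraction))  # keep at least one study
    rng = random.Random(seed)
    return sorted(rng.sample(list(study_ids), k))


# Hypothetical study identifiers sized like the LIDC training split (714 studies).
ids = [f"study_{i:04d}" for i in range(714)]
subsets = {frac: subsample_studies(ids, frac) for frac in (0.01, 0.10, 1.00)}
```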

Main Takeaways

Video pretraining improves 3D model performance for every tested architecture on both tasks
3D models with video pretraining outperform the best 2D ImageNet-pretrained baselines (+0.118 AUC on RSNA PE, +0.052 AUC on LIDC), challenging the common preference for 2D models in medical imaging
Large-scale out-of-domain pretraining (Kinetics-400) is more effective than smaller-scale in-domain pretraining (Stanford CT), suggesting scale matters more than domain alignment for initial feature learning
Sequential pretraining (Video then CT) yielded mixed results, often underperforming pure Video pretraining, suggesting potential negative interference or overfitting on the small in-domain CT set

📚 Prerequisite Knowledge

Prerequisites

Understanding of 3D CNNs vs 2D CNNs
Transfer learning concepts (pretraining vs finetuning)
Basic medical imaging terminology (CT, slices, nodules, PE)

Key Terms

CT: Computed Tomography—a medical imaging technique that uses X-rays to create detailed pictures of the inside of the body

PE: Pulmonary Embolism—a blockage in one of the pulmonary arteries in the lungs

Kinetics-400: A large-scale dataset of ~300k 10-second YouTube video clips annotated with 400 human action classes, used here for pretraining

RSNA: Radiological Society of North America—refers here to a specific public dataset for Pulmonary Embolism detection

LIDC-IDRI: Lung Image Database Consortium and Image Database Resource Initiative, a public collection of annotated thoracic CT scans used here for lung nodule detection

AUC: Area Under the Receiver Operating Characteristic Curve—a performance metric where 1.0 is perfect and 0.5 is random guessing

Swin-T: Swin Transformer (Tiny)—a hierarchical vision transformer that computes self-attention within local windows

MViT: Multiscale Vision Transformer—a transformer architecture for video that learns a hierarchy of representations

PENet: Pulmonary Embolism Network—a specialized 3D CNN architecture designed specifically for PE detection

R(2+1)D: Residual Network with (2+1)D convolutions—decomposes 3D convolutions into separate 2D spatial and 1D temporal convolutions

CSN: Channel-Separated Convolutional Network—a video classification model that uses group convolutions to reduce parameters

LRCN: Long-term Recurrent Convolutional Network—a 2D CNN backbone (feature extractor) connected to an RNN (like LSTM/GRU) to model temporal sequences