CT: Computed Tomography—a medical imaging technique that uses X-rays to create detailed pictures of the inside of the body
PE: Pulmonary Embolism—a blockage in one of the pulmonary arteries in the lungs
Kinetics-400: A large-scale dataset of ~300k 10-second YouTube video clips annotated with 400 human action classes, used here for pretraining
RSNA: Radiological Society of North America—refers here to a specific public dataset for Pulmonary Embolism detection
LIDC-IDRI: Lung Image Database Consortium and Image Database Resource Initiative—a public collection of thoracic CT scans with annotated lung nodules, used for lung nodule detection
AUC: Area Under the Receiver Operating Characteristic Curve—a performance metric where 1.0 is perfect and 0.5 is random guessing
Swin-T: Swin Transformer (Tiny)—a hierarchical vision transformer that computes self-attention within local windows
MViT: Multiscale Vision Transformer—a transformer architecture for video that learns a hierarchy of representations
PENet: Pulmonary Embolism Network—a 3D CNN architecture designed specifically for PE detection
R(2+1)D: Residual Network with (2+1)D convolutions—decomposes 3D convolutions into separate 2D spatial and 1D temporal convolutions
CSN: Channel-Separated Convolutional Network—a video classification model that uses group convolutions to reduce parameters
LRCN: Long-term Recurrent Convolutional Network—a 2D CNN backbone (feature extractor) whose outputs feed an RNN (such as an LSTM or GRU) to model temporal sequences