PerceiverIO: A transformer architecture that maps diverse inputs to a fixed-size latent space, allowing efficient handling of high-dimensional multimodal data
CLIP: Contrastive Language-Image Pre-training—a model that learns to associate images with text descriptions, used here for visual feature extraction
AST: Audio Spectrogram Transformer—a model adapted from vision transformers to process audio spectrograms for classification
Macro-F1: A metric that calculates F1 score for each class independently and then averages them, treating all classes equally regardless of size
Zero-shot reasoning: Using a pre-trained model (like GPT-4) to perform a task without any task-specific training examples or fine-tuning
Tone Transition: A binary classification task determining if the perceived affective tone changes between the start, middle, and end of a video
A-Max: A fusion strategy in which the logits (raw prediction scores) from two separate models are averaged per class, and the class with the highest averaged logit is predicted
D-Max: A fusion strategy in which the single largest logit across the two models directly determines the predicted class, with no averaging step
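
The two fusion strategies and the Macro-F1 metric defined above can be illustrated with a minimal Python sketch. It assumes each model emits one logit per class for a single video; the function names (`a_max`, `d_max`, `macro_f1`), the example logits, and the tie-breaking and zero-division behavior are illustrative assumptions, not details taken from the source.

```python
from typing import Sequence


def a_max(logits_a: Sequence[float], logits_b: Sequence[float]) -> int:
    """A-Max (as described above): average the logits per class, then argmax."""
    averaged = [(a + b) / 2.0 for a, b in zip(logits_a, logits_b)]
    # Ties resolve to the lowest class index (an assumption of this sketch).
    return max(range(len(averaged)), key=averaged.__getitem__)


def d_max(logits_a: Sequence[float], logits_b: Sequence[float]) -> int:
    """D-Max (as described above): the single largest logit picks the class."""
    pooled = [max(a, b) for a, b in zip(logits_a, logits_b)]
    return max(range(len(pooled)), key=pooled.__getitem__)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples scores 0.0 here
        # (an assumption of this sketch).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical logits for a binary task: index 0 = no transition, 1 = transition.
visual_logits = [3.0, 0.0]
audio_logits = [-4.0, 1.0]

fused_avg = a_max(visual_logits, audio_logits)  # averaged logits: [-0.5, 0.5]
fused_max = d_max(visual_logits, audio_logits)  # pooled logits: [3.0, 1.0]

# Hypothetical labels: class 0 is three times as common as class 1.
score = macro_f1([0, 0, 0, 1], [0, 0, 1, 1])    # per-class F1: 0.8 and 2/3
```

With these inputs the two strategies disagree: A-Max predicts class 1 because the audio model's strongly negative logit pulls the class-0 average down, while D-Max predicts class 0 because the visual model's confident logit wins outright. The Macro-F1 example gives each class equal weight even though class 0 has three times as many samples.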