MLLM: Multimodal Large Language Model—an LLM capable of processing non-text inputs like images or audio
Instruction Tuning: Fine-tuning a pre-trained language model on a dataset of instruction-response pairs to improve its ability to follow user commands
HuBERT: Hidden Unit BERT—a self-supervised speech representation model used here as the audio encoder
Action Units (AUs): Fundamental actions of individual facial muscles or muscle groups (e.g., 'brow lowerer'), defined in the Facial Action Coding System (FACS) and used to code facial expressions
MAE: Masked Autoencoder—a vision model trained to reconstruct missing parts of an image, effective for learning visual features
Clue Overlap: A metric measuring how well the model's predicted emotional clues (reasons) match the ground truth
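UAR: Unweighted Average Recall, a classification metric that averages per-class recall with equal weight per class, so minority classes count as much as majority ones on imbalanced datasets
WAR: Weighted Average Recall, per-class recall weighted by each class's share of the samples; this is equivalent to standard overall accuracy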
Zero-shot: Testing a model on tasks or classes it has not explicitly seen during training
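
To make the difference between the UAR and WAR entries concrete, here is a minimal sketch in plain Python. The function names (`recall_per_class`, `uar`, `war`) and the toy labels are hypothetical, chosen only for illustration; they are not taken from the source or from any particular library.

```python
from collections import Counter


def recall_per_class(y_true, y_pred):
    """Return {label: recall} for every label that occurs in y_true."""
    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class
    return {label: hits[label] / n for label, n in support.items()}


def uar(y_true, y_pred):
    """Unweighted Average Recall: every class contributes equally."""
    recalls = recall_per_class(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def war(y_true, y_pred):
    """Weighted Average Recall: each class is weighted by its share of samples."""
    support = Counter(y_true)
    recalls = recall_per_class(y_true, y_pred)
    total = len(y_true)
    return sum(recalls[c] * support[c] / total for c in support)


# Hypothetical imbalanced toy data: 8 "neutral" samples, 2 "angry" samples.
y_true = ["neutral"] * 8 + ["angry"] * 2
# The classifier gets every "neutral" right but misses one of the two "angry" samples.
y_pred = ["neutral"] * 8 + ["neutral", "angry"]

print(f"UAR = {uar(y_true, y_pred):.2f}")
print(f"WAR = {war(y_true, y_pred):.2f}")
```

On this toy data the per-class recalls are 1.0 for "neutral" and 0.5 for "angry". WAR stays high because the majority class dominates the weighting, while UAR drops because the poorly recognized minority class counts for half of the average.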