InfoNCE loss: A contrastive loss that treats each positive pair as the correct class in a softmax over one positive and many negatives, raising the similarity of positive pairs while lowering the similarity of negative pairs.
SimCSE: Simple Contrastive Learning of Sentence Embeddings, a method that uses dropout noise as data augmentation for self-supervised contrastive learning.
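R-Drop: A regularization method that passes each input through the model twice with different dropout masks (yielding two sub-models) and penalizes the bidirectional KL divergence between the two output distributions so that they stay consistent.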
In-modal contrastive learning: Aligning representations from the same modality (e.g., Audio vs. Audio) derived from the same input.
Cross-modal contrastive learning: Aligning representations from different modalities (e.g., Audio vs. Text) belonging to the same pair.
IEMOCAP: Interactive Emotional Dyadic Motion Capture database—a standard benchmark dataset for speech emotion recognition.
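WA: Weighted Accuracy, the overall classification accuracy: the fraction of all samples classified correctly, so larger classes contribute more.
UA: Unweighted Accuracy, the average of the per-class accuracies (per-class recall), so every class counts equally regardless of its size.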
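
The difference between WA and UA is easiest to see on an imbalanced label set. The following is a minimal Python sketch, not taken from the source: the function names and the toy labels are hypothetical, chosen only to illustrate the two definitions above.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """WA: correct predictions divided by the total number of samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA: mean of per-class accuracy (recall), each class counted once."""
    total = defaultdict(int)  # samples per true class
    hits = defaultdict(int)   # correctly classified samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / total[c] for c in total) / len(total)


# Hypothetical toy data: 8 "neutral" samples and 2 "angry" samples.
y_true = ["neutral"] * 8 + ["angry"] * 2
# All neutral samples are correct; one of the two angry samples is missed.
y_pred = ["neutral"] * 8 + ["neutral", "angry"]

wa = weighted_accuracy(y_true, y_pred)    # 9 correct out of 10
ua = unweighted_accuracy(y_true, y_pred)  # mean of 8/8 and 1/2
print(f"WA = {wa:.2f}, UA = {ua:.2f}")
```

In this toy case WA is 0.90 while UA is 0.75, because the error on the small "angry" class is averaged in at full weight under UA.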