ALM: Audio-Language Model—a multimodal AI that can process and reason about non-speech audio and music using natural language.
CLAP: Contrastive Language-Audio Pre-training—a method that learns a shared embedding space for audio and text by maximizing similarity between matched audio-text pairs and minimizing it between mismatched pairs.
AF-CLAP: The authors' improved CLAP encoder, trained with linguistically diverse positives and composition-aware negatives to improve robustness.
XATTN-Dense: Gated Cross-Attention Dense layers—architectural components inserted into the LLM to inject audio information while keeping the LLM weights frozen.
RoPE: Rotary Positional Embeddings—a method to encode positional information into embeddings, used here to track temporal order in sliding audio windows.
HTSAT: Hierarchical Token-Semantic Audio Transformer—a specific transformer-based audio encoder architecture used as the backbone for AF-CLAP.
MMAU: Massive Multi-Task Audio Understanding and Reasoning—a benchmark for evaluating expert-level knowledge and reasoning in audio models.
Curriculum Learning: A training strategy in which the model is trained in stages on progressively more difficult or more diverse data.
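
To make the CLAP entry concrete, here is a minimal sketch of a symmetric contrastive loss over a batch of paired audio and text embeddings. It assumes the generic InfoNCE-style formulation commonly used for contrastive pre-training, not the modified AF-CLAP objective with composition-aware negatives. The function names and the temperature value of 0.07 are illustrative assumptions, not taken from the source.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is assumed to be a matched pair.

    temperature is an assumed illustrative value, not one reported in the source.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)
    # logits[i][j]: scaled cosine similarity between audio i and text j
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio view
    # Cross-entropy with the diagonal (the matched pair) as the target class
    loss_a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_t2a = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_a2t + loss_t2a)


audio = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = clap_contrastive_loss(audio, matched_text)
loss_swapped = clap_contrastive_loss(audio, swapped_text)
```

In this toy batch the correctly paired embeddings give a loss near zero, while swapping the captions gives a much larger one, which is the behavior the contrastive objective rewards during training.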