LALM: Large Audio Language Model—an LLM extended to process audio inputs via an encoder and adapter.
MoE: Mixture-of-Experts—a neural network architecture where different subsets of parameters (experts) are activated for different inputs.
Gradient Conflict: A phenomenon where the gradient updates required for one task or data type point in the opposite direction to those required for another, so the combined update partially cancels and slows progress on both.
Paralinguistic: Non-verbal aspects of speech and audio, such as tone, emotion, background noise, or speaker identity, distinct from linguistic content.
Modality Gap: The geometric distance between the audio and text embeddings of paired examples; a smaller gap implies better alignment.
NTP: Next-Token Prediction—the standard training objective for autoregressive language models.
SiLU: Sigmoid Linear Unit—an activation function used in the experts.
Top-k routing: A mechanism that selects the k experts with the highest router scores for a given input token.
Whisper: A speech recognition model used here as the audio encoder backbone.
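
To make three of the terms above concrete (SiLU, Top-k routing, and Gradient Conflict), here is a minimal pure-Python sketch. It is illustrative only and not taken from any particular model's implementation. The function names are hypothetical, the softmax renormalization over the selected experts is an assumption (implementations differ on how routing weights are normalized), and using cosine similarity as the measure of gradient conflict is one common choice rather than the only one.

```python
import math


def silu(x: float) -> float:
    """SiLU(x) = x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def top_k_route(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest router scores.

    Assumption: weights are a softmax over the selected scores only.
    Ties are broken in favor of the lower expert index.
    """
    if not 1 <= k <= len(router_scores):
        raise ValueError("k must be between 1 and the number of experts")
    ranked = sorted(
        range(len(router_scores)),
        key=lambda i: router_scores[i],
        reverse=True,
    )
    chosen = ranked[:k]
    peak = max(router_scores[i] for i in chosen)  # subtract for stability
    exps = [math.exp(router_scores[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def gradient_cosine(g_a, g_b):
    """Cosine similarity of two gradient vectors.

    A negative value means the two updates point in opposing directions,
    which is the situation the glossary calls gradient conflict.
    Returns 0.0 if either vector has zero norm.
    """
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # One token's router scores over four hypothetical experts.
    print(top_k_route([0.1, 2.0, -1.0, 2.0], k=2))
    print(silu(1.0))
    print(gradient_cosine([1.0, 0.0], [-1.0, 0.0]))
```

In a full MoE layer, the weights returned by `top_k_route` would scale the outputs of the selected experts before they are summed; that step is omitted here to keep the sketch short.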