LALM: Large Audio-Language Model—a multimodal model capable of understanding and reasoning about audio inputs using a Large Language Model.
AST: Audio Spectrogram Transformer—a purely attention-based model for audio classification that processes audio spectrograms as patches.
Q-Former: Querying Transformer—a module that bridges a frozen image/audio encoder and a frozen LLM, using learnable query vectors to extract relevant features.
CompA-R: Instruction-Tuning for Complex Audio Reasoning, a synthetic instruction-tuning dataset introduced in this paper whose instructions require complex reasoning about the input audio.
Soft Prompt: A trainable vector sequence inserted into the input embedding space to steer the model's behavior, used here to adaptively incorporate audio event tags.
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices.
OpenAQA: Open Audio Question Answering, a large-scale dataset of (audio, question, answer) tuples used to train and evaluate audio understanding.
Dense Captioning: A task requiring the model to identify every event in the audio and the context of its occurrence with respect to other events.
CLAP: Contrastive Language-Audio Pretraining—a model trained to align audio and text representations in a shared latent space.
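
To make the LoRA entry above concrete, the following is a minimal sketch in plain Python, not the implementation used in the paper. It assumes a single linear layer with frozen weights W and adds a trainable low-rank update B·A scaled by alpha/rank. The class name `LoRALinear`, the helper `matvec`, and the initialization values are illustrative assumptions.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


class LoRALinear:
    """Hypothetical linear layer with a frozen weight and a low-rank update."""

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # pre-trained weight, kept frozen
        # A starts with small random values, B starts at zero, so the
        # adapted layer initially behaves exactly like the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path: W x
        update = matvec(self.B, matvec(self.A, x))  # low-rank path: B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        # Only A and B are trained: rank * (d_in + d_out) parameters.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Usage: a 3x4 frozen weight adapted with rank 1.
W = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 1.0, 0.0, 1.0],
     [2.0, 0.0, 0.0, 0.0]]
layer = LoRALinear(W, rank=1)
y = layer.forward([1.0, 1.0, 1.0, 1.0])
```

In this toy case the frozen weight has 12 entries while the adapter trains only 7; the saving grows quickly for realistic layer sizes, since the adapter scales with rank * (d_in + d_out) instead of d_in * d_out.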