
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S. Sakshi, Oriol Nieto, R. Duraiswami, Dinesh Manocha
University of Maryland, College Park, Adobe
Conference on Empirical Methods in Natural Language Processing (2024)
MM Speech Reasoning Benchmark

📝 Paper Summary

Audio-Language Models (LALMs), Audio Understanding, Instruction Tuning
GAMA integrates multiple audio encoders and a multi-layer aggregator into an LLM, utilizing a new complex reasoning dataset (CompA-R) to improve open-ended audio question answering.
Core Problem
Existing Large Audio-Language Models (LALMs) rely on simple connection modules that hinder comprehensive multimodal alignment, and the models fail at complex reasoning tasks that require a nuanced understanding of acoustic events and their contexts.
Why it matters:
  • Current models excel at simple event detection but hallucinate or fail on questions involving complex reasoning, such as inferring relationships between overlapping sounds and their scenarios.
  • Simple linear couplings between audio encoders and LLMs risk suboptimal performance and hallucinations due to insufficient multimodal alignment.
  • Improving non-speech sound understanding is critical for autonomous agents to interact with the world beyond just visual and spoken language perception.
Concrete Example: For an audio clip containing laughter and automotive sounds, current models might only list the sounds. Given the instruction 'Identifying the context of laughter and its relationship with the automotive sounds... Draw a conclusion on the possible scenario occurring', GAMA instead infers a specific scenario that connects the two.
Key Novelty
GAMA (General-purpose Large Audio-Language Model)
  • Integrates features from two distinct audio encoders (Audio Q-Former and Audio Spectrogram Transformer) to capture both semantic generalization and surface-level audio properties.
  • Uses a multi-layer aggregator to combine features from different layers of the audio encoder, capturing information at various scales (generic sounds vs. complex patterns).
  • Introduces a soft-prompting mechanism that incorporates high-level semantic evidence (event tags) during instruction tuning to aid complex reasoning.
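The three ideas above can be pictured as building one audio "prefix" that is fed to the LLM. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the aggregator is approximated here by a softmax-weighted sum over encoder layers (the paper's aggregator may be more elaborate), and every name, shape, and random weight (`aggregate_layers`, `build_audio_prefix`, `w_agg`, `w_q`, `tag_prompt`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def aggregate_layers(layer_feats, layer_logits):
    # layer_feats: (L, T, D) hidden states from L encoder layers.
    # Stand-in for a multi-layer aggregator: a weighted sum over layers.
    weights = softmax(layer_logits)                      # (L,)
    return np.tensordot(weights, layer_feats, axes=1)    # (T, D)

def build_audio_prefix(layer_feats, qformer_tokens, tag_prompt,
                       layer_logits, w_agg, w_q):
    # Project both audio feature streams into the (assumed) LLM embedding
    # size, then concatenate them with soft-prompt vectors for event tags.
    agg = aggregate_layers(layer_feats, layer_logits) @ w_agg   # (T, D_llm)
    q = qformer_tokens @ w_q                                    # (Q, D_llm)
    return np.concatenate([agg, q, tag_prompt], axis=0)

# Toy shapes, chosen only for illustration.
L, T, D, Q, D_q, D_llm, N_tags = 3, 8, 16, 4, 12, 32, 2
layer_feats = rng.normal(size=(L, T, D))        # spectrogram-encoder layers
qformer_tokens = rng.normal(size=(Q, D_q))      # Q-Former query outputs
tag_prompt = rng.normal(size=(N_tags, D_llm))   # placeholder tag embeddings
layer_logits = np.zeros(L)                      # learnable in a real model
w_agg = rng.normal(size=(D, D_llm))
w_q = rng.normal(size=(D_q, D_llm))

prefix = build_audio_prefix(layer_feats, qformer_tokens, tag_prompt,
                            layer_logits, w_agg, w_q)
print(prefix.shape)
```

In a trained model the layer weights and projections would be learned, and the tag vectors would come from embedding predicted event tags rather than from random noise.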
Evaluation Highlights
  • GAMA outperforms LALM baselines (LTU, SALMONN, Pengi) on diverse audio understanding tasks by margins of 1%-84%.
  • On the new CompA-R-test benchmark for complex reasoning, GAMA-IT achieves GPT-4-judged scores of 4.3/4.5 (Clarity/Correctness), significantly surpassing LTU (3.5/4.0) and SALMONN (2.6/2.8).
  • Ablation studies confirm that the multi-layer aggregator and Audio Q-Former both contribute positively; removing the aggregator lowers performance on OpenAQA by ~0.2 points.
Breakthrough Assessment
7/10
Strong engineering combination of multiple audio representations and a well-motivated synthetic dataset for complex reasoning. While architectural novelty is evolutionary (stacking encoders), the focus on complex reasoning scenarios pushes the state-of-the-art significantly.