
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

S. Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, S Ramaneswaran, Oriol Nieto, R. Duraiswami, Sreyan Ghosh, Dinesh Manocha
University of Maryland, College Park, USA
International Conference on Learning Representations (2024)
MM Speech Benchmark Reasoning KG

📝 Paper Summary

Audio-Language Models Multimodal Benchmarking
MMAU is a large-scale benchmark challenging AI models with expert-annotated questions across speech, music, and sound, revealing that current systems significantly lag behind human performance in complex audio reasoning.
Core Problem
Existing audio benchmarks focus on foundational perception tasks (like ASR or simple classification) that do not require the complex reasoning or expert-level domain knowledge essential for Artificial General Intelligence.
Why it matters:
  • Current benchmarks like OpenASQA or MusicBench are limited to specific domains or simple tasks, failing to test the '90th percentile of skilled adults' standard for AGI
  • Large Audio-Language Models (LALMs) are advancing rapidly but are evaluated on tasks solvable by young children, masking their inability to perform deliberate, expert-level reasoning
  • No existing benchmark covers the breadth (all audio domains) and depth (complex reasoning) required to rigorously assess modern multimodal models
Concrete Example: Current models can transcribe speech (ASR), but they fail tasks such as 'Temporal Acoustic Event Analysis' or 'Emotional Shift Detection', which MMAU specifically tests. Such tasks require, for example, explaining why a speaker's emotion changed based on background sounds.
Key Novelty
Massive Multi-Task Audio Understanding (MMAU) Benchmark
  • Comprehensive coverage of 3 distinct audio domains (Speech, Music, Sound) with 10,000 expert-annotated samples, unlike prior single-domain benchmarks
  • Focus on 'Depth' via 27 distinct expert skills, separating tasks into Information Extraction (requiring world knowledge) and Reasoning (requiring complex cognitive processing)
  • Rigorous curation pipeline that combines domain experts with GPT-4-based option augmentation to keep questions challenging and non-trivial
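To make the breadth/depth structure above concrete, here is a minimal sketch of how one multiple-choice item could be represented and how coverage per domain and task category could be tallied. The class name `BenchmarkItem`, its field names, and the `coverage` helper are hypothetical illustrations, not the benchmark's actual release schema.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from typing import Iterable

# The three domains and two task categories named in the summary.
DOMAINS = {"speech", "music", "sound"}
CATEGORIES = {"information_extraction", "reasoning"}


@dataclass(frozen=True)
class BenchmarkItem:
    """Hypothetical record for one multiple-choice question (assumed schema)."""

    domain: str
    category: str
    skill: str  # one of the 27 skills; free text in this sketch
    question: str
    choices: tuple[str, ...]
    answer: str

    def __post_init__(self) -> None:
        # Reject records that fall outside the structure described above.
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain!r}")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if self.answer not in self.choices:
            raise ValueError("answer must be one of the choices")


def coverage(items: Iterable[BenchmarkItem]) -> Counter:
    """Count items per (domain, category) pair."""
    return Counter((item.domain, item.category) for item in items)
```

A table of `coverage` counts is a quick way to check that a sampled subset still spans every domain and both task categories.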
Evaluation Highlights
  • Gemini Pro 1.5, the top-performing proprietary model, achieves only 52.97% accuracy, trailing the human baseline of 81.85% by roughly 29 percentage points
  • Qwen2-Audio-Instruct achieves 52.50%, demonstrating that open-source models are competitive with proprietary ones in the audio domain
  • A cascaded approach using Qwen2-Audio captions + GPT-4o achieves 59.08%, outperforming all end-to-end LALMs by decoupling perception and reasoning
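The cascaded setup and the accuracy metric reported above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's evaluation code: `Item`, `cascaded_answer`, and `accuracy` are hypothetical names, the prompt wording is invented, and the captioner and reasoner are passed in as plain callables so that no specific model API is implied.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Item:
    """Hypothetical multiple-choice item: audio file, question, options, gold answer."""

    audio_path: str
    question: str
    choices: tuple[str, ...]
    answer: str


def cascaded_answer(
    item: Item,
    captioner: Callable[[str], str],  # perception step: audio path -> text caption
    reasoner: Callable[[str], str],   # reasoning step: text prompt -> chosen option
) -> str:
    """Answer a question from a caption alone, keeping perception and reasoning separate."""
    caption = captioner(item.audio_path)
    options = "\n".join(f"- {choice}" for choice in item.choices)
    prompt = (
        f"Audio description: {caption}\n"
        f"Question: {item.question}\n"
        f"Options:\n{options}\n"
        "Reply with exactly one option."
    )
    return reasoner(prompt).strip()


def accuracy(items: Sequence[Item], predictions: Sequence[str]) -> float:
    """Percentage of predictions that exactly match the gold answer."""
    if not items or len(items) != len(predictions):
        raise ValueError("items and predictions must be non-empty and equal in length")
    correct = sum(pred == item.answer for item, pred in zip(items, predictions))
    return 100.0 * correct / len(items)
```

In this arrangement the text-only reasoner never receives the audio, so any detail the captioner omits is unavailable at the reasoning step.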
Breakthrough Assessment
9/10
A much-needed benchmark that exposes the 'reasoning gap' in audio AI. By moving beyond ASR/classification to expert reasoning, it sets a new standard for LALM evaluation.