Audio Flamingo Sound-CoT Technical Report: Improving Chain-of-Thought Reasoning in Sound Understanding

Zhifeng Kong, Arushi Goel, Joao Felipe Santos, Sreyan Ghosh, Rafael Valle, Wei Ping, Bryan Catanzaro
NVIDIA
arXiv (2025)
MM Reasoning Benchmark Speech

📝 Paper Summary

Audio Language Models (ALMs) · Chain-of-Thought (CoT) Reasoning
Audio Flamingo Sound-CoT improves sound understanding in audio language models by fine-tuning on a large-scale synthetic chain-of-thought dataset generated via novel ALM-LLM interactive pipelines.
Core Problem
Current Audio Language Models (ALMs) lack the deep reasoning capabilities found in LLMs/VLMs because high-quality, audio-specific Chain-of-Thought (CoT) training data is scarce and hard to scale.
Why it matters:
  • Existing synthetic CoT methods rely on captions or metadata, ignoring audio-specific reasoning needs (e.g., temporal relationships, subtle acoustic properties)
  • Standard benchmarks focus on surface-level perception, failing to measure common-sense reasoning or fine-grained discrimination between similar sounds
  • Without explicit reasoning training, ALMs struggle with complex tasks requiring intermediate steps, limiting their robustness and transparency
Concrete Example: When asked 'Where does this activity likely happen?', a standard ALM might guess based on a single sound. In contrast, the proposed method generates a reasoning chain identifying multiple sound events (e.g., wind, waves) and ruling out distractors (e.g., 'no traffic sounds') before concluding 'beach'.
Key Novelty
Audio Flamingo Sound-CoT & AF-CoT-Train
  • Develops four data generation pipelines where a text-LLM and an ALM interactively generate reasoning chains (e.g., LLM asks sub-questions, ALM answers based on audio), ensuring audio-grounded reasoning
  • Constructs AF-Reasoning-Eval, a benchmark designed around 'hard negatives' (closely related answer choices) and common-sense QA to rigorously test reasoning beyond simple recognition
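The interactive generation idea in the first bullet can be sketched as a simple loop. This is a minimal illustration, not the authors' pipeline: `generate_chain`, `ReasoningChain`, and the `ask_llm`, `ask_alm`, and `conclude` callables are hypothetical names, and the toy stubs return canned text modeled on the 'beach' example above rather than calling any real model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str]  # (sub-question from the LLM, answer from the ALM)


@dataclass
class ReasoningChain:
    question: str
    steps: List[Step] = field(default_factory=list)
    answer: str = ""


def generate_chain(
    question: str,
    audio: bytes,
    ask_llm: Callable[[str, List[Step]], Optional[str]],
    ask_alm: Callable[[bytes, str], str],
    conclude: Callable[[str, List[Step]], str],
    max_steps: int = 4,
) -> ReasoningChain:
    """Alternate between a text LLM (asks) and an ALM (answers from audio)."""
    chain = ReasoningChain(question)
    for _ in range(max_steps):
        sub_question = ask_llm(question, chain.steps)
        if sub_question is None:  # the LLM has enough evidence to stop
            break
        chain.steps.append((sub_question, ask_alm(audio, sub_question)))
    chain.answer = conclude(question, chain.steps)
    return chain


# Toy stand-ins (assumptions for illustration only; no real models involved).
PLANNED = ["What sound events are present?", "Are there traffic sounds?"]
CANNED = {PLANNED[0]: "wind, waves", PLANNED[1]: "no"}


def toy_llm(question: str, steps: List[Step]) -> Optional[str]:
    return PLANNED[len(steps)] if len(steps) < len(PLANNED) else None


def toy_alm(audio: bytes, sub_question: str) -> str:
    return CANNED[sub_question]


def toy_conclude(question: str, steps: List[Step]) -> str:
    return "beach"


chain = generate_chain(
    "Where does this activity likely happen?", b"", toy_llm, toy_alm, toy_conclude
)
```

Because only the ALM stand-in receives the audio, every recorded step is an answer conditioned on the sound itself, which is the sense in which the resulting chain is audio-grounded.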
Evaluation Highlights
  • Audio Flamingo 3 Sound-CoT achieves state-of-the-art 79.83% on MMAU-Sound, outperforming GPT-4o Audio (63.20%) and Gemini-2.5-Pro (70.63%)
  • Audio Flamingo 2 Sound-CoT (3B) outperforms larger 7B baselines such as Qwen2-Audio on AF-Reasoning-Eval-AQA, with a +12.16% margin over Audio-Reasoner
  • Significant gains on fine-grained classification: +40.93% improvement on AF-Reasoning-Eval-CLS-full for Audio Flamingo 2 after CoT fine-tuning
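For reference, the MMAU-Sound margins implied by the first highlight can be computed directly. This short sketch assumes the three reported scores are accuracies on the same scale; the variable names are illustrative.

```python
# Scores as reported in the highlights above (MMAU-Sound, percent).
scores = {
    "Audio Flamingo 3 Sound-CoT": 79.83,
    "GPT-4o Audio": 63.20,
    "Gemini-2.5-Pro": 70.63,
}

best_name = "Audio Flamingo 3 Sound-CoT"
best = scores[best_name]

# Gap between the best model and each baseline, rounded to two decimals.
margins = {
    name: round(best - score, 2)
    for name, score in scores.items()
    if name != best_name
}

for name, gap in margins.items():
    print(f"{best_name} leads {name} by {gap:.2f} points")
```

Under that assumption the lead is 16.63 points over GPT-4o Audio and 9.20 points over Gemini-2.5-Pro.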
Breakthrough Assessment
8/10
Strong practical contribution: creates the largest open audio CoT dataset (1.24M) using a novel interactive generation method and sets new SOTA on multiple benchmarks, surpassing proprietary models.