AF-CoT-Train: The proposed large-scale synthetic dataset (1.24M samples) containing audio-specific chain-of-thought reasoning paths
AF-Reasoning-Eval: A new benchmark proposed in this paper consisting of two subsets: AQA (common sense reasoning) and Classification (distinguishing closely related sounds)
ALM: Audio Language Model—a multimodal model capable of understanding and reasoning about audio inputs
BFS-style search: Breadth-First Search—a data generation strategy where the LLM generates multiple parallel sub-questions to be answered by the ALM
DFS-style search: Depth-First Search—an interactive data generation strategy where the LLM and ALM have a multi-turn conversation to deepen reasoning
Qwen2.5-Omni: The specific ALM used as the 'teacher' model in the data generation pipeline to provide audio insights
Audio Flamingo: The specific family of ALMs (versions 2 and 3) used as the base models for fine-tuning in this study
MMAU: Massive Multi-Task Audio Understanding and Reasoning Benchmark, a standard benchmark for evaluating audio understanding capabilities
hard negatives: Incorrect multiple-choice options that are semantically or acoustically very similar to the correct answer, making discrimination difficult
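
The BFS-style and DFS-style entries above differ mainly in control flow, which a small sketch can make concrete. The code below is illustrative only and is not the paper's pipeline: the `LLM` and `ALM` callable signatures, the prompt wording, the `DONE` stop token, and the `n_sub` and `max_turns` parameters are all assumptions introduced here.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
LLM = Callable[[str], str]        # text prompt -> text reply
ALM = Callable[[str, str], str]   # (audio_path, question) -> answer

QA = Tuple[str, str]


def bfs_style(llm: LLM, alm: ALM, audio: str, question: str, n_sub: int = 3) -> List[QA]:
    """Breadth-first: the LLM proposes several sub-questions at once,
    and the ALM answers each one independently of the others."""
    raw = llm(f"List {n_sub} sub-questions, one per line, for: {question}")
    subs = [s.strip() for s in raw.splitlines() if s.strip()][:n_sub]
    return [(s, alm(audio, s)) for s in subs]


def dfs_style(llm: LLM, alm: ALM, audio: str, question: str, max_turns: int = 3) -> List[QA]:
    """Depth-first: each follow-up question is conditioned on the
    conversation so far, so the exchange deepens turn by turn."""
    history: List[QA] = []
    for _ in range(max_turns):
        context = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
        follow_up = llm(f"Question: {question}\n{context}\nAsk one follow-up, or reply DONE.")
        if follow_up.strip() == "DONE":  # assumed stop signal
            break
        history.append((follow_up, alm(audio, follow_up)))
    return history
```

In both functions the collected question-answer pairs would then be handed back to the LLM to compose a reasoning path; that final step is omitted here because its details are not given in these definitions.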