SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning

📝 Paper Summary

Large Audio-Language Models (LALMs)
Reinforcement Learning for Reasoning

SARI enhances audio-language models by combining structured chain-of-thought supervised fine-tuning with curriculum-guided reinforcement learning to improve multi-step reasoning.

Core Problem

Current Large Audio-Language Models (LALMs) excel at perception but lack explicit multi-step reasoning capabilities, often failing at complex audio analysis tasks.

Why it matters:

Existing models primarily handle straightforward QA but struggle with complex reasoning required for real-world audio understanding
Applying reinforcement learning to audio reasoning is underexplored compared to text-only domains
Simple RL application without structured guidance often fails to induce correct reasoning or improve performance on difficult audio tasks

Concrete Example: When asked to identify a 'whoop' sound source, a standard model might guess 'Human' without explanation. SARI explicitly plans, captions ('loud, sharp vocalization'), reasons ('compare with bird/machine sounds'), and summarizes to confirm 'Human', avoiding errors through self-correction.

Key Novelty

Structured Audio Reasoning via Curriculum-Guided RL (SARI)

Extends Group-Relative Policy Optimization (GRPO), the RL method used to train DeepSeek-R1, to the audio modality, rewarding completions that give the correct answer in the required format
Implements a 'structured' Chain-of-Thought (CoT) format requiring explicit Planning, Captioning, Reasoning, and Summarizing steps
Uses a curriculum learning schedule during RL that orders training samples from easy to hard based on baseline pass rates, preventing the policy from collapsing on difficult examples

Architecture

The data construction and training pipeline: audio data is processed by LLMs to create structured and unstructured reasoning data, which is filtered for RL and then split between the SFT and RL phases.

Evaluation Highlights

Achieves state-of-the-art 67.08% accuracy on the MMAU test-mini benchmark using Qwen2.5-Omni as the base, surpassing standard supervised fine-tuning
+16.35 percentage points in average accuracy (49.20% → 65.55%) over the Qwen2-Audio-7B-Instruct base model on MMAU test-mini
Curriculum learning adds +1.97 percentage points (63.58% → 65.55%) over standard randomized RL training on MMAU test-mini

Breakthrough Assessment

8/10

Significantly advances audio reasoning by successfully transferring text-based RL reasoning techniques (GRPO) to audio, demonstrating that structured thought and curriculum are essential for this modality.

⚙️ Technical Details

Problem Definition

Setting: Multi-choice audio question answering requiring multi-step reasoning

Inputs: Audio clip A and a natural language question Q

Outputs: A reasoning chain (structured or unstructured) followed by the final answer option

Pipeline Flow

Audio Input Processing (Qwen2-Audio Encoder)
Reasoning Generation (Structured CoT: Planning → Caption → Reasoning → Summary)
Answer Generation (Final Option Selection)
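
The flow above depends on the decoder emitting the four reasoning sections in a fixed order before the final option. A minimal Python sketch of such a format check follows. The tag names (`<PLANNING>`, `<CAPTION>`, `<REASONING>`, `<SUMMARY>`, `<ANSWER>`) and the helper `parse_structured_output` are illustrative assumptions, not the paper's actual markup; the sample text reuses the 'whoop' example from the summary.

```python
import re
from typing import Dict, Optional

# Hypothetical section tags; the paper's real delimiters may differ.
SECTIONS = ["PLANNING", "CAPTION", "REASONING", "SUMMARY", "ANSWER"]

# One tagged block per section, in order, separated only by whitespace.
_PATTERN = re.compile(
    r"\s*"
    + r"\s*".join(rf"<{s}>(?P<{s.lower()}>.*?)</{s}>" for s in SECTIONS)
    + r"\s*",
    re.DOTALL,
)


def parse_structured_output(text: str) -> Optional[Dict[str, str]]:
    """Return the five sections if the text is exactly those blocks in order."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        return None
    parts = {name: body.strip() for name, body in match.groupdict().items()}
    # Reject bodies that smuggle in another section tag (duplicates, nesting).
    for body in parts.values():
        if any(f"<{s}>" in body or f"</{s}>" in body for s in SECTIONS):
            return None
    return parts


example = (
    "<PLANNING>Identify the source of the 'whoop' sound.</PLANNING>\n"
    "<CAPTION>Loud, sharp vocalization.</CAPTION>\n"
    "<REASONING>Compare with bird and machine sounds.</REASONING>\n"
    "<SUMMARY>The sound is most consistent with a human voice.</SUMMARY>\n"
    "<ANSWER>Human</ANSWER>"
)

parsed = parse_structured_output(example)
print(parsed["answer"] if parsed else "malformed")  # prints "Human"
```

A check of this kind could also supply the format half of a correctness reward, since a completion that skips or reorders a section simply fails to parse.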

System Modules

Audio Encoder

Process raw audio into embeddings

Model or implementation: Qwen2-Audio / Qwen2.5-Omni encoder

Reasoning Generator (Generation)

Generate explicit reasoning steps

Model or implementation: LLM Decoder (Qwen2-Audio/Omni)

Answer Head (Generation)

Output final classification

Model or implementation: LLM Decoder

Novel Architectural Elements

Integration of a 4-stage structured reasoning block (Planning, Caption, Reasoning, Summary) within the LALM inference pipeline
Curriculum-guided data scheduler that dynamically feeds inputs based on difficulty scores derived from base model pass rates
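
The easy-to-hard ordering can be sketched in a few lines of Python. This is a simplified static version under stated assumptions: the `Sample` class, the `pass_rate` field, and `build_curriculum` are hypothetical names, and the paper's scheduler may bucket or re-score samples differently. The sketch drops items the base model never solves (the 0% pass-rate filter described in the training data) and sorts the rest by descending pass rate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    sample_id: str
    pass_rate: float  # fraction of base-model attempts that were correct, in [0, 1]


def build_curriculum(samples: List[Sample]) -> List[Sample]:
    """Drop items the base model never solves, then order easy -> hard."""
    solvable = [s for s in samples if s.pass_rate > 0.0]
    # A higher pass rate means an easier item, so sort in descending order.
    return sorted(solvable, key=lambda s: s.pass_rate, reverse=True)


pool = [Sample("a", 0.25), Sample("b", 0.0), Sample("c", 1.0), Sample("d", 0.5)]
ordered = build_curriculum(pool)
print([s.sample_id for s in ordered])  # ['c', 'd', 'a']
```

Feeding `ordered` to the RL loop without shuffling gives the curriculum; shuffling it would correspond to the randomized baseline.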

Modeling

Base Model: Qwen2-Audio-7B-Instruct and Qwen2.5-Omni

Training Method: Two-stage: Supervised Fine-Tuning (SFT) followed by Group-Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Optimize policy without a critic network by normalizing rewards within a group.

Formally: GRPO objective (using group average as baseline)
Purpose: Enforce correct answer and formatting.

Formally: Reward = 1 if both the answer and the output format are correct, else 0 (described implicitly in the paper, not given as an equation)
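
To make the two objectives concrete, the sketch below computes a binary reward and the group-relative advantages that replace a critic. It is an illustration, not the paper's code: the function names are hypothetical, and dividing by the group standard deviation follows the commonly published GRPO form, which is an assumption here because the summary only states that the group average serves as the baseline. The clipped policy-ratio term of the full objective is omitted.

```python
import statistics
from typing import List


def binary_reward(answer_correct: bool, format_ok: bool) -> float:
    """Return 1.0 only when both the answer and the output format are correct."""
    return 1.0 if (answer_correct and format_ok) else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center each reward on the group mean and scale by the group std.

    The std scaling is an assumption (standard GRPO form); eps avoids a
    division by zero when every completion in the group gets the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled completions for one (audio, question) pair,
# each as (answer_correct, format_ok).
outcomes = [(True, True), (True, False), (False, True), (True, True)]
rewards = [binary_reward(a, f) for a, f in outcomes]
advantages = group_relative_advantages(rewards)
print(rewards)
print([round(a, 3) for a in advantages])
```

When all completions in a group receive the same reward, every advantage is zero and the group contributes no gradient, which is one reason items with a 0% pass rate are uninformative for this kind of training.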

Trainable Parameters: Full model fine-tuning (implied by context of SFT and RL)

Training Data:

SFT data: 2,000 samples with reasoning paths generated by Qwen2.5-72B
RL data: ~30,000 samples from AudioSet, MusicBench, MELD, and AVQA, filtered for difficulty (items with a 0% pass rate removed)

Key Hyperparameters:

sft_epochs: 3
sft_batch_size: 64
sft_learning_rate: 2e-5
rl_epochs: 1
rl_batch_size: 32
rl_learning_rate: 1e-6
kl_coefficient: 0
temperature: 1.0
num_generations_per_step: 4

Compute: Not reported in the paper

Comparison to Prior Work

vs. Audio-Reasoner: SARI adds a curriculum-guided RL stage after SFT, achieving higher performance than SFT alone
vs. R1-AQA: In contrast to R1-AQA's findings, SARI shows significant gains from explicit reasoning when it uses a specific 4-step structure and a curriculum
vs. Qwen2-Audio-7B-Instruct (Base): SARI forces explicit 'thinking' before answering, whereas the base model answers directly or with implicit reasoning
vs. Vision-R1 [not cited in paper]: SARI adapts the reasoning-via-RL paradigm specifically to audio by incorporating audio-specific captioning steps in the structured thought process

Limitations

Dataset size is relatively small (32k samples) with a limited proportion of speech data
Data construction relies on open-source models (Qwen2.5-72B), limiting quality to that of the teacher model
Experiments limited to Qwen2 family models; generalizability to other architectures is untested

Reproducibility

The paper mentions no replication artifacts: code, weights, and the constructed dataset (32k samples) are not explicitly linked or released in the provided text. The base models (Qwen2-Audio) are open source.

📊 Experiments & Results

Evaluation Setup

Multiple-choice Question Answering on audio tasks

Benchmarks:

MMAU Test-mini: complex audio reasoning across sound, music, and speech
MMSU: multi-domain knowledge reasoning, used as a generalization check

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Ablation on Qwen2-Audio-7B-Instruct showing the impact of SFT warm-up, RL, and Curriculum Learning.
MMAU Test-mini	Accuracy	49.20	65.55	+16.35
MMAU Test-mini	Accuracy	63.58	65.55	+1.97
MMAU Test-mini	Accuracy	59.53	63.58	+4.05
MMAU Test-mini	Accuracy	63.68	65.55	+1.87
State-of-the-art comparison using stronger Qwen2.5-Omni base model.
MMAU Test-mini	Accuracy	65.60	67.08	+1.48

Experiment Figures

Completion length of different models as training converges, comparing RL-only vs. SFT+RL and structured vs. unstructured reasoning.

Main Takeaways

Structured reasoning (Planning, Caption, Reasoning, Summary) yields more robust generalization than free-form unstructured reasoning
Supervised Fine-Tuning (SFT) warm-up is critical; RL alone (Cold Start) fails to produce meaningful reasoning chains
Curriculum learning (Easy-to-Hard) stabilizes RL training and improves convergence compared to random shuffling
Explicit reasoning significantly boosts performance on out-of-domain generalization tasks (MMSU)

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) concepts (policy, reward)
Large Language Models (LLMs) and Chain-of-Thought (CoT)
Audio signal processing basics

Key Terms

GRPO: Group-Relative Policy Optimization—an RL algorithm that optimizes a policy by comparing a group of outputs for the same input, removing the need for a separate value network

LALM: Large Audio-Language Model—a multimodal model capable of processing both audio and text inputs

Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer

Curriculum Learning: A training strategy where the model is exposed to easier examples first, gradually increasing difficulty to stabilize learning

Structured Reasoning: A specific CoT format enforced in this paper consisting of four sections: Planning, Caption, Reasoning, and Summary

SFT: Supervised Fine-Tuning—training the model on labeled data (input-output pairs) before reinforcement learning

Cold Start: Beginning RL directly from the base model with no SFT warm-up. The paper finds that the model must first learn basic formatting and reasoning patterns via SFT before RL can be effective