Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

📝 Paper Summary

Audio LLMs | Reinforcement Learning for Multi-modal Models | Reasoning in Multi-modal LLMs

Omni-R1 demonstrates that fine-tuning Audio LLMs with Reinforcement Learning on text-only data significantly improves audio question-answering performance by enhancing underlying text-based reasoning.

Core Problem

Current Audio LLMs often lack robust reasoning capabilities for complex audio question answering, and standard fine-tuning methods require expensive labeled audio-text pairs.

Why it matters:

Improving Audio LLMs is critical for understanding sounds, speech, and music in real-world contexts
Reinforcement Learning has improved text LLMs (e.g., DeepSeek-R1), but its application to multi-modal audio models is underexplored
Reliance on audio data for fine-tuning limits scalability due to data scarcity and computational costs

Concrete Example: When asked a question that requires external knowledge about a sound, a base Audio LLM might fail not because it mishears the audio, but because its reasoning chain is weak. Omni-R1 shows that strengthening the reasoning through text-only training also improves performance on the audio task.

Key Novelty

Omni-R1: RL Fine-tuning for Audio LLMs with Text-Only Surprise

Applies Group Relative Policy Optimization (GRPO) to Qwen2.5-Omni, a multi-modal LLM, to improve audio QA performance without complex chain-of-thought prompts
Generates synthetic training data (AVQA-GPT, VGGS-GPT) by prompting ChatGPT with audio captions to create large-scale Q&A pairs
Discovers that fine-tuning on text-only Q&A (without audio input) yields improvements comparable to audio-based training, suggesting that much of the gain stems from better text reasoning

Evaluation Highlights

Achieves SOTA on MMAU Test-mini (71.3%) and Test-full (71.2%) benchmarks using synthetic VGGS-GPT data
Outperforms base Qwen2.5-Omni by 5.4 percentage points on MMAU Test-mini (65.9% -> 71.3%)
Text-only fine-tuning on science questions (ARC-Easy) improves audio QA performance from 65.9% to 68.2%, nearly matching audio-based fine-tuning

Breakthrough Assessment

8/10

Achieves SOTA on key audio benchmarks and offers a significant scientific insight: improving text reasoning directly boosts multi-modal performance even without modality-specific training.

⚙️ Technical Details

Problem Definition

Setting: Audio Question Answering where the model selects the correct answer from multiple choices given audio and text input

Inputs: Audio input A (optional during training), Question text Q, Answer choices C

Outputs: Predicted answer choice
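
As a minimal sketch of this setting, a single training or evaluation item could be represented as below. The class name `AQAExample`, its fields, the file name, and the prompt wording are all illustrative assumptions rather than the paper's format; the point is only that the audio field may be absent when fine-tuning on text alone.

```python
from typing import List, NamedTuple, Optional


class AQAExample(NamedTuple):
    """One multiple-choice item (hypothetical structure, not the paper's format)."""
    question: str
    choices: List[str]
    answer: str                        # ground-truth choice text
    audio_path: Optional[str] = None   # None for text-only fine-tuning


def build_prompt(example: AQAExample) -> str:
    """Format the question and choices as a plain prompt (wording is assumed)."""
    options = ", ".join(f"'{c}'" for c in example.choices)
    return f"{example.question} Please choose the answer from the following options: [{options}]."


# An audio-based item and a text-only item share the same structure.
audio_item = AQAExample("What animal makes this sound?", ["dog", "cat"], "dog", "clip.wav")
text_item = AQAExample("Which gas do plants absorb?", ["oxygen", "carbon dioxide"], "carbon dioxide")
```

Keeping both kinds of item in one structure means the same training loop can consume either; only the input processing step needs to check whether `audio_path` is set.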

Pipeline Flow

Input Processing (Audio + Text)
Qwen2.5-Omni Model (Generation)
GRPO Update (Reinforcement Learning)

System Modules

Qwen2.5-Omni-7B

Multi-modal backbone that processes audio and text to generate answer choices

Model or implementation: Qwen2.5-Omni-7B

Reward Function

Evaluates generated answers for correctness

Model or implementation: Rule-based function
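
The summary states only that the reward is rule-based. One plausible form, shown here as a sketch under that assumption, is a binary check that the generated answer matches the correct choice after light normalization. The function names and the normalization rule are hypothetical and not taken from the paper.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def accuracy_reward(response: str, correct_choice: str) -> float:
    """Return 1.0 if the normalized response equals the correct choice, else 0.0."""
    return 1.0 if normalize(response) == normalize(correct_choice) else 0.0
```

A reward of this kind needs no learned reward model, which keeps the RL loop cheap; the trade-off is that any answer phrased differently from the reference choice scores zero.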

Modeling

Base Model: Qwen2.5-Omni-7B

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize expected reward while keeping policy close to reference.

Formally: GRPO objective maximizing advantage weighted by probability ratio, minus KL divergence penalty.
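
For reference, the GRPO objective as it is commonly written for text LLMs is sketched below. This assumes Omni-R1 uses the standard clipped form; the clip range \epsilon is not among the hyperparameters reported in this summary. The symbols q (prompt), o_i (the i-th sampled response), G (group size), r_i (scalar reward for response i), and \rho_{i,t} (token-level probability ratio) are notation introduced here for illustration.

```latex
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}

\rho_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}

\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\left( \rho_{i,t}\,\hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_i \right) - \beta\, D_{\text{KL}}\left[ \pi_\theta \,\|\, \pi_{\text{ref}} \right] \right) \right]
```

The first term is the advantage weighted by the probability ratio, and the second is the KL penalty that keeps the policy close to the reference model, scaled by the coefficient \beta.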

Adaptation: Full fine-tuning (implied by 'full-finetuning on GPUs with only 48GB GPU memory')

Training Data:

AVQA: 40k audio samples with human-annotated questions
AVQA-GPT: 40k samples with synthetic questions generated by ChatGPT from captions
VGGS-GPT: 54k filtered samples (from 182k) with synthetic questions generated by ChatGPT

Key Hyperparameters:

learning_rate: 1e-6
batch_size: 8 (effective, 1 per GPU * 4 GPUs * 2 accumulation steps)
kl_coefficient_beta: 0.04
temperature: 1.2
group_size: 4 responses per GRPO step
training_steps: 1000 (AVQA/AVQA-GPT), 2000 (VGGS-GPT)
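
The effective batch size and the group-relative advantage can be worked through in a few lines. This is a sketch, not the authors' code: the constant names are made up for illustration, and the use of the population standard deviation with a small `eps` is an assumption.

```python
from statistics import mean, pstdev
from typing import List

# Values taken from the hyperparameter list above.
PER_GPU_BATCH = 1
NUM_GPUS = 4
GRAD_ACCUM_STEPS = 2

effective_batch_size = PER_GPU_BATCH * NUM_GPUS * GRAD_ACCUM_STEPS


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, a group of four sampled answers, two of them correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, with no value network involved. When all four responses in a group receive the same reward, every advantage is zero and that group contributes no reward signal to the update.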

Compute: 1 node with 4 A6000 GPUs (48GB each)

Comparison to Prior Work

vs. R1-AQA: Uses stronger base model (Qwen2.5-Omni vs Qwen2-Audio) and simpler prompt without reasoning
vs. SARI: Uses only RL (GRPO) without SFT schedule, and simpler prompts without explicit reasoning steps
vs. Audio-Flamingo 2: Uses RL fine-tuning rather than just pre-training/SFT
vs. Turn-based LLM Audio Helpers [not cited in paper]: Omni-R1 integrates audio directly rather than using a separate ASR/captioning tool

Limitations

Synthetic data generation relies on captions which may contain hallucinations (e.g., about music not present)
Performance on mixed audio-music tasks (MMAR Mix1) slightly degraded compared to base model
Text-only fine-tuning improvements are smaller for models that already have strong text reasoning (Qwen2.5-Omni vs Qwen2-Audio)

Reproducibility

Code: https://github.com/roudimit/Omni-R1

Code, models, and datasets are planned for release at the repository above. The method uses the open-source Qwen2.5-Omni-7B. Data generation relies on closed-source ChatGPT and on captions from Qwen2-Audio.

📊 Experiments & Results

Evaluation Setup

Zero-shot audio question answering on standard benchmarks

Benchmarks:

MMAU (Original): audio question answering over sounds, music, and speech
MMAU (v05.15.25): audio question answering, revised version
MMAR: audio reasoning, with an emphasis on deep reasoning

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison on MMAU (Original) showing Omni-R1 variants outperforming the base model and prior SOTA.
MMAU (Original) Test-mini	Accuracy	65.9	71.3	+5.4
MMAU (Original) Test-full	Accuracy	68.4	71.2	+2.8
Ablation study on training data source (Human vs Synthetic).
MMAU (Original) Test-mini	Accuracy	68.6	69.9	+1.3
Analysis of Text-Only Fine-Tuning: Validating that removing audio during training still yields gains.
MMAU (Original) Test-mini	Accuracy	65.9	68.2	+2.3
Results on the newer MMAR benchmark.
MMAR	Accuracy	58.0	63.4	+5.4

Main Takeaways

Scaling training data with synthetic questions (VGGS-GPT) consistently outperforms human-annotated data (AVQA), even when the audio source is the same or overlapping
Text-only fine-tuning is surprisingly effective: training on science QA (ARC-Easy) improves audio QA performance almost as much as training on audio QA data
Most of the RL fine-tuning gain in the weaker model (Qwen2-Audio) comes from improved text reasoning; the stronger model (Qwen2.5-Omni) gains less from text-only training, though the improvement remains (65.9% to 68.2% on MMAU Test-mini with ARC-Easy)
Audio is not strictly necessary for fine-tuning an audio LLM if the goal is to improve the reasoning component that handles the audio features

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, GRPO)
Large Language Models (LLMs) and Multi-modal LLMs
Audio Question Answering benchmarks (MMAU, AVQA)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same input, eliminating the need for a separate value function

Audio LLM: A Large Language Model capable of processing and understanding audio inputs in addition to text

MMAU: Massive Multi-Task Audio Understanding, a benchmark for evaluating audio LLMs on understanding and reasoning over sounds, music, and speech

MMAR: Multi-Modal Audio Reasoning—a benchmark designed to test deep reasoning capabilities in audio LLMs

AVQA: Audio-Visual Question Answering dataset—used here for audio-based question answering training

KL divergence: Kullback-Leibler divergence, a measure of how much one probability distribution differs from another; used here as a penalty that keeps the fine-tuned policy from drifting too far from the reference model

SOTA: State-of-the-Art; the best reported result on a benchmark at the time of writing

SFT: Supervised Fine-Tuning—standard training on labeled data

VGGSound: A large-scale audio-visual dataset used here to generate synthetic training questions