
Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hildegard Kuehne, Rogério Feris, James R. Glass
arXiv.org (2025)
MM RL Speech QA Benchmark

📝 Paper Summary

Audio LLMs · Reinforcement Learning for Multi-modal Models · Reasoning in Multi-modal LLMs
Omni-R1 demonstrates that fine-tuning Audio LLMs with Reinforcement Learning on text-only data significantly improves audio question-answering performance by enhancing underlying text-based reasoning.
Core Problem
Current Audio LLMs often lack robust reasoning capabilities for complex audio question answering, and standard fine-tuning methods require expensive labeled audio-text pairs.
Why it matters:
  • Improving Audio LLMs is critical for understanding sounds, speech, and music in real-world contexts
  • Reinforcement Learning has improved text LLMs (e.g., DeepSeek-R1), but its application to multi-modal audio models is underexplored
  • Reliance on audio data for fine-tuning limits scalability due to data scarcity and computational costs
Concrete Example: When asked a question that requires external knowledge about a sound, a base Audio LLM might fail not because it mishears the audio, but because its reasoning chain is weak. Omni-R1 shows that strengthening the reasoning through text-only training also improves performance on the audio task.
Key Novelty
Omni-R1: RL Fine-tuning for Audio LLMs, with a Text-Only Surprise
  • Applies Group Relative Policy Optimization (GRPO) to Qwen2.5-Omni, a multi-modal LLM, to improve audio QA performance without complex chain-of-thought prompts
  • Generates synthetic training data (AVQA-GPT, VGGS-GPT) by prompting ChatGPT with audio captions to create large-scale Q&A pairs
  • Finds that fine-tuning on text-only Q&A (without audio input) also improves audio QA performance, suggesting that much of the gain stems from better text-based reasoning
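The group-relative step that gives GRPO its name can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's training code: the exact-match reward, the function names, and the use of the population standard deviation are all hypothetical choices made here. The general idea of scoring several sampled answers to the same question and normalizing each reward against the group's mean and spread is the standard GRPO formulation.

```python
from statistics import mean, pstdev


def exact_match_reward(predicted: str, gold: str) -> float:
    """Hypothetical reward: 1.0 if the predicted option equals the reference, else 0.0."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of answers sampled for the same question.

    Answers that beat the group average get a positive advantage, the rest a
    negative one. eps (an assumed constant) avoids division by zero when every
    answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one multiple-choice question whose reference answer is "B".
samples = ["B", "A", "B", "C"]
rewards = [exact_match_reward(s, "B") for s in samples]
advantages = group_relative_advantages(rewards)
```

In this sketch the two correct samples receive an advantage close to +1 and the two incorrect ones close to -1. A group in which every answer earns the same reward yields all-zero advantages, so it contributes no learning signal.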
Evaluation Highlights
  • Achieves SOTA on MMAU Test-mini (71.3%) and Test-full (71.2%) benchmarks using synthetic VGGS-GPT data
  • Outperforms base Qwen2.5-Omni by 5.4 percentage points on MMAU Test-mini (65.9% to 71.3%)
  • Text-only fine-tuning on science questions (ARC-Easy) improves audio QA performance from the same 65.9% baseline to 68.2% (+2.3 percentage points), nearly matching audio-based fine-tuning
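The gains quoted above are absolute differences in accuracy, which a short calculation makes explicit. The scores come from the highlights; the dictionary keys and variable names are illustrative labels chosen here, and the sketch assumes both scores are measured against the same 65.9% baseline.

```python
# Accuracies (in %) as reported in the highlights above.
base = 65.9
results = {
    "VGGS-GPT (synthetic audio QA)": 71.3,
    "ARC-Easy (text only)": 68.2,
}

# Absolute gain over the base model, in percentage points.
gains = {name: round(score - base, 1) for name, score in results.items()}

# Fraction of the audio-based gain that the text-only run recovers.
text_share = round(
    gains["ARC-Easy (text only)"] / gains["VGGS-GPT (synthetic audio QA)"], 2
)

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
print(f"text-only share of audio-based gain: {text_share}")
```

The ratio is only a rough indicator, since the two runs train on different datasets.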
Breakthrough Assessment
8/10
Achieves SOTA on key audio benchmarks and offers a significant scientific insight: improving text reasoning directly boosts multi-modal performance even without modality-specific training.