
EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning

Zheng Xing, Xiaowei Hu, Chi-Wing Fu, Wenhai Wang, Jifeng Dai, Pheng-Ann Heng
The Chinese University of Hong Kong, Shanghai Artificial Intelligence Laboratory, Tsinghua University
arXiv.org (2025)
MM RL Reasoning Benchmark

📝 Paper Summary

Multimodal Large Language Models (MLLMs) · Audio-Visual Reasoning · Reinforcement Learning (RL) for Reasoning
EchoInk-R1 enhances audio-visual reasoning in multimodal LLMs by applying Group Relative Policy Optimization (GRPO) to fine-tune a model on a curated audio-image question-answering dataset.
Core Problem
Current MLLMs struggle with deep cross-modal reasoning, often relying on shallow correlations instead of coherent, multi-step inference when integrating audio and visual signals.
Why it matters:
  • Existing RL-enhanced models focus mainly on text-only, audio-language, or vision-language tasks, neglecting integrated audio-visual reasoning.
  • MLLMs need to move beyond simple perception to handle complex decision-making scenarios involving multiple synchronized modalities.
  • Without explicit reasoning training, models fail to resolve ambiguities where audio and visual cues conflict or are underspecified.
Concrete Example: In an ambiguous scenario where an image shows a vehicle but the specific event is unclear, a standard model might guess based on visual features alone. EchoInk-R1, however, initially guesses wrong, then explicitly questions its assumption (e.g., 'Wait, the sound indicates...'), and corrects itself by integrating the auditory siren sound to identify the correct emergency context.
Key Novelty
EchoInk-R1: Audio-Visual Reasoning via GRPO
  • Applies Group Relative Policy Optimization (GRPO) specifically to synchronized audio-image question answering, rewarding both accuracy and structured reasoning traces (<think> tags).
  • Introduces a curated dataset (AVQA-R1-6K) designed to force models to integrate both auditory and visual cues rather than relying on a single modality.
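To make the reward design above more concrete, here is a minimal sketch of how an accuracy reward and a format reward could be combined and turned into group-relative advantages, which is the core idea of GRPO. It is an illustration under stated assumptions, not the authors' implementation: the `<answer>` tag, the equal weighting of the two rewards, the exact-match answer check, and all function names are hypothetical. Only the `<think>` tag comes from the summary above.

```python
import re
from statistics import mean, pstdev

# Assumed output template: reasoning in <think>, final choice in <answer>.
# The <answer> tag is an assumption; the summary only mentions <think> tags.
TEMPLATE = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL
)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed template, else 0.0."""
    return 1.0 if TEMPLATE.fullmatch(completion) else 0.0


def accuracy_reward(completion: str, gold: str) -> float:
    """Return 1.0 if the extracted answer matches the gold answer (assumed exact match)."""
    match = TEMPLATE.fullmatch(completion)
    if match is None:
        return 0.0
    predicted = match.group(2).strip().lower()
    return 1.0 if predicted == gold.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and standard deviation of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def score_group(completions: list[str], gold: str) -> tuple[list[float], list[float]]:
    """Score several sampled completions for one question (equal weights assumed)."""
    rewards = [accuracy_reward(c, gold) + format_reward(c) for c in completions]
    return rewards, group_relative_advantages(rewards)


if __name__ == "__main__":
    samples = [
        "<think>Wait, the sound indicates a siren.</think><answer>ambulance</answer>",
        "<think>It looks like a truck.</think><answer>truck</answer>",
        "ambulance",  # no tags, so it earns neither reward under this sketch
    ]
    rewards, advantages = score_group(samples, gold="ambulance")
    for r, a in zip(rewards, advantages):
        print(f"reward={r:.1f} advantage={a:+.3f}")
```

Because advantages are computed relative to the other samples for the same question, a completion is reinforced only when it does better than its siblings. If every sample in a group earns the same reward, all advantages are zero and that question contributes no learning signal.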
Evaluation Highlights
  • +5.24 percentage-point accuracy improvement on the AVQA-R1-6K validation set compared to the base Qwen2.5-Omni-7B model.
  • Reaches this gain after only 562 reinforcement learning steps, suggesting high sample efficiency.
  • Exhibits emergent 'aha moments' where the model self-corrects initial wrong assumptions within the reasoning trace.
Breakthrough Assessment
7/10
First unified RL framework for open-world audio-visual-text reasoning. Strong empirical gains and qualitative evidence of self-correction ('aha moments'), though dataset scale is relatively small.