GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same input, eliminating the need for a separate value function
Audio LLM: A Large Language Model capable of processing and understanding audio inputs in addition to text
MMAU: Massive Multi-Task Audio Understanding, a benchmark for evaluating audio LLMs on understanding and reasoning over sound, music, and speech
MMAR: Multi-Modal Audio Reasoning—a benchmark designed to test deep reasoning capabilities in audio LLMs
AVQA: Audio-Visual Question Answering dataset—used here for audio-based question answering training
KL divergence: Kullback-Leibler divergence, a measure of how much one probability distribution differs from another; used in RL fine-tuning as a penalty that keeps the fine-tuned model close to the reference model
SOTA: State-of-the-Art, the best performance reported to date by any method
SFT: Supervised Fine-Tuning, further training of a pretrained model on labeled examples
VGGSound: A large-scale audio-visual dataset used here to generate synthetic training questions
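
The GRPO and KL divergence entries above can be made concrete with a small sketch. It assumes the common GRPO formulation in which each reward is normalized by the mean and standard deviation of its own group; the glossary does not say whether this exact normalization is used here. The function names and the reward values are hypothetical and chosen only for illustration.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled output against the other outputs in its group.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    No learned value function is involved.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical rewards for four answers sampled for the same audio question.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Hypothetical next-token distributions of the policy and the reference model.
policy = [0.7, 0.2, 0.1]
reference = [0.5, 0.3, 0.2]
penalty = kl_divergence(policy, reference)
```

Outputs that score above their group's mean receive a positive advantage and those below receive a negative one, while the KL term grows as the policy drifts away from the reference model.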