GRPO: Group Relative Policy Optimization—an RL algorithm that samples a group of outputs for the same input and scores each one relative to the others in the group, removing the need for a separate value network
LALM: Large Audio-Language Model—a multimodal model capable of processing both audio and text inputs
Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer
Curriculum Learning: A training strategy where the model is exposed to easier examples first, gradually increasing difficulty to stabilize learning
Structured Reasoning: A specific CoT format enforced in this paper consisting of four sections: Planning, Caption, Reasoning, and Summary
SFT: Supervised Fine-Tuning—training the model on labeled data (input-output pairs) before reinforcement learning
Cold Start: The initial phase of training where the model must learn basic formatting and reasoning patterns via SFT before RL can be effective
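
As a rough illustration of the group comparison in the GRPO entry, the sketch below turns the rewards of several outputs sampled for one input into group-relative advantages by subtracting the group mean and dividing by the group standard deviation. This is the commonly described GRPO normalization, not necessarily the exact variant used in this paper; the function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each reward relative to its own group (hypothetical helper).

    rewards: scalar rewards for outputs sampled from the SAME input.
    eps: assumed small constant that avoids division by zero when
         every output in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled outputs for one input; 1.0 = correct, 0.0 = incorrect
# (made-up rewards for illustration only).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every output gets the same reward carries no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because the baseline is the group's own mean reward, no learned value network is needed: outputs that beat their siblings get positive advantages, the rest get negative ones, and a group with identical rewards contributes zero advantage.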