MLLM: Multimodal Large Language Model—an AI model capable of processing and generating both text and image data
GRPO: Group Relative Policy Optimization—an RL algorithm that scores each sampled response relative to the other responses in its group, which reduces variance in the policy update
CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer
VQA: Visual Question Answering—the task of answering natural language questions about an image
Bounding Box: A rectangle defined by corner coordinates (usually x_min, y_min, x_max, y_max) that encloses a specific object in an image
REC: Referring Expression Comprehension—the task of localizing a specific image region described by a text query
BLEU: Bilingual Evaluation Understudy—a metric that scores generated text by measuring its n-gram overlap with one or more reference texts
RL: Reinforcement Learning—training models by rewarding desired behaviors rather than providing explicit correct answers for every step
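
Two of the entries above (Bounding Box and GRPO) are easier to picture with a small example. The sketch below is illustrative only and is not taken from the source. It assumes boxes in (x_min, y_min, x_max, y_max) order, uses intersection-over-union (IoU) as one common way to compare a predicted box with a reference box, and assumes the usual group-relative formulation in which each reward is normalized by the mean and standard deviation of its group. The function names `iou` and `group_relative_advantages` and the `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle (if any).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (assumed formulation).

    `rewards` holds one scalar reward per response sampled for the same prompt.
    `eps` is an assumed small constant that avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that score above their group's average receive a positive advantage and those below receive a negative one, so the baseline comes from the group itself rather than from a separately learned estimate.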