RLVR: Reinforcement Learning with Verifiable Rewards, a paradigm in which models are trained on objective, automatically checkable success signals (often binary, such as whether a math answer is correct) rather than subjective human preferences.
GRPO: Group Relative Policy Optimization, an RL algorithm that estimates each output's advantage by comparing its reward with the rewards of other outputs sampled for the same input, which removes the need for a separate value network.
LVLM: Large Vision-Language Model, a model that takes images and text as input and typically generates text.
SFT: Supervised Fine-Tuning—training a model to mimic specific ground-truth outputs provided in a dataset.
LLM-as-a-judge: Using a Large Language Model to evaluate the quality of text outputs, often used as a reward signal in RL but prone to biases.
VQA: Visual Question Answering—the task of answering questions about an image.
Prism Framework: An evaluation framework for image captioning that assesses quality based on informativeness and hallucination rates.
KL-divergence penalty: A regularizer used in RL to prevent the trained policy from deviating too far from the reference model's behavior.
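
To make the GRPO entry concrete, here is a minimal sketch of a group-relative advantage computation. It assumes the commonly described form in which each reward is normalized by the mean and standard deviation of its own group. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are illustrative assumptions, not details taken from the source.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per output sampled for the same input.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary verifiable rewards for four answers to one prompt
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers receive positive advantages and incorrect ones negative advantages. When every output in a group earns the same reward, all advantages are zero and that group contributes no learning signal.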