RubiCap (Rubric-Guided Reinforcement Learning): The proposed framework, which uses dynamic rubrics to compute RL rewards
RLVR (Reinforcement Learning with Verifiable Rewards): RL applied to domains where correctness can be objectively checked (e.g., math, code)
GRPO (Group Relative Policy Optimization): An RL algorithm that estimates each output's advantage by comparing its reward with the mean reward of a group of outputs sampled for the same input
SFT (Supervised Fine-Tuning): Training a model to mimic the outputs in a reference dataset
Dense Captioning: Generating detailed descriptions of images, including objects, attributes, and spatial relationships
Rubric: A sample-specific set of binary criteria (checklist) used to evaluate a generated caption
VLM-as-a-judge: Using a Vision-Language Model to score the quality of other models' outputs
Hallucination: Model-generated content that is not present in the source image
Catastrophic Forgetting: The tendency of a model to lose previously learned knowledge when fine-tuned on new data
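
As a rough illustration of how the Rubric and GRPO entries fit together, the sketch below turns a binary checklist into a scalar reward and then computes group-relative advantages. Everything in it is an assumption made for illustration: the function names `rubric_reward` and `group_relative_advantages` are hypothetical, scoring a caption by the fraction of criteria it satisfies is only one plausible aggregation, and dividing by the group standard deviation follows the common GRPO formulation rather than anything stated in the definitions above.

```python
from statistics import mean, pstdev


def rubric_reward(checks):
    """Fraction of binary rubric criteria satisfied (assumed aggregation)."""
    if not checks:
        raise ValueError("rubric must contain at least one criterion")
    return sum(1 for passed in checks if passed) / len(checks)


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each output relative to its group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical judge verdicts for three captions of the same image,
# each checked against the same four-item, sample-specific rubric.
group_checks = [
    [True, True, True, True],      # satisfies every criterion
    [True, True, False, False],    # satisfies half of them
    [False, False, False, False],  # satisfies none
]

rewards = [rubric_reward(checks) for checks in group_checks]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f}  advantage={advantage:+.3f}")
```

Captions that score above the group mean receive a positive advantage and those below it a negative one, so the policy is pushed toward outputs that satisfy more of the rubric than their siblings do. If every caption in a group earns the same reward, all advantages are zero and that group contributes no learning signal.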