LVLM: Large Vision-Language Model—AI models that can process and reason about both images and text
DPO: Direct Preference Optimization—a method to align models to preferences (like human corrections) without training a separate reward model
Atomic Sentence: A short, independent sentence describing a single specific fact or object in an image, used for precise verification
Hallucination: When a model generates text describing objects or details that are not actually present in the image
OCR: Optical Character Recognition—technology to detect and convert text within images into machine-readable text
SFT: Supervised Fine-Tuning—training a model on labeled examples (image-caption pairs)
Grounding DINO: An open-set object detection model that can find arbitrary objects specified by text prompts
RAM++: Recognize Anything Plus Model, an open-set image tagging model used to extract object labels
KL penalty: A regularizer based on the Kullback-Leibler divergence, used in RL/DPO to keep the trained model from drifting too far from the reference model
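
To make the DPO and KL penalty entries concrete, here is a minimal sketch of the standard per-example DPO loss in plain Python. The function name, argument names, and the default `beta=0.1` are illustrative assumptions, not taken from the source. The inputs are assumed to be summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the trained policy and under the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio))).

    beta (hypothetical default) scales the implicit KL penalty: a larger
    beta keeps the policy closer to the reference model.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen response relative to the rejected one, measured against the reference. No separate reward model appears anywhere in the computation.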