MLLM: Multi-Modal Large Language Model—an AI system that takes both text and images as input and generates text responses
SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs to teach it how to follow instructions
DPO: Direct Preference Optimization—an alignment algorithm that optimizes a model to prefer 'winning' responses over 'losing' ones without needing a separate reward model
IC9600: An Image Complexity assessment model used to score images based on visual clutter and detail
RAM: Recognize Anything Model—a computer vision model used to tag and identify objects within an image
MMMU: Massive Multi-discipline Multimodal Understanding and Reasoning—a multi-modal benchmark spanning many disciplines and requiring expert-level knowledge
WildVision: A benchmark for evaluating MLLMs on diverse, in-the-wild (real-world) vision-language tasks
LLaVA-Next: An improved version of LLaVA (Large Language and Vision Assistant) that supports higher-resolution image input and offers stronger reasoning
OCR: Optical Character Recognition—converting text within images into machine-readable text