VLLM: Vision Large Language Model—an AI model that takes images and text instructions as input and generates responses
OCR: Optical Character Recognition—the conversion of images of typed, handwritten, or printed text into machine-encoded text
Hallucination: A phenomenon where a model generates plausible-sounding but incorrect or factually baseless information
SFT: Supervised Fine-Tuning—training a model on labeled examples to improve its performance on specific tasks
Seed Question: A representative question template designed by experts to guide the automated generation of diverse instructions
MME: A comprehensive evaluation benchmark for Multimodal Large Language Models
LLaVA-Bench: LLaVA-Bench (In-the-Wild), a benchmark assessing VLLM performance on challenging, real-world images
GPT-4V: GPT-4 with Vision—a multimodal model capable of analyzing image inputs