LLaVA: Large Language and Vision Assistant—the end-to-end trained large multimodal model introduced in this paper
Instruction Tuning: Fine-tuning language models on datasets consisting of (instruction, output) pairs to improve their ability to follow user commands
CLIP: Contrastive Language-Image Pre-training—a model that learns to associate images and text in a shared embedding space, used here as the visual encoder
Vicuna: An open-source chatbot trained by fine-tuning LLaMA on user-shared conversations from ShareGPT, used here as the language decoder
ScienceQA: A large-scale multimodal science question dataset annotated with lectures and explanations, used for evaluation
CC3M: Conceptual Captions 3M, a dataset of image-text pairs used here in the pre-training stage that aligns visual features with the language model
GPT-4: A large multimodal model from OpenAI; here, the text-only version is used to generate training data, and the multimodal version is a reference baseline
SoTA: State-of-the-Art, the best performance reported to date on a specific task