LMM: Large Multi-Modal Model—an AI model capable of processing and generating content across multiple modalities, typically text and images.
SFT: Supervised Fine-Tuning—the phase where a pre-trained model is trained on labeled instruction-following data to improve its ability to perform specific tasks.
GPT4-Vision: A proprietary multimodal model from OpenAI that can understand images and describe them in detail.
Modality Alignment: The process of training a model so that representations from different modalities (e.g., image and text) correspond correctly to each other.
CLIP: Contrastive Language-Image Pre-training—a model trained to predict which caption goes with which image, used here as the vision encoder.
Projector: A neural network component (often an MLP) that maps visual features from the vision encoder into the embedding space of the language model.
Hallucination: When a model generates plausible-sounding content that is factually incorrect or not supported by the source input.
Share-Captioner: The specific captioning model developed in this paper, trained on GPT4-Vision outputs to generate detailed captions at scale.
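
To make the Projector entry concrete, the following is a minimal sketch of a two-layer MLP that maps per-patch vision features into a language model's embedding width. It is illustrative only and is not the paper's implementation: the class name `MLPProjector`, the toy dimensions, the GELU activation, and the random initialization are all assumptions, and it uses plain Python rather than a deep learning framework so that it runs without dependencies.

```python
import math
import random


def gelu(x: float) -> float:
    """Tanh approximation of the GELU activation (an assumed choice)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector (lists of floats)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(d_in, d_out, rng):
    """Random weights of shape (d_out, d_in) and a zero bias; hypothetical init."""
    scale = 1.0 / math.sqrt(d_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


class MLPProjector:
    """Hypothetical two-layer MLP: vision feature width -> LLM embedding width."""

    def __init__(self, vision_dim: int, llm_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(vision_dim, llm_dim, rng)
        self.w2, self.b2 = init_linear(llm_dim, llm_dim, rng)

    def __call__(self, patch_features):
        # Project each patch feature independently; the patch count is unchanged.
        projected = []
        for feat in patch_features:
            hidden = [gelu(h) for h in linear(feat, self.w1, self.b1)]
            projected.append(linear(hidden, self.w2, self.b2))
        return projected


# Toy sizes chosen for readability, not taken from any real model.
VISION_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 4

rng = random.Random(1)
# Stand-in for the vision encoder's output: one feature vector per image patch.
patch_features = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]

projector = MLPProjector(VISION_DIM, LLM_DIM)
visual_tokens = projector(patch_features)

print(len(visual_tokens), len(visual_tokens[0]))  # 4 16
```

The projected vectors have the same width as the language model's token embeddings, so they can be placed alongside text embeddings in the input sequence. Training the projector so that these vectors are meaningful to the language model is one part of what the Modality Alignment entry describes.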