LMM: Large Multimodal Model, a model that takes both text and images as input and typically generates text (e.g., LLaVA, GPT-4V)
Instruction Tuning: Training a model on a dataset of (instruction, output) pairs to improve its ability to follow user commands
SFT: Supervised Fine-Tuning, further training of a pre-trained model on a smaller, task-specific labeled dataset
Skill Repository: A collection of specialized pre-trained vision models (tools) that the LMM can call via API
Grounding DINO: An open-set object detection model that finds objects based on text descriptions
SAM: Segment Anything Model—a promptable segmentation system
Elo rating: A comparative ranking system used here to measure relative model performance against human preference
OCR: Optical Character Recognition—converting text in images into machine-encoded text
Hallucination: When a model generates incorrect or nonsensical information not supported by the input
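
As an illustration of the Elo rating entry above, here is a minimal sketch of the standard Elo update after one pairwise comparison between two models. The function names, the K-factor of 32, and the 400-point scale are conventional defaults chosen for this example; they are assumptions, not the settings used in the work this glossary belongs to.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    # 400 is the conventional Elo scale constant (assumed here).
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k is the update step size (hypothetical default).
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    score_b = 1.0 - score_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * (score_b - exp_b)
    return new_a, new_b


if __name__ == "__main__":
    # Two equally rated models; model A wins the comparison.
    print(update_elo(1000.0, 1000.0, 1.0))
```

Because the two updates are equal and opposite, the sum of both ratings stays constant, so the ratings only carry meaning relative to each other.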