LMM: Large Multimodal Model, an extension of Large Language Models that adds multi-sensory capabilities such as visual understanding
Visual Referring Prompting: A technique where users directly edit input images (e.g., drawing arrows, boxes, or text) to point to specific regions or provide instructions
Interleaved Image-text Inputs: Input sequences containing an arbitrary mix of images and text, which lets users supply context and few-shot examples flexibly
In-context Few-shot Learning: Providing the model with example pairs (input-output) within the prompt to guide its performance on a new query without updating model weights
Condition on Good Performance: A prompting strategy that explicitly instructs the model to act as an expert or verify its answer to encourage higher quality outputs
Zero-shot Learning: Asking the model to perform a task without providing any specific examples of that task in the prompt
Dense Captioning: Generating captions for specific regions or objects within an image, rather than just a single global description
OCR: Optical Character Recognition, the conversion of images of typed, handwritten, or printed text into machine-encoded text
Chain-of-Thought: A prompting technique that encourages the model to generate intermediate reasoning steps before arriving at a final answer
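Several of the prompting terms above (interleaved image-text inputs, zero-shot and in-context few-shot learning, chain-of-thought) can be combined in a single prompt. The sketch below is only an illustration of how such a prompt might be assembled as a list of parts. The names `ImagePart`, `TextPart`, and `build_prompt` are hypothetical and do not belong to any real library, the image paths are placeholders, and the phrase "Let's think step by step." is used here as one common chain-of-thought trigger, not as the only option.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class ImagePart:
    path: str  # hypothetical reference to an image file


@dataclass(frozen=True)
class TextPart:
    text: str


Part = Union[ImagePart, TextPart]


def build_prompt(
    examples: List[Tuple[str, str]],
    query_image: str,
    question: str,
    chain_of_thought: bool = False,
) -> List[Part]:
    """Assemble an interleaved image-text prompt (illustrative only).

    examples: (image_path, answer) pairs shown before the query.
              An empty list gives a zero-shot prompt.
    """
    parts: List[Part] = []
    # Few-shot examples: each image is followed by its question and answer.
    for image_path, answer in examples:
        parts.append(ImagePart(image_path))
        parts.append(TextPart(f"Q: {question}\nA: {answer}"))
    # The new query: image first, then the question with the answer left open.
    parts.append(ImagePart(query_image))
    suffix = " Let's think step by step." if chain_of_thought else ""
    parts.append(TextPart(f"Q: {question}\nA:{suffix}"))
    return parts
```

Calling `build_prompt([], "query.png", "How many apples are there?")` yields a zero-shot prompt with one image and one text part, while passing two example pairs and `chain_of_thought=True` yields a six-part few-shot prompt whose final text part asks for intermediate reasoning. No model weights are touched in either case; only the prompt changes.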