MIMIC-IT: MultI-Modal In-Context Instruction Tuning, the proposed dataset of 2.8M instruction-response pairs with multi-modal context
Syphus: The automated pipeline proposed in this paper for generating instruction-response pairs using LLMs (ChatGPT/GPT-4) and visual annotations
Otter: The multi-modal model trained on the MIMIC-IT dataset, based on the OpenFlamingo architecture
In-context learning: The ability of a model to learn a task from a few examples provided in the prompt (context) without parameter updates
Egocentric view: First-person perspective (like looking through someone's eyes), crucial for AR/VR applications
Cold-start: A strategy in the Syphus pipeline where initial in-context examples are manually curated or heuristically generated to guide the LLM before large-scale generation
Hallucination: When a model generates plausible but incorrect or factually baseless information
Elo rating: A comparative ranking system used here to evaluate model performance based on pairwise comparisons of responses
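
The Elo entry above can be made concrete with a small sketch of the standard Elo update applied to one pairwise comparison of model responses. The function names, the K-factor of 32, and the scale of 400 are illustrative assumptions (the conventional chess defaults), not settings confirmed by the paper.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A's response was preferred, 0.0 if B's was,
    and 0.5 for a tie. k=32 is an assumed K-factor, not the paper's.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two hypothetical models start at the same rating; A wins one comparison.
print(update_elo(1000.0, 1000.0, score_a=1.0))  # (1016.0, 984.0)
```

Because each update depends on the gap between the two current ratings, a win over a higher-rated model moves the ratings more than a win over a lower-rated one.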