Scene Graph (SG): A structured representation of an image where nodes are objects and edges represent relationships (e.g., 'cup on table') or attributes (e.g., 'red cup')
LMM: Large Multimodal Model—an AI model capable of processing and reasoning over both text and images (e.g., GPT-4V, LLaVA)
Compositionality: The ability to understand a complex scene by understanding its parts (objects) and how they combine (relationships/attributes), rather than just listing isolated elements
Chain-of-Thought (CoT): A prompting technique where the model is asked to generate intermediate reasoning steps before the final answer
Zero-shot: Performing a task without seeing any specific training examples for that task beforehand
Catastrophic forgetting: A phenomenon where a model loses previously learned knowledge or capabilities when it is trained on new data (e.g., fine-tuning on scene graphs can erode the model's general knowledge)
JSON: JavaScript Object Notation, a structured text format used here to constrain the model's scene graph output to a fixed, machine-parseable structure
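
To make the Scene Graph and JSON entries concrete, here is a minimal sketch of how the 'red cup on table' example might be encoded as JSON and checked after parsing. The field names (`objects`, `attributes`, `relationships`, `subject`, `predicate`, `object`) and both helper functions are assumptions chosen for illustration; they are not a schema taken from the source.

```python
import json

# Hypothetical schema: all field names below are illustrative assumptions.
scene_graph = {
    "objects": [
        {"id": 0, "name": "cup", "attributes": ["red"]},
        {"id": 1, "name": "table", "attributes": []},
    ],
    "relationships": [
        {"subject": 0, "predicate": "on", "object": 1},
    ],
}


def validate_scene_graph(sg: dict) -> bool:
    """Return True if every relationship refers to a declared object id."""
    ids = {obj["id"] for obj in sg["objects"]}
    return all(
        rel["subject"] in ids and rel["object"] in ids
        for rel in sg["relationships"]
    )


def describe(sg: dict) -> list[str]:
    """Render each relationship edge as a 'subject predicate object' phrase."""
    names = {obj["id"]: obj["name"] for obj in sg["objects"]}
    return [
        f'{names[rel["subject"]]} {rel["predicate"]} {names[rel["object"]]}'
        for rel in sg["relationships"]
    ]


# Round-trip through JSON text, as one would when parsing a model's output.
text = json.dumps(scene_graph, indent=2)
parsed = json.loads(text)

print(validate_scene_graph(parsed))
print(describe(parsed))
```

Keeping objects and relationships in separate lists, with edges referring to objects by id, lets a parser reject outputs whose relationships mention objects that were never declared.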