CVR: Compositional Visual Reasoning—a paradigm that decomposes visual tasks into structured steps (objects, attributes, relations) rather than mapping inputs directly to answers.
Monolithic Visual Reasoning: End-to-end architectures (like CLIP or standard LLaVA) that map image and text inputs to a prediction in a single pass, with no explicit intermediate reasoning steps.
Grounding: The process of linking abstract concepts (e.g., 'the red ball') to specific regions or pixels in the visual input.
Chain-of-Thought: A reasoning technique where the model generates a sequence of intermediate logical steps before producing the final answer.
LLM: Large Language Model, an AI model trained on vast amounts of text to understand and generate human language.
VLM: Vision-Language Model, an AI model that processes and relates image and text inputs.
Systematic Generalization: The ability to understand and reason about novel combinations of known concepts (e.g., recognizing a 'purple giraffe' after seeing 'purple' and 'giraffe' separately).
Hallucination: When a model generates plausible but factually incorrect information, often driven by training data biases rather than the actual input.
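
To make the Systematic Generalization entry concrete, here is a minimal sketch of how a compositional train/test split might be built: chosen attribute-object combinations are held out for testing, while each individual attribute and object still appears in training. The function name `compositional_split` and the toy vocabulary are hypothetical, chosen only for illustration; they are not taken from any particular benchmark.

```python
from itertools import product


def compositional_split(attributes, objects, held_out):
    """Split attribute-object pairs into seen (train) and novel (test) combinations.

    Hypothetical helper for illustration only.
    """
    all_pairs = set(product(attributes, objects))
    test = set(held_out) & all_pairs
    train = all_pairs - test

    # Each primitive in a held-out pair must still occur somewhere in training;
    # otherwise the test would measure unseen concepts, not unseen combinations.
    train_attrs = {a for a, _ in train}
    train_objs = {o for _, o in train}
    for a, o in test:
        if a not in train_attrs or o not in train_objs:
            raise ValueError(f"a primitive in {(a, o)} never appears in training")

    return sorted(train), sorted(test)


train, test = compositional_split(
    attributes=["purple", "yellow"],
    objects=["giraffe", "ball"],
    held_out=[("purple", "giraffe")],
)
print("train:", train)
print("test: ", test)
```

A model that answers correctly on the held-out pair has seen 'purple' and 'giraffe' only in other combinations, which is the behaviour the definition describes.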