Visual Chain-of-Thought: A reasoning process where intermediate steps involve generating new visual artifacts (edited images) rather than just text.
Structured Images: Images containing organized data representations like tables, bar charts, and scientific plots, distinct from natural scenes.
Selective Attention: The ability to focus processing resources on specific relevant parts of an input while ignoring distractions.
SFT: Supervised Fine-Tuning, i.e., training a pretrained model on labeled examples to adapt it to a specific task.
Visual Grounding: Linking textual concepts (e.g., 'the third column') to specific pixel regions in an image.
OpenCV: Open Source Computer Vision Library, a library of programming functions aimed mainly at real-time computer vision; used here for image editing.
Hallucination: When a model generates plausible-sounding information that is factually incorrect or not supported by the source.
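
To illustrate how several of these terms fit together, the following is a minimal sketch of one visual chain-of-thought step: a textual reference such as 'the third column' is grounded to a pixel box, and everything outside that box is blanked so that later steps attend only to the relevant region. It is written in plain Python rather than OpenCV to stay dependency-free, it assumes the column boundaries are already known, and the names `ground_column` and `mask_outside` as well as the toy image are hypothetical, not taken from the source.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel intensities (0-255)
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); x1 and y1 are exclusive


def ground_column(col_edges: List[int], height: int, index: int) -> Box:
    """Map a 1-based column reference (e.g. 'the third column') to a pixel box.

    col_edges holds the x-coordinates of the column boundaries, left to right.
    Assumption: these boundaries are given; detecting them is out of scope.
    """
    if not 1 <= index < len(col_edges):
        raise ValueError(f"column {index} is out of range")
    return (col_edges[index - 1], 0, col_edges[index], height)


def mask_outside(image: Image, box: Box, fill: int = 255) -> Image:
    """Return an edited copy in which every pixel outside the box is set to fill."""
    x0, y0, x1, y1 = box
    return [
        [px if (x0 <= x < x1 and y0 <= y < y1) else fill
         for x, px in enumerate(row)]
        for y, row in enumerate(image)
    ]


# Toy 2x6 "table" image with three columns, each 2 pixels wide (hypothetical data).
table = [
    [10, 10, 20, 20, 30, 30],
    [11, 11, 21, 21, 31, 31],
]

# Visual grounding: 'the third column' -> pixel box.
box = ground_column(col_edges=[0, 2, 4, 6], height=len(table), index=3)

# Image edit: keep only the grounded region; the original image is left untouched.
edited = mask_outside(table, box)
```

The edited image, not a text description of it, is the intermediate artifact passed to the next reasoning step. With OpenCV, a comparable edit would typically operate on a NumPy array, for example by drawing a highlight with `cv2.rectangle` instead of masking.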