Visual Chain-of-Thought: A reasoning process where intermediate steps involve generating and analyzing visual artifacts (sketches) rather than just text.
Auxiliary lines: Extra lines drawn on a geometry diagram to reveal relationships (e.g., parallel lines, triangles) needed to complete a proof.
SoM: Set-of-Mark, a visual prompting technique where objects in an image are overlaid with numbered masks to help LMs reference them.
V*Bench: A benchmark for evaluating MLLMs on detailed visual grounding and reasoning tasks.
BLINK: A benchmark focusing on visual perception tasks that are easy for humans but hard for current MLLMs (e.g., spatial reasoning, depth).
Grounding-DINO: An open-set object detection model that finds objects based on text queries.
Segment Anything (SAM): A model capable of generating segmentation masks for any object in an image.