Visual CoT: Visual Chain-of-Thought—using intermediate visual outputs (like bounding boxes) as reasoning steps to guide the final answer
RoI: Region-of-Interest—a specific rectangular area within an image that is relevant to the current task
MoVE: Mixture-of-Vision-Experts—combining multiple vision encoders (e.g., CLIP, ConvNeXt, EVA-02) to capture different types of visual information
Re-sampling: Extracting and reusing the existing visual tokens that fall inside a bounding box on the feature map, preserving positional context without re-running the encoder
Re-encoding: Cropping the image based on a bounding box, resizing/padding it, and passing it through the vision encoder again as a new image
Grounding: The task of linking textual concepts (e.g., 'the red ball') to specific spatial regions (bounding boxes) in an image
Stimulus-driven attention: Automatic bottom-up attention driven by salient objects in the image (represented by the initial image tokenization)
Goal-directed attention: Top-down, deliberate attention driven by user intent (represented by the language-guided RoI selection)
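
The difference between the Re-sampling and Re-encoding entries above can be made concrete with a small sketch. This is an illustration under stated assumptions, not the implementation of any particular model: `toy_encode` is a hypothetical stand-in for a vision encoder (it averages pixel intensities per patch), boxes are assumed to be normalized `(x0, y0, x1, y1)` tuples, and the padding step mentioned under Re-encoding is omitted because the example box is square.

```python
import math

def box_to_grid(box, grid_h, grid_w):
    """Map a normalized (x0, y0, x1, y1) box to half-open grid index ranges."""
    x0, y0, x1, y1 = box
    c0 = max(0, math.floor(x0 * grid_w))
    r0 = max(0, math.floor(y0 * grid_h))
    c1 = min(grid_w, max(c0 + 1, math.ceil(x1 * grid_w)))
    r1 = min(grid_h, max(r0 + 1, math.ceil(y1 * grid_h)))
    return r0, r1, c0, c1

def toy_encode(image, patch):
    """Hypothetical encoder stand-in: one token (mean intensity) per patch."""
    h, w = len(image), len(image[0])
    tokens = []
    for r in range(0, h, patch):
        row = []
        for c in range(0, w, patch):
            vals = [image[i][j]
                    for i in range(r, r + patch)
                    for j in range(c, c + patch)]
            row.append(sum(vals) / len(vals))
        tokens.append(row)
    return tokens

def resample(feature_map, box):
    """Re-sampling: reuse tokens already computed, keeping their grid positions."""
    r0, r1, c0, c1 = box_to_grid(box, len(feature_map), len(feature_map[0]))
    return [((r, c), feature_map[r][c])
            for r in range(r0, r1)
            for c in range(c0, c1)]

def reencode(image, box, size, patch):
    """Re-encoding: crop the pixels, resize (nearest neighbour), encode again."""
    h, w = len(image), len(image[0])
    x0, y0, x1, y1 = box
    top, left = int(y0 * h), int(x0 * w)
    bottom = max(top + 1, int(y1 * h))
    right = max(left + 1, int(x1 * w))
    crop = [row[left:right] for row in image[top:bottom]]
    ch, cw = len(crop), len(crop[0])
    resized = [[crop[i * ch // size][j * cw // size] for j in range(size)]
               for i in range(size)]
    return toy_encode(resized, patch)

# An 8x8 synthetic image whose pixel value encodes its position.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
feature_map = toy_encode(image, patch=2)   # 4x4 grid of tokens

box = (0.5, 0.5, 1.0, 1.0)                 # bottom-right quadrant
resampled = resample(feature_map, box)     # 4 reused tokens with positions
reencoded = reencode(image, box, size=8, patch=2)  # fresh 4x4 token grid

print(len(resampled), len(reencoded) * len(reencoded[0]))
```

In this toy setting, re-sampling returns the four existing tokens under the box together with their original grid coordinates and never calls the encoder again, while re-encoding spends a second encoder pass to describe the same region with a full grid of new tokens at finer granularity.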