Reasoning Segmentation: A proposed task where the model must generate a segmentation mask based on an implicit query requiring complex reasoning or world knowledge, rather than an explicit object description
Referring Segmentation: A standard task where the model segments an object based on an explicit text description (e.g., 'the man in the blue shirt')
<SEG> token: A special token added to the LLM vocabulary; its hidden state embedding is used to condition the mask decoder
LoRA: Low-Rank Adaptation—a technique to fine-tune large models by updating only a small set of low-rank matrices while freezing the main weights
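gIoU: Generalized Intersection over Union, a segmentation metric computed as the mean of the per-image IoU scores, so every image counts equally regardless of object size
cIoU: Cumulative Intersection over Union, a segmentation metric computed as the intersection summed over all images divided by the union summed over all images, which weights large objects more heavily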
SAM: Segment Anything Model—a foundation model for image segmentation used here as the vision backbone
LLaVA: Large Language and Vision Assistant—a multimodal LLM that connects a vision encoder to a language model (Vicuna/Llama)
Embedding-as-mask: The proposed method of using the LLM's hidden embedding of a specific token to generate a segmentation mask
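
The gIoU and cIoU entries differ only in where the averaging happens, which is easy to miss in prose. The following is a minimal sketch, not the evaluation code of any particular paper: the function name `giou_ciou` and the toy masks are hypothetical, and treating an image with an empty prediction and an empty target as IoU 1.0 is an assumption made here to avoid dividing by zero.

```python
import numpy as np

def giou_ciou(preds, targets):
    """Return (gIoU, cIoU) for paired sequences of binary masks."""
    per_image_ious = []
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        pred = np.asarray(pred, dtype=bool)
        target = np.asarray(target, dtype=bool)
        inter = int(np.logical_and(pred, target).sum())
        union = int(np.logical_or(pred, target).sum())
        # Assumption: empty prediction and empty target count as a perfect match.
        per_image_ious.append(inter / union if union > 0 else 1.0)
        total_inter += inter
        total_union += union
    # gIoU: average the IoU of each image.
    giou = float(np.mean(per_image_ious))
    # cIoU: pool intersections and unions over the dataset, then divide once.
    ciou = total_inter / total_union if total_union > 0 else 1.0
    return giou, ciou

# Toy data: one small object that is half right, one large object that is exact.
small_pred = [[1, 0], [0, 0]]
small_target = [[1, 1], [0, 0]]
large_pred = np.ones((2, 4), dtype=int)
large_target = np.ones((2, 4), dtype=int)

giou, ciou = giou_ciou([small_pred, large_pred], [small_target, large_target])
# Per-image IoUs are 1/2 and 8/8, so gIoU = 0.75; pooled, cIoU = 9/10 = 0.9.
```

In this toy case the error on the small object lowers gIoU much more than cIoU, which illustrates why the two numbers can diverge on datasets that mix small and large targets.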