LISA: Reasoning Segmentation via Large Language Model

📝 Paper Summary

Reasoning Segmentation, Multimodal Large Language Models (MLLMs), Referring Segmentation

LISA enables multimodal LLMs to perform segmentation tasks requiring complex reasoning by mapping a special `<SEG>` token's embedding directly to a binary mask via an end-to-end trained decoder.

Core Problem

Existing perception systems rely on explicit instructions (e.g., 'segment the dog') and fail to interpret implicit user intentions or complex reasoning (e.g., 'segment the food high in Vitamin C').

Why it matters:

Users in real-world scenarios (like robotics) prefer giving natural, implicit commands rather than step-by-step explicit instructions
Current multimodal LLMs can reason about text but cannot output fine-grained visual masks, limiting their utility in vision-centric tasks
Two-stage approaches, in which an LLM generates text tags for a separate segmentation model, often fail because information is lost between the reasoning stage and the segmentation stage

Concrete Example: Given an image of a kitchen and the question 'Where can I throw away the rest of the food?', standard segmentation models fail because they cannot tell which object matches that intent. LISA reasons that the target is a trash can and segments it directly.

Key Novelty

Embedding-as-Mask Paradigm

Expands the LLM vocabulary with a `<SEG>` token; when the LLM generates this token, its hidden embedding is extracted
This embedding acts as a dynamic instruction for a mask decoding module, bridging the gap between text generation and pixel-level segmentation
Integrates reasoning capabilities of MLLMs with segmentation via end-to-end training, rather than using the LLM just to predict class names
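
A minimal sketch of the extraction step described above, assuming the LLM exposes one last-layer hidden-state vector per token position. The token id `32000`, the function name `extract_seg_embeddings`, and the toy values are hypothetical and not taken from the LISA code. Whether a real implementation reads the state at the `<SEG>` position or one step earlier is an implementation detail not covered here.

```python
from typing import List, Sequence

SEG_TOKEN_ID = 32000  # hypothetical id assigned to the newly added <SEG> token


def extract_seg_embeddings(
    token_ids: Sequence[int],
    hidden_states: Sequence[Sequence[float]],
    seg_token_id: int = SEG_TOKEN_ID,
) -> List[List[float]]:
    """Return the hidden state at every position that holds the <SEG> token."""
    if len(token_ids) != len(hidden_states):
        raise ValueError("token_ids and hidden_states must have equal length")
    return [
        list(hidden_states[pos])
        for pos, tok in enumerate(token_ids)
        if tok == seg_token_id
    ]


# Toy output sequence with 4-dimensional hidden states (made-up values).
token_ids = [306, 338, SEG_TOKEN_ID, 869]
hidden_states = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.1, 0.0, 0.2],
    [0.9, -0.3, 0.7, 0.0],  # hidden state aligned with <SEG>
    [0.0, 0.0, 0.1, 0.1],
]
seg_embeddings = extract_seg_embeddings(token_ids, hidden_states)
```

If the response contains no `<SEG>` token, the function returns an empty list, which corresponds to a text-only answer with no mask.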

Architecture

The LISA pipeline: the multimodal LLM processes the image and text and outputs text containing a `<SEG>` token. The `<SEG>` embedding is projected and fed to the decoder, together with the visual features, to produce a mask.

Evaluation Highlights

LISA-13B (fine-tuned) achieves 63.2 gIoU on the new ReasonSeg benchmark, outperforming the generalist segmentation model SEEM (25.6 gIoU) by over 37 points
LISA-7B (fine-tuned) outperforms the two-stage pipeline 'LLaVA1.5 + OVSeg' (49.2 vs 39.7 gIoU) on overall reasoning segmentation, supporting the value of end-to-end training
Achieves strong zero-shot performance (36.6 gIoU with LISA-7B) on reasoning tasks despite being trained only on vanilla semantic/referring segmentation data

Breakthrough Assessment

9/10

Establishes a new task (Reasoning Segmentation) and a simple yet highly effective paradigm (embedding-as-mask) that unlocks pixel-level output for LLMs. The performance gap over baselines is large (over 37 gIoU points against SEEM for the fine-tuned LISA-13B).

⚙️ Technical Details

Problem Definition

Setting: Generate a binary segmentation mask M given an image x_img and an implicit query text x_txt involving complex reasoning

Inputs: Image x_img and implicit text instruction x_txt (e.g., 'the food that tastes not spicy')

Outputs: Binary segmentation mask M identifying the target object(s)

Pipeline Flow

Visual Encoding: Image -> Vision Backbone -> Visual Features
Multimodal Reasoning: Image + Text -> MLLM -> Text Output + <SEG> Token Embedding
Mask Decoding: <SEG> Embedding + Visual Features -> Decoder -> Binary Mask
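
The mask-decoding stage of the flow above can be illustrated with a toy example. This is a heavily simplified sketch, not the SAM decoder: it assumes a single linear layer in place of the projector and a per-pixel dot product in place of the decoder. All names, shapes, and values are hypothetical.

```python
from typing import List

Vector = List[float]


def project(embedding: Vector, weight: List[Vector]) -> Vector:
    """Linear projection standing in for the MLP projector (gamma)."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weight]


def decode_mask(prompt: Vector, features: List[List[Vector]]) -> List[List[int]]:
    """Toy decoder: a pixel is foreground when the dot product between its
    feature vector and the projected <SEG> embedding is positive."""
    return [
        [1 if sum(p * f for p, f in zip(prompt, pixel)) > 0 else 0 for pixel in row]
        for row in features
    ]


# Hypothetical 3-d <SEG> embedding, projected to the 2-d visual feature space.
seg_embedding = [1.0, 0.0, -1.0]
projector_weight = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # shape (2, 3)
prompt = project(seg_embedding, projector_weight)

# 2x2 grid of per-pixel features, standing in for the vision backbone output.
features = [
    [[2.0, 0.5], [0.1, 1.0]],
    [[-1.0, 0.0], [0.3, -0.2]],
]
mask = decode_mask(prompt, features)
```

The point of the sketch is the data flow: the same visual features yield a different mask whenever the `<SEG>` embedding changes, which is how the embedding acts as an instruction to the decoder.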

System Modules

Vision Backbone (F_enc)

Extract dense visual features from the input image

Model or implementation: SAM ViT-H (frozen)

Multimodal LLM (F)

Understand user instruction and generate text response containing the <SEG> token

Model or implementation: LLaVA-7B or LLaVA-13B (with LoRA)

Projector (gamma) (Mask Decoding)

Project the LLM embedding to the dimension required by the decoder

Model or implementation: MLP with channels [256, 4096, 4096]

Decoder (F_dec) (Mask Decoding)

Generate the final binary segmentation mask

Model or implementation: SAM Decoder architecture

Novel Architectural Elements

Embedding-as-mask: Directly connecting the LLM's specific token embedding to a visual decoder to trigger segmentation

Modeling

Base Model: LLaVA-7B-v1-1 or LLaVA-13B-v1-1 (and v1.5 variants)

Training Method: End-to-end instruction tuning with LoRA on LLM and full tuning on decoder

Objective Functions:

Purpose: Ensure correct text generation.

Formally: Auto-regressive cross-entropy loss L_txt

Purpose: Ensure high-quality mask generation.

Formally: L_mask = lambda_bce * BCE(M_hat, M) + lambda_dice * DICE(M_hat, M)
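
A minimal sketch of the mask objective above, written for flattened masks of per-pixel probabilities. The soft DICE formulation and the `eps` smoothing are assumptions, and combining the text and mask terms as a weighted sum is inferred from the lambda_txt and lambda_mask hyperparameters rather than stated in this summary. Default weights mirror the listed hyperparameter values.

```python
import math
from typing import Sequence


def bce_loss(probs: Sequence[float], targets: Sequence[int], eps: float = 1e-7) -> float:
    """Mean per-pixel binary cross-entropy between probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def dice_loss(probs: Sequence[float], targets: Sequence[int], eps: float = 1e-7) -> float:
    """Soft DICE loss: 1 - 2 * sum(p * t) / (sum(p) + sum(t))."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def mask_loss(probs, targets, lambda_bce: float = 2.0, lambda_dice: float = 0.5) -> float:
    """L_mask = lambda_bce * BCE + lambda_dice * DICE."""
    return lambda_bce * bce_loss(probs, targets) + lambda_dice * dice_loss(probs, targets)


def total_loss(l_txt: float, l_mask: float,
               lambda_txt: float = 1.0, lambda_mask: float = 1.0) -> float:
    """Assumed weighted sum of the text and mask objectives."""
    return lambda_txt * l_txt + lambda_mask * l_mask


target = [1, 0, 1, 0]
good = mask_loss([0.9, 0.2, 0.8, 0.1], target)  # prediction close to the target
bad = mask_loss([0.1, 0.8, 0.2, 0.9], target)   # prediction close to the inverse
```

BCE penalizes each pixel independently, while DICE measures overlap of the whole region, so the two terms complement each other when the target occupies only a small part of the image.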

Adaptation: LoRA (Low-Rank Adaptation) for LLM; Full fine-tuning for Decoder and Projector
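
As a rough illustration of the LoRA adaptation mentioned above, the sketch below applies a frozen weight matrix plus a scaled low-rank update. The rank, the scaling factor `alpha`, and the matrix values are hypothetical; this summary does not give the LoRA settings used for LISA.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(x: Vector, w_frozen: Matrix, lora_a: Matrix, lora_b: Matrix,
                 alpha: float, rank: int) -> Vector:
    """y = W x + (alpha / rank) * B (A x); only A and B would receive gradients."""
    base = matvec(w_frozen, x)
    update = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


w_frozen = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (2x2)
lora_a = [[1.0, 1.0]]                # A: rank x in_features  (1x2)
x = [1.0, 2.0]

# With B initialized to zero, the adapted layer reproduces the frozen layer.
y_init = lora_forward(x, w_frozen, lora_a, [[0.0], [0.0]], alpha=2.0, rank=1)

# After training, B is nonzero and adds a low-rank correction.
y_tuned = lora_forward(x, w_frozen, lora_a, [[1.0], [2.0]], alpha=2.0, rank=1)
```

Only `lora_a` and `lora_b` are trainable in this scheme, which keeps the number of updated LLM parameters small while the decoder and projector are tuned in full.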

Training Data:

Semantic Segmentation (ADE20K, COCO-Stuff, PACO-LVIS, PartImageNet, PASCAL-Part) formatted as QA
Referring Segmentation (refCLEF, refCOCO, refCOCO+, refCOCOg) formatted as QA
Visual Question Answering (LLaVA-Instruct-150k or mix665k)
Reasoning Segmentation (ReasonSeg) - 239 samples for fine-tuning
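
To make "formatted as QA" concrete, the sketch below wraps a class name or referring expression in a question/answer pair whose answer contains `<SEG>`. The template wording and the helper `to_qa_pair` are assumptions for illustration; the exact templates used for training may differ.

```python
from typing import Dict

# Hypothetical templates; the wording is an assumption made for illustration.
QUESTION_TEMPLATE = "Can you segment the {target} in this image?"
ANSWER_TEMPLATE = "It is <SEG>."


def to_qa_pair(target: str) -> Dict[str, str]:
    """Wrap a class name or referring expression in a question/answer pair."""
    return {
        "question": QUESTION_TEMPLATE.format(target=target),
        "answer": ANSWER_TEMPLATE,
    }


semantic_sample = to_qa_pair("trash can")               # semantic: a class name
referring_sample = to_qa_pair("man in the blue shirt")  # referring: a description
```

Each pair would be stored alongside its ground-truth mask, so that the text loss supervises the answer and the mask loss supervises the decoder output for the same sample.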

Key Hyperparameters:

learning_rate: 0.0003
batch_size: 2 (per device, gradient accumulation 10)
optimizer: AdamW
weight_decay: 0
warmup_iterations: 100
lambda_txt: 1.0
lambda_mask: 1.0
lambda_bce: 2.0
lambda_dice: 0.5

Compute: Less than 3 days on 8 NVIDIA 24G 3090 GPUs (for LISA-7B)
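
The per-device batch size, gradient accumulation, and GPU count above imply an effective batch size, worked out below. The calculation assumes plain data parallelism with one worker per GPU, which is not stated in this summary.

```python
per_device_batch_size = 2
gradient_accumulation_steps = 10
num_gpus = 8  # assumption: one data-parallel worker per GPU

# Samples contributing to each optimizer step.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
```

Under that assumption, each optimizer step aggregates gradients from 160 samples.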

Comparison to Prior Work

vs. OVSeg/GRES: LISA handles implicit reasoning (e.g., 'food with vitamin C') which these explicit-instruction models cannot
vs. VisionLLM: LISA uses embedding-as-mask (dense) rather than polygon sequences (text-based), allowing faster convergence and easier optimization
vs. Two-stage (LLM + OVSeg): LISA is end-to-end, preventing information loss between the reasoning step and the segmentation step

Limitations

Performance on long-query scenarios improves with the larger LLM (13B vs 7B), suggesting that text understanding is the bottleneck
Requires fine-tuning on reasoning data (239 samples) to reach peak performance; zero-shot performance is lower (36.6 gIoU with LISA-7B)
Fine-tuned SAM backbone performs worse than frozen SAM, indicating difficulty in preserving generalization during adaptation

Reproducibility

Code: https://github.com/dvlab-research/LISA

Code, models, and ReasonSeg benchmark data are publicly available at github.com/dvlab-research/LISA. Training relies on public datasets (ADE20K, COCO, etc.) but requires formatting them into QA pairs (templates provided in paper).

📊 Experiments & Results

Evaluation Setup

Evaluate on the proposed ReasonSeg benchmark (reasoning tasks) and standard RefCOCO benchmarks (referring tasks)

Benchmarks:

ReasonSeg (Reasoning Segmentation (implicit query)) [New]
refCOCO / refCOCO+ / refCOCOg (Referring Segmentation (explicit query))

Metrics:

gIoU (Generalized IoU)
cIoU (Cumulative IoU)

Statistical methodology: Not explicitly reported in the paper
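
The difference between the two metrics listed above can be sketched on flattened binary masks. This is an illustrative implementation, not the benchmark's evaluation script; treating an empty prediction against an empty target as IoU 1.0 is an assumption.

```python
from typing import Sequence, Tuple

Mask = Sequence[int]  # flattened binary mask


def intersection_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Pixel counts of the intersection and union of two binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter, union


def giou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """gIoU: mean of the per-image IoU values."""
    ious = []
    for pred, target in zip(preds, targets):
        inter, union = intersection_union(pred, target)
        ious.append(inter / union if union else 1.0)  # empty vs empty: assumed 1.0
    return sum(ious) / len(ious)


def ciou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """cIoU: total intersection over total union, accumulated over all images."""
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_union(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


preds = [[1, 0, 0, 0], [1, 1, 1, 1]]    # image 1: small object, image 2: large object
targets = [[1, 1, 0, 0], [1, 1, 1, 1]]
g = giou(preds, targets)  # (0.5 + 1.0) / 2
c = ciou(preds, targets)  # (1 + 4) / (2 + 4)
```

In this toy case the large, perfectly segmented object pulls cIoU above gIoU, because cIoU weights each image by its pixel counts while gIoU weights every image equally.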

Key Results

Reasoning Segmentation results on ReasonSeg benchmark showing LISA's dominance over traditional and generalist baselines.

Benchmark	Metric	Baseline	This Paper	Δ
ReasonSeg	gIoU	25.6	49.4	+23.8
ReasonSeg	gIoU	28.7	49.4	+20.7
ReasonSeg	gIoU	46.1	63.2	+17.1
ReasonSeg	gIoU	25.6	36.6	+11.0

Referring Segmentation results showing LISA is competitive on standard explicit tasks.

Benchmark	Metric	Baseline	This Paper	Δ
refCOCOg val(U)	cIoU	65.7	67.4	+1.7

Experiment Figures

Qualitative comparison between LISA and baselines (OVSeg, GRES, X-Decoder, SEEM) on complex queries.

Main Takeaways

LISA effectively handles complex reasoning queries on which traditional referring segmentation models perform poorly.
End-to-end training (Embedding-as-Mask) significantly outperforms two-stage approaches (LLM -> Text -> Segmentor) by preserving dense information.
Fine-tuning on a very small set of reasoning data (239 samples) yields large improvements (more than 10 gIoU points), suggesting the capability is easily unlocked.
Larger language models (13B vs 7B) provide substantial gains in long-query reasoning scenarios.

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (architecture and instruction tuning)
Semantic and Referring Segmentation
Parameter-efficient fine-tuning (LoRA)

Key Terms

Reasoning Segmentation: A proposed task where the model must generate a segmentation mask based on an implicit query requiring complex reasoning or world knowledge, rather than an explicit object description

Referring Segmentation: A standard task where the model segments an object based on an explicit text description (e.g., 'the man in the blue shirt')

<SEG> token: A special token added to the LLM vocabulary; its hidden state embedding is used to condition the mask decoder

LoRA: Low-Rank Adaptation—a technique to fine-tune large models by updating only a small set of low-rank matrices while freezing the main weights

gIoU: Generalized Intersection over Union—a metric for segmentation accuracy that averages IoU per image

cIoU: Cumulative Intersection over Union—a metric calculating intersection over union across the entire dataset cumulatively

SAM: Segment Anything Model—a foundation model for image segmentation used here as the vision backbone

LLaVA: Large Language and Vision Assistant—a multimodal LLM that connects a vision encoder to a language model (Vicuna/Llama)

Embedding-as-mask: The proposed method of using the LLM's hidden embedding of a specific token to generate a segmentation mask