Personalize Segment Anything Model with One Shot

📝 Paper Summary

Personalized Segmentation, One-shot Learning, Foundation Model Adaptation

PerSAM customizes the Segment Anything Model (SAM) for specific visual concepts using a single reference image-mask pair via training-free attention guidance and efficient 2-parameter fine-tuning.

Core Problem

The vanilla Segment Anything Model (SAM) is a generalist that requires manual prompting for every image and lacks the ability to automatically segment a specific personal concept (e.g., 'my pet dog') across different contexts.

Why it matters:

Manually prompting SAM for every image in a large collection is labor-intensive and time-consuming
Generalist models often fail to distinguish specific instances from visually similar objects or backgrounds without instance-specific semantic cues
Subject-driven generation (like DreamBooth) suffers when background visual information leaks into the learned subject representation

Concrete Example: For a specific 'teapot' with a lid and a body, standard SAM is ambiguous: it may segment just the lid or the whole pot. PerSAM-F learns from a single reference how to weight SAM's mask outputs at different scales, whereas vanilla SAM needs a manual choice for each new image.

Key Novelty

Training-free Attention Guidance & Scale-Aware Fine-Tuning

Training-free PerSAM: Uses a location confidence map derived from the reference image to guide SAM's internal cross-attention, focusing the model on the target object's features
PerSAM-F (Fine-tuning): Freezes the entire SAM and tunes only 2 parameters (weights for multi-scale masks) to resolve ambiguity between object parts and wholes, taking just 10 seconds
Target-Semantic Prompting: Injects high-level visual embeddings of the target object directly into prompt tokens to supplement SAM's low-level positional cues

Architecture

The pipeline has two stages: (1) compute a location prior (confidence map) for the test image from the reference object's features, and (2) inject target semantics into SAM's decoder via attention guidance and prompting.

Evaluation Highlights

PerSAM-F achieves 95.3% mIoU on the new PerSeg dataset, outperforming the large-scale specialist SegGPT (94.3%) while using only 2 learnable parameters
Training-free PerSAM improves over in-context learner Painter by +32.3% J&F score on video object segmentation (DAVIS 2017)
PerSAM-F fine-tuning takes only 10 seconds on a single A100 GPU, compared to hours for full model tuning

Breakthrough Assessment

8/10

Highly efficient adaptation of a foundation model. The 2-parameter fine-tuning is extremely lightweight yet effective, and the application to improving DreamBooth adds practical value.

⚙️ Technical Details

Problem Definition

Setting: One-shot personalized object segmentation

Inputs: A single reference image I_R with a reference mask M_R, and a target test image I

Outputs: A binary segmentation mask M for the target object in the test image I

Pipeline Flow

Feature Extraction (Image Encoder)
Location Prior Calculation (Confidence Map)
Prompt Generation (Positive/Negative Points + Semantic Embedding)
Decoder Inference (Target-guided Attention)
Post-Refinement (Cascaded Decoding)
Scale-Aware Weighting (PerSAM-F only)
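
The first three steps can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes the target embedding is the mean of the reference features under the mask, and that the positive and negative points are simply the most and least similar locations. The function names and the toy feature values are hypothetical.

```python
import numpy as np

def target_embedding(ref_feats, ref_mask):
    """Average reference features (H, W, C) over the masked foreground pixels (assumption)."""
    fg = ref_mask.astype(bool)
    return ref_feats[fg].mean(axis=0)  # shape (C,)

def location_confidence_map(target_emb, test_feats):
    """Cosine similarity between the target embedding (C,) and each test feature (H, W, C)."""
    t = target_emb / (np.linalg.norm(target_emb) + 1e-8)
    f = test_feats / (np.linalg.norm(test_feats, axis=-1, keepdims=True) + 1e-8)
    return f @ t  # shape (H, W), values in [-1, 1]

def point_priors(conf_map):
    """Assumed rule: highest-similarity pixel is the positive point, lowest the negative."""
    pos = np.unravel_index(np.argmax(conf_map), conf_map.shape)
    neg = np.unravel_index(np.argmin(conf_map), conf_map.shape)
    return tuple(int(i) for i in pos), tuple(int(i) for i in neg)

# Toy reference: one foreground pixel with feature [2, 0, 0].
ref_feats = np.array([[[2.0, 0.0, 0.0], [0.0, 5.0, 0.0]]])  # (1, 2, 3)
ref_mask = np.array([[1, 0]])
target = target_embedding(ref_feats, ref_mask)

# Toy test image features, shape (2, 2, 3).
test_feats = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[-1.0, 0.0, 0.0], [1.0, 1.0, 0.0]],
])
conf = location_confidence_map(target, test_feats)
pos, neg = point_priors(conf)
print(pos, neg)  # (0, 0) (1, 0)
```

In a real pipeline the features would come from SAM's frozen image encoder, and the two points would be passed to the prompt encoder.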

System Modules

Image Encoder

Extract visual features from both reference and test images

Model or implementation: ViT-H (SAM's image encoder, frozen)

Prompt Encoder

Encode positive/negative point priors derived from the confidence map

Model or implementation: SAM's prompt encoder

Mask Decoder

Generate segmentation masks using target-guided attention and semantic prompting

Model or implementation: SAM's mask decoder (modified attention)

Scale-Aware Aggregator

Combine multi-scale masks using learned weights

Model or implementation: Linear weighting layer (2 parameters)

Novel Architectural Elements

Injection of target visual embedding directly into decoder input tokens (Target-semantic Prompting)
Modulation of cross-attention matrices in the decoder using an externally computed location confidence map
Scale-aware output head that aggregates SAM's 3-scale outputs via learnable weights
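
As a rough illustration of the first two elements, the sketch below adds a target embedding to every prompt token and biases token-to-image attention logits with the confidence map. The exact form used here (element-wise addition, and a softmax-normalized map scaled by a hypothetical factor `alpha` added before the attention softmax) is an assumption for illustration, not a confirmed description of the modified decoder.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def add_target_semantics(prompt_tokens, target_emb):
    """prompt_tokens: (num_tokens, C). Adds the same target embedding to each token (assumption)."""
    return prompt_tokens + target_emb[None, :]

def target_guided_attention(logits, conf_map, alpha=1.0):
    """logits: (num_tokens, H*W) token-to-image scores; conf_map: (H, W).

    The confidence map is normalized and added as a bias, so pixels that
    resemble the target receive more attention from every token.
    """
    bias = softmax(conf_map.reshape(-1))            # (H*W,), sums to 1
    return softmax(logits + alpha * bias, axis=-1)  # bias broadcasts over token rows

# Toy case: uniform logits, confidence concentrated on the first pixel.
logits = np.zeros((2, 4))
conf_map = np.array([[5.0, 0.0], [0.0, 0.0]])
plain = target_guided_attention(logits, conf_map, alpha=0.0)   # uniform attention
guided = target_guided_attention(logits, conf_map, alpha=1.0)  # shifted toward pixel 0
print(plain[0], guided[0])
```

Setting `alpha` to 0 recovers the unmodified attention, which makes the guidance easy to ablate.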

Modeling

Base Model: Segment Anything Model (SAM) with ViT-H backbone

Training Method: One-shot fine-tuning on the reference image-mask pair (PerSAM-F only; PerSAM itself is training-free)

Objective Functions:

Purpose: Minimize the difference between the predicted mask and the reference mask.

Formally: Dice loss plus focal loss between the predicted and reference masks. This choice is implied by SAM's own training recipe; in effect, the optimization fits the single one-shot example.
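
For reference, a minimal NumPy sketch of the two loss terms follows. The smoothing constant `eps` and the focal-loss settings `alpha=0.25`, `gamma=2.0` are common defaults chosen for illustration, not values taken from the paper, and how the two terms are weighted against each other is left open.

```python
import numpy as np

def dice_loss(prob, target, eps=1.0):
    """1 - Dice coefficient between predicted probabilities and a binary mask."""
    inter = np.sum(prob * target)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(prob) + np.sum(target) + eps))

def focal_loss(prob, target, alpha=0.25, gamma=2.0):
    """Binary focal loss: cross-entropy down-weighted on easy, well-classified pixels."""
    p_t = np.where(target == 1, prob, 1.0 - prob)   # probability of the true class
    a_t = np.where(target == 1, alpha, 1.0 - alpha)  # class-balancing weight
    p_t = np.clip(p_t, 1e-7, 1.0)                    # avoid log(0)
    return float(np.mean(-a_t * (1.0 - p_t) ** gamma * np.log(p_t)))

reference_mask = np.array([[1.0, 1.0], [0.0, 0.0]])
good_pred = np.array([[0.9, 0.8], [0.1, 0.2]])
bad_pred = np.array([[0.2, 0.1], [0.9, 0.8]])

for name, pred in [("good", good_pred), ("bad", bad_pred)]:
    print(name, dice_loss(pred, reference_mask) + focal_loss(pred, reference_mask))
```

Both terms are zero for a perfect prediction and grow as the predicted mask drifts from the reference mask.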

Adaptation: Scale-aware fine-tuning (tuning only 2 scalar weights w1, w2; the third scale's weight is 1 - w1 - w2, so the three weights sum to 1)

Trainable Parameters: 2 parameters

Training Data:

One-shot reference image and mask provided by user

Key Hyperparameters:

training_time: 10 seconds
gpu_type: NVIDIA A100
initial_weights: w1=w2=1/3

Compute: 10 seconds on a single A100 GPU for PerSAM-F fine-tuning
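
The toy sketch below shows why two parameters are enough to pick between three mask scales. It assumes the third scale receives the remaining weight 1 - w1 - w2 (consistent with the 1/3 initialization), and it swaps the Dice and focal losses for a plain squared error with hand-written gradients to stay short. The masks and function names are made up for illustration.

```python
import numpy as np

def combine_scales(masks, w1, w2):
    """masks: (3, H, W) mask outputs at three scales; returns their weighted sum."""
    return w1 * masks[0] + w2 * masks[1] + (1.0 - w1 - w2) * masks[2]

def fit_scale_weights(masks, target, steps=200, lr=0.5):
    """Gradient descent on mean squared error over the two free weights only."""
    w1 = w2 = 1.0 / 3.0                 # initialization stated in the summary
    d1 = masks[0] - masks[2]            # d(combined) / d(w1)
    d2 = masks[1] - masks[2]            # d(combined) / d(w2)
    for _ in range(steps):
        err = combine_scales(masks, w1, w2) - target
        w1 -= lr * 2.0 * np.mean(err * d1)
        w2 -= lr * 2.0 * np.mean(err * d2)
    return float(w1), float(w2)

# Toy scales: a part (lid), the whole object, and an over-segmented region.
masks = np.array([
    [[1.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
])
reference = masks[1]                    # the user's reference mask covers the whole object
w1, w2 = fit_scale_weights(masks, reference)
print(w1, w2, 1.0 - w1 - w2)            # weight concentrates on the second scale
```

Because SAM stays frozen, each step only touches two scalars, which is in line with the reported 10-second fine-tuning time.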

Comparison to Prior Work

vs. SegGPT: PerSAM-F is a specialized lightweight adaptation (2 params) vs. SegGPT's generalist in-context approach; PerSAM-F achieves higher mIoU on PerSeg
vs. Painter: PerSAM uses explicit attention guidance and semantic priors from SAM, whereas Painter relies on image-to-image translation
vs. Matcher [cited in tables]: PerSAM-F outperforms Matcher on LVIS-92i and PASCAL-Part benchmarks

Limitations

PerSAM relies on SAM's pre-trained features; if SAM fails to capture the object features, PerSAM will likely fail
Requires at least one valid reference mask or box; cannot work fully unsupervised
Scale ambiguity can still persist if the target object has very complex hierarchical structures not resolvable by simple linear weighting

Reproducibility

Code: https://github.com/ZrrSkywalker/Personalize-SAM

Code and PerSeg dataset are publicly available. The method relies on the pre-trained SAM checkpoint (ViT-H). Fine-tuning is extremely fast (10s), aiding reproducibility.

📊 Experiments & Results

Evaluation Setup

One-shot personalized segmentation on images and videos

Benchmarks:

PerSeg (Personalized Object Segmentation) [New]
DAVIS 2017 val (Video Object Segmentation)
FSS-1000 (One-shot Semantic Segmentation)
LVIS-92i (One-shot Semantic Segmentation)
PASCAL-Part (One-shot Part Segmentation)

Metrics:

mIoU (mean Intersection over Union)
bIoU (boundary IoU)
J&F score (Jaccard and F-measure for video)
Statistical methodology: Not explicitly reported in the paper
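
As a reminder of how the main metric is computed, here is a minimal mIoU sketch on binary masks. Treating two empty masks as a perfect match and averaging uniformly over image pairs are assumed conventions; benchmark toolkits may average per object or per class instead.

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(np.logical_and(pred, gt).sum() / union)

def mean_iou(pairs):
    """Average IoU over (prediction, ground truth) pairs."""
    return float(np.mean([iou(p, g) for p, g in pairs]))

pred_a = np.array([[1, 1], [0, 0]])
gt_a = np.array([[1, 0], [0, 0]])    # IoU = 1 / 2
pred_b = np.array([[0, 1], [0, 1]])
gt_b = np.array([[0, 1], [0, 1]])    # IoU = 1
print(mean_iou([(pred_a, gt_a), (pred_b, gt_b)]))  # 0.75
```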

Key Results

Main results on the newly collected PerSeg dataset show PerSAM-F outperforming generalist models.

Benchmark	Metric	Baseline	This Paper	Δ
PerSeg	mIoU	94.3	95.3	+1.0
PerSeg	mIoU	56.4	89.3	+32.9
PerSeg	bIoU	76.5	77.9	+1.4

Video Object Segmentation results on DAVIS 2017 validation set.

Benchmark	Metric	Baseline	This Paper	Δ
DAVIS 2017 val	J&F	75.6	76.1	+0.5

Ablation study on PerSeg demonstrating the contribution of each component.

Benchmark	Metric	Baseline	This Paper	Δ
PerSeg	mIoU	69.1	95.3	+26.2
PerSeg	mIoU	89.3	95.3	+6.0

Experiment Figures

Qualitative comparison of PerSAM vs. PerSAM-F on ambiguous objects.

PerSAM-assisted DreamBooth results.

Main Takeaways

PerSAM-F resolves segmentation scale ambiguity (e.g., segmenting a whole object vs. a part) by learning just 2 parameters, reaching 95.3% mIoU on PerSeg.
Training-free PerSAM already outperforms existing in-context vision models (Painter, SEEM) on personalized tasks without any gradient updates.
The approach generalizes well to video object segmentation (DAVIS 2017) and part segmentation (PASCAL-Part), showing versatility beyond simple image objects.
PerSAM can be used to improve DreamBooth by masking out background noise in training images, leading to better subject fidelity and text-prompt compliance.
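
One way to picture the DreamBooth use case is a reconstruction loss that ignores background pixels. The sketch below is only an illustration of that idea under stated assumptions: `masked_mse` is a hypothetical helper, a plain squared error stands in for the diffusion training objective, and the mask is assumed to be a PerSAM foreground mask for the training image.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Mean squared error counted only on foreground pixels (mask == 1)."""
    mask = mask.astype(float)
    denom = max(float(mask.sum()), 1.0)  # guard against an empty mask
    return float(np.sum(mask * (pred - target) ** 2) / denom)

pred = np.zeros((2, 2))
target = np.array([[0.0, 0.0], [0.0, 4.0]])   # the only error sits at pixel (1, 1)
foreground = np.array([[1, 1], [1, 0]])       # pixel (1, 1) is background

print(masked_mse(pred, target, foreground))        # 0.0: background error is ignored
print(masked_mse(pred, target, np.ones((2, 2))))   # 4.0: ordinary MSE over all pixels
```

With the mask applied, background mismatches no longer contribute to the loss, so they cannot leak into the learned subject representation.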

📚 Prerequisite Knowledge

Prerequisites

Segment Anything Model (SAM) architecture
Cross-attention mechanisms
One-shot learning
Parameter-efficient fine-tuning (PEFT)

Key Terms

SAM: Segment Anything Model—a foundation model for image segmentation capable of zero-shot transfer via prompting

mIoU: Mean Intersection over Union—a standard metric for evaluating segmentation accuracy

DreamBooth: A method for personalizing text-to-image diffusion models to generate specific subjects from a few images

Location Confidence Map: A heatmap generated by calculating cosine similarity between the reference object's features and the test image features

Target-guided Attention: A mechanism that biases the attention scores in SAM's decoder using the location confidence map to focus on the target object

Scale-aware Fine-tuning: A method in PerSAM-F that learns weights to linearly combine SAM's multi-scale mask outputs, resolving ambiguity between parts and wholes