Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, D. Cohen-Or
Snap Inc.,
Tel Aviv University
European Conference on Computer Vision
(2024)
MM, P13N, QA, Benchmark
📝 Paper Summary
Personalization of Vision-Language Models, Concept Learning
MyVLM enables frozen vision-language models to recognize and contextualize user-specific concepts by using external detection heads to trigger the injection of learned concept embeddings.
Core Problem
Current VLMs possess generic knowledge (recognizing "a dog") but lack understanding of user-specific concepts (recognizing "your dog"), and fine-tuning them is expensive and prone to forgetting.
Why it matters:
Users want meaningful interactions reflecting personal experiences (e.g., asking what 'I' am doing in a photo), not just generic descriptions
Full fine-tuning of large VLMs for every user is computationally prohibitive and degrades general performance (catastrophic forgetting)
Existing model editing techniques focus on factual edits (e.g., changing capitals) rather than visual concept recognition and contextualization
Concrete Example: A standard VLM sees an image of a specific person and outputs 'A man sitting on a bench.' MyVLM recognizes the user and outputs 'S* is sitting on a bench,' which makes follow-up questions such as 'What is S* wearing?' possible.
Key Novelty
Augmenting Frozen VLMs with Concept Heads and Embeddings
Uses external 'concept heads' (classifiers) as toggles to detect if a specific user-concept is present in the image
If detected, a learned 'concept embedding' is injected into the VLM's intermediate feature space to guide the language generation
Keeps the massive VLM backbone completely frozen, ensuring general capabilities are preserved while adding personalization
Architecture
Figure: the MyVLM pipeline integrated with BLIP-2/LLaVA, showing the flow from image input through concept detection to embedding injection.
Breakthrough Assessment
7/10
A clever architectural solution effectively separating recognition (heads) from contextualization (embeddings) without fine-tuning the backbone. Addresses a high-value user application (personalization) efficiently.
⚙️ Technical Details
Problem Definition
Setting: Few-shot personalization of a pretrained VLM given ~3-5 images of a target concept
Inputs: Input image I and a set of concept-specific reference images/captions
Outputs: Personalized caption, or an answer to a query, about the specific concept in image I
Pipeline Flow
Input Image -> Vision Encoder (Frozen)
Concept Heads -> Detect Presence of Target Concept
If Detected -> Append Concept Embedding to Visual Features
Visual Features + Concept Embedding -> VLM Bridge (Q-Former/Linear)
LLM -> Generate Personalized Text
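The toggle-and-append step in the flow above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `personalize_features`, the use of plain Python lists for feature vectors, and the toy two-dimensional tokens are all assumptions made for readability.

```python
from typing import List, Sequence

Vector = List[float]


def personalize_features(
    visual_tokens: Sequence[Vector],
    concept_detected: bool,
    concept_embedding: Vector,
) -> List[Vector]:
    """Return the visual token sequence, with the concept embedding
    appended only when the concept head reported a detection.

    Hypothetical helper: when nothing is detected the tokens pass
    through unchanged, which is how the frozen model's behavior on
    unrelated images is preserved.
    """
    tokens = [list(t) for t in visual_tokens]  # copy, never mutate the input
    if concept_detected:
        tokens.append(list(concept_embedding))
    return tokens


# Toy values (assumed): two visual tokens of dimension 2 and one embedding e*.
visual_tokens = [[0.1, 0.2], [0.3, 0.4]]
e_star = [0.9, -0.5]

with_concept = personalize_features(visual_tokens, True, e_star)
without_concept = personalize_features(visual_tokens, False, e_star)
```

In this sketch the sequence passed on to the bridge grows by exactly one token when the concept is detected and is otherwise identical to the original features.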
System Modules
Vision Encoder
Extract generic visual features from the input image
Model or implementation: ViT-L/14 (from CLIP or EVA-CLIP depending on VLM)
Concept Heads
Identify if the user-specific concept is present in the image to trigger personalization
Model or implementation: Linear classifier (on CLIP embeddings) or Face Recognition Network
Concept Embedding Injection
Insert the learned vector representing the concept into the visual token sequence
Model or implementation: Learnable Vector e*
Language Model
Generate text response contextualizing the concept
Model or implementation: LLM (e.g., OPT, Vicuna)
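To make the role of a linear concept head concrete, here is a small sketch of a binary linear classifier over an image embedding. The weights, bias, 0.5 threshold, and two-dimensional embeddings are illustrative assumptions; the summary only states that the head is a linear classifier on CLIP embeddings (or a face recognition network for people), not how it is parameterized or thresholded.

```python
import math
from typing import Sequence


def concept_head_score(
    embedding: Sequence[float],
    weights: Sequence[float],
    bias: float,
) -> float:
    """Sigmoid of a linear score: an estimated probability that the
    target concept is present. Hypothetical parameterization."""
    logit = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def concept_present(
    embedding: Sequence[float],
    weights: Sequence[float],
    bias: float,
    threshold: float = 0.5,  # assumed decision threshold
) -> bool:
    """Binary toggle that decides whether personalization is triggered."""
    return concept_head_score(embedding, weights, bias) >= threshold


# Toy parameters (assumed), standing in for a head trained on a few images.
weights = [2.0, -1.0]
bias = -0.5

detected = concept_present([1.0, 0.0], weights, bias)      # logit = 1.5
not_detected = concept_present([0.0, 1.0], weights, bias)  # logit = -1.5
```

Because the head sits outside the VLM, it can be trained and swapped per concept without touching the frozen backbone.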
Novel Architectural Elements
External Concept Heads functioning as 'toggles' to conditionally inject embeddings
Hybrid optimization pipeline where recognition (heads) is decoupled from communication (embeddings)
Modeling
Base Model: BLIP-2 (ViT-L/14 + Q-Former + OPT) or LLaVA (CLIP-ViT-L/14 + Vicuna)
Training Method: Direct optimization of a concept embedding vector (embedding inversion)
Objective Functions:
Purpose: Ensure the model generates the correct personalized caption.
Formally: Cross-entropy loss between generated caption and target caption containing identifier S*.
Purpose: Prevent the concept embedding from dominating attention and ignoring image features.
Formally: L2 regularization on the attention weights assigned to the concept embedding in the Q-Former.
Adaptation: Learning a single embedding vector (and concept heads); VLM weights remain frozen
Training Data:
3-5 images per concept
10 QA pairs for VQA optimization
Key Hyperparameters:
query_tokens: 32 (BLIP-2)
token_dimension: 768
training_images: 3-5
Compute: Not reported in the paper
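The two objective terms listed under Objective Functions can be combined as in the sketch below. It assumes a mean token-level cross-entropy, a sum-of-squares penalty on the attention weights directed at the concept embedding, and a scalar weight `reg_weight`; the exact reduction and the value of the weight are not given in this summary, so these are labeled assumptions rather than the paper's settings.

```python
import math
from typing import Sequence


def caption_cross_entropy(target_token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the target caption tokens,
    given the probability the model assigned to each one."""
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)


def attention_l2_penalty(attn_to_concept: Sequence[float]) -> float:
    """Sum of squared attention weights placed on the concept embedding."""
    return sum(a * a for a in attn_to_concept)


def total_loss(
    target_token_probs: Sequence[float],
    attn_to_concept: Sequence[float],
    reg_weight: float,  # assumed hyperparameter, not reported here
) -> float:
    """Captioning loss plus the attention regularizer."""
    return caption_cross_entropy(target_token_probs) + reg_weight * attention_l2_penalty(
        attn_to_concept
    )


# Toy numbers (assumed): two caption tokens, two queries attending to e*.
probs = [0.5, 0.25]
loss_high_attn = total_loss(probs, [0.6, 0.8], reg_weight=0.1)
loss_low_attn = total_loss(probs, [0.1, 0.2], reg_weight=0.1)
```

With the caption likelihood held fixed, the configuration that places more attention on the concept embedding incurs the larger loss, which is the pressure that keeps the embedding from crowding out the image features.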
Comparison to Prior Work
vs. Textual Inversion: MyVLM injects embeddings into the *visual* feature space (or Q-Former input) rather than the text encoder input, allowing better visual alignment
vs. DreamBooth: MyVLM freezes the base model to avoid catastrophic forgetting, whereas DreamBooth fine-tunes weights
vs. Model Editing: MyVLM focuses on visual recognition and contextualization of new concepts rather than correcting factual text queries
Project page is available. The paper states the object dataset 'will be publicly available'. Specific training times and GPU resources are not reported in the provided text.
📊 Experiments & Results
Evaluation Setup
Personalized Image Captioning and Visual Question Answering on user-provided concepts
Benchmarks:
New Object/Individual Dataset (Personalized Captioning/VQA) [New]
Metrics:
Metrics not explicitly defined in the provided text (likely text-image similarity or caption quality metrics)
Statistical methodology: Not explicitly reported in the paper
Experiment Figures
Visualization of attention maps in the LLaVA language model self-attention layers.
Main Takeaways
The method claims to generalize to unseen images of learned concepts while preserving model behavior on unrelated inputs (due to the concept head toggle mechanism).
The authors find that visual features of frozen VLMs are not expressive enough to distinguish specific user concepts alone, justifying the need for external concept heads.
Regularization of attention weights is critical; without it, the learned concept embedding dominates the Q-Former attention, causing the model to ignore the actual image content.
The approach is model-agnostic, demonstrated on both BLIP-2 and LLaVA architectures.
📚 Prerequisite Knowledge
Prerequisites
Understanding of Vision-Language Models (VLMs) like BLIP-2 or LLaVA
Familiarity with embedding inversion/optimization (like Textual Inversion/DreamBooth)
Basics of Transformer attention mechanisms (cross-attention)
Key Terms
VLM: Vision-Language Model, an AI model that processes both images and text to perform tasks like captioning or VQA
Concept Head: An external classification network (e.g., a linear layer on CLIP) trained to simply detect if the target concept exists in the image
Concept Embedding: A learnable vector representation of the user-specific object injected into the model's feature space to represent the concept
Q-Former: Querying Transformer, a component in BLIP-2 that bridges the gap between the frozen vision encoder and the frozen language model
Catastrophic Forgetting: The tendency of neural networks to lose previously learned knowledge when trained on new data
REC: Referring Expression Comprehension—locating a specific object in an image described by a text query (e.g., bounding box prediction)