MyVLM: Personalizing VLMs for User-Specific Queries

📝 Paper Summary

Personalization of Vision-Language Models, Concept Learning

MyVLM enables frozen vision-language models to recognize and contextualize user-specific concepts by using external detection heads to trigger the injection of learned concept embeddings.

Core Problem

Current VLMs possess generic knowledge (recognizing "a dog") but lack understanding of user-specific concepts (recognizing "your dog"), and fine-tuning them is expensive and prone to forgetting.

Why it matters:

Users want meaningful interactions reflecting personal experiences (e.g., asking what 'I' am doing in a photo), not just generic descriptions
Full fine-tuning of large VLMs for every user is computationally prohibitive and degrades general performance (catastrophic forgetting)
Existing model editing techniques focus on factual edits (e.g., changing capitals) rather than visual concept recognition and contextualization

Concrete Example: A standard VLM sees an image of a specific person and outputs 'A man sitting on a bench.' MyVLM recognizes the user and outputs 'S* is sitting on a bench,' enabling questions like 'What is S* wearing?'

Key Novelty

Augmenting Frozen VLMs with Concept Heads and Embeddings

Uses external 'concept heads' (classifiers) as toggles to detect if a specific user-concept is present in the image
If detected, a learned 'concept embedding' is injected into the VLM's intermediate feature space to guide the language generation
Keeps the massive VLM backbone completely frozen, ensuring general capabilities are preserved while adding personalization

Architecture

Figure: The MyVLM pipeline integrated with BLIP-2 and LLaVA, showing the flow from image input through concept detection to embedding injection.

Breakthrough Assessment

7/10

A clever architectural solution effectively separating recognition (heads) from contextualization (embeddings) without fine-tuning the backbone. Addresses a high-value user application (personalization) efficiently.

⚙️ Technical Details

Problem Definition

Setting: Few-shot personalization of a pretrained VLM given ~3-5 images of a target concept

Inputs: Input image I and a set of concept-specific reference images/captions

Outputs: A personalized caption, or an answer to a query, about the specific concept in image I

Pipeline Flow

Input Image -> Vision Encoder (Frozen)
Concept Heads -> Detect Presence of Target Concept
If Detected -> Append Concept Embedding to Visual Features
Visual Features + Concept Embedding -> VLM Bridge (Q-Former/Linear)
LLM -> Generate Personalized Text
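
The conditional step in this flow can be illustrated with a minimal sketch. It assumes the visual features are a list of token vectors and that the concept head returns a score in [0, 1]; the function name `personalize_features`, the threshold of 0.5, and the toy values are hypothetical and not taken from the paper.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def personalize_features(
    visual_features: Sequence[Vector],
    concept_head: Callable[[Sequence[Vector]], float],
    concept_embedding: Vector,
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> List[Vector]:
    """Append the concept embedding to the visual tokens only if the head fires."""
    features = [list(token) for token in visual_features]
    if concept_head(visual_features) >= threshold:
        features.append(list(concept_embedding))
    return features


# Toy stand-ins (hypothetical): 4 visual tokens of dimension 3.
tokens = [[0.1, 0.2, 0.3] for _ in range(4)]
e_star = [1.0, 0.0, -1.0]

# Constant lambdas stand in for a trained concept head.
detected = personalize_features(tokens, lambda f: 0.9, e_star)
not_detected = personalize_features(tokens, lambda f: 0.1, e_star)
print(len(detected), len(not_detected))  # prints: 5 4
```

When the head does not fire, the features pass through unchanged, which is why the frozen VLM behaves as before on unrelated images.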

System Modules

Vision Encoder

Extract generic visual features from the input image

Model or implementation: ViT-L/14 (from CLIP or EVA-CLIP depending on VLM)

Concept Heads

Identify if the user-specific concept is present in the image to trigger personalization

Model or implementation: Linear classifier (on CLIP embeddings) or Face Recognition Network
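
As an illustration of a linear concept head, the sketch below trains a logistic-regression classifier on toy 2-D vectors standing in for image embeddings. The training procedure (plain SGD on a logistic loss), the learning rate, and the data are assumptions for illustration; the summary only states that the head is a linear classifier on CLIP embeddings.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def head_score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Probability that the concept is present, from a linear layer plus sigmoid."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_linear_head(
    xs: Sequence[Sequence[float]],
    ys: Sequence[int],
    lr: float = 0.5,    # assumed learning rate
    epochs: int = 200,  # assumed number of passes over the data
) -> Tuple[List[float], float]:
    """Fit weights and bias with stochastic gradient descent on the logistic loss."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = head_score(w, b, x) - y  # gradient of the loss w.r.t. the logit
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
            b -= lr * error
    return w, b


# Hypothetical embeddings: 3 images containing the concept, 3 without it.
positives = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0]]
negatives = [[0.1, 0.9], [0.0, 1.0], [0.2, 0.8]]
w, b = train_linear_head(positives + negatives, [1, 1, 1, 0, 0, 0])

score_pos = head_score(w, b, [0.95, 0.05])
score_neg = head_score(w, b, [0.05, 0.95])
print(score_pos > 0.5, score_neg < 0.5)
```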

Concept Embedding Injection

Insert the learned vector representing the concept into the visual token sequence

Model or implementation: Learnable Vector e*

Language Model

Generate text response contextualizing the concept

Model or implementation: LLM (e.g., OPT, Vicuna)

Novel Architectural Elements

External Concept Heads functioning as 'toggles' to conditionally inject embeddings
Hybrid optimization pipeline where recognition (heads) is decoupled from communication (embeddings)

Modeling

Base Model: BLIP-2 (ViT-L/14 + Q-Former + OPT) or LLaVA (CLIP-ViT-L/14 + Vicuna)

Training Method: Direct optimization of a concept embedding vector (embedding inversion)

Objective Functions:

Purpose: Ensure the model generates the correct personalized caption.

Formally: Cross-entropy loss between the generated caption and the target caption containing the identifier S*.

Purpose: Prevent the concept embedding from dominating attention at the expense of the image features.

Formally: L2 regularization on the attention weights assigned to the concept embedding in the Q-Former.
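
The two objectives can be combined into one training loss. The expression below is a sketch under assumptions: the weighting coefficient \lambda, the summation indices, and the squared-sum form of the regularizer are illustrative notation and are not taken from the paper.

```latex
\mathcal{L}(e^*) \;=\;
\underbrace{-\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\, I,\, e^*\right)}_{\text{caption cross-entropy}}
\;+\; \lambda \,
\underbrace{\sum_{q=1}^{Q} a_{q,e^*}^{2}}_{\text{attention regularizer}}
```

Here y_1, ..., y_T are the tokens of the target caption containing S*, \theta denotes the frozen VLM weights, a_{q,e^*} is the attention weight that query token q assigns to the concept embedding, and Q is the number of query tokens. Only e^* is optimized.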

Adaptation: Learning a single embedding vector (and concept heads); VLM weights remain frozen
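
To make the idea of training only the embedding concrete, here is a toy sketch: a fixed matrix stands in for the frozen model, and gradient descent updates just the embedding to lower a cross-entropy loss on one target token. The matrix `W_frozen`, the three-token vocabulary, the step size, and the step count are all hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def logits_for(W: Sequence[Sequence[float]], e: Sequence[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, e)) for row in W]


def cross_entropy(W, e, target: int) -> float:
    return -math.log(softmax(logits_for(W, e))[target])


def grad_wrt_embedding(W, e, target: int) -> List[float]:
    """Gradient of the cross-entropy with respect to e: W^T (softmax - one_hot)."""
    probs = softmax(logits_for(W, e))
    delta = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [sum(delta[i] * W[i][j] for i in range(len(W))) for j in range(len(e))]


# Stand-in for the frozen model: tuples, so the weights cannot be modified.
W_frozen = ((1.0, 0.0), (0.0, 1.0), (-1.0, -1.0))
target_token = 1            # hypothetical index of the desired output token
embedding = [0.0, 0.0]      # the only trainable parameters

initial_loss = cross_entropy(W_frozen, embedding, target_token)
for _ in range(100):
    grad = grad_wrt_embedding(W_frozen, embedding, target_token)
    embedding = [x - 0.5 * g for x, g in zip(embedding, grad)]
final_loss = cross_entropy(W_frozen, embedding, target_token)
print(initial_loss, final_loss)
```

Because the model weights are never updated, its behavior on inputs that do not include the embedding is unchanged, which is the property the frozen-backbone design relies on.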

Training Data:

3-5 images per concept
10 QA pairs for VQA optimization

Key Hyperparameters:

query_tokens: 32 (BLIP-2)
token_dimension: 768
training_images: 3-5

Compute: Not reported in the paper

Comparison to Prior Work

vs. Textual Inversion: MyVLM injects embeddings into the visual feature space (or Q-Former input) rather than the text encoder input, allowing better visual alignment
vs. DreamBooth: MyVLM freezes the base model to avoid catastrophic forgetting, whereas DreamBooth fine-tunes weights
vs. Model Editing: MyVLM focuses on visual recognition and contextualization of new concepts rather than correcting factual text queries
vs. ELITE [not cited in paper]: ELITE uses an encoder to predict embeddings for generation; MyVLM optimizes embeddings directly for captioning/VQA

Limitations

Requires training separate concept heads and embeddings for each new user concept
Reliance on external heads means personalization fails if the head fails to detect the object
The provided text cuts off before reporting quantitative limitations or failure cases

Reproducibility

Project page: https://snap-research.github.io/MyVLM/

The paper states that the object dataset 'will be publicly available'. Training times and GPU resources are not reported in the provided text.

📊 Experiments & Results

Evaluation Setup

Personalized Image Captioning and Visual Question Answering on user-provided concepts

Benchmarks:

New Object/Individual Dataset (Personalized Captioning/VQA) [New]

Metrics:

Metrics not explicitly defined in the provided text (likely text-image similarity or caption quality metrics)
Statistical methodology: Not explicitly reported in the paper

Experiment Figures

Visualization of attention maps in the LLaVA language model self-attention layers.

Main Takeaways

The authors claim that the method generalizes to unseen images of learned concepts while preserving model behavior on unrelated inputs, because the concept head triggers injection only when the concept is detected.
The authors find that visual features of frozen VLMs are not expressive enough to distinguish specific user concepts alone, justifying the need for external concept heads.
Regularization of attention weights is critical; without it, the learned concept embedding dominates the Q-Former attention, causing the model to ignore the actual image content.
The approach is model-agnostic, demonstrated on both BLIP-2 and LLaVA architectures.
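
The takeaway about attention regularization can be illustrated numerically. The sketch below computes scaled dot-product attention from a few query vectors over image keys plus one concept key, then sums the squared attention weights on the concept key. The function name `concept_attention_penalty`, the squared-sum form, and all values are assumptions for illustration, not the paper's exact formulation.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]


def concept_attention_penalty(
    queries: Sequence[Sequence[float]],
    keys: Sequence[Sequence[float]],
    concept_index: int,
) -> float:
    """Sum over queries of the squared attention weight on the concept key."""
    scale = math.sqrt(len(keys[0]))
    penalty = 0.0
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        penalty += softmax(scores)[concept_index] ** 2
    return penalty


# Hypothetical toy values: 2 queries, 2 image keys, 1 concept key appended last.
queries = [[1.0, 0.0], [0.0, 1.0]]
image_keys = [[0.5, 0.5], [0.2, -0.1]]

small = concept_attention_penalty(queries, image_keys + [[0.1, 0.1]], 2)
large = concept_attention_penalty(queries, image_keys + [[5.0, 5.0]], 2)
print(small, large)
```

A concept key with a large norm draws most of the attention and yields the larger penalty, so minimizing this term pushes attention back toward the image tokens.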

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision-Language Models (VLMs) like BLIP-2 or LLaVA
Familiarity with concept personalization methods such as Textual Inversion (embedding optimization) and DreamBooth (weight fine-tuning)
Basics of Transformer attention mechanisms (cross-attention)

Key Terms

VLM: Vision-Language Model—an AI that processes both images and text to perform tasks like captioning or VQA

Concept Head: An external classification network (e.g., a linear layer on CLIP) trained only to detect whether the target concept is present in the image

Concept Embedding: A learnable vector representation of the user-specific object injected into the model's feature space to represent the concept

Q-Former: Querying Transformer—a component in BLIP-2 that bridges the gap between the frozen vision encoder and the frozen language model

Catastrophic Forgetting: The tendency of neural networks to lose previously learned knowledge when trained on new data

REC: Referring Expression Comprehension—locating a specific object in an image described by a text query (e.g., bounding box prediction)