Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion

📝 Paper Summary

Text-guided image manipulation
Personalization in diffusion models

HiPer enables single-image personalization by optimizing only the tail of the CLIP text embedding to capture subject identity, while using the head for semantic manipulation.

Core Problem

Existing personalization methods (e.g., DreamBooth, Textual Inversion) require multiple reference images, extensive fine-tuning time, or struggle to preserve identity while enabling complex motion/semantic edits.

Why it matters:

Standard diffusion models are stochastic, making it difficult to preserve specific subject identities (e.g., a specific dog's face) during editing
Many real-world scenarios provide only a single image of the subject, rendering multi-shot methods like DreamBooth impractical
Previous methods often trade off editability for identity preservation, failing to handle complex changes like motion or background simultaneously

Concrete Example: Given a single image of a standing dog, turning it into 'a sitting dog' using standard Stable Diffusion loses the specific dog's identity. Imagic often fails to preserve facial details (e.g., hair around head), while Textual Inversion takes ~1 hour to train and struggles with complex motion.

Key Novelty

Highly Personalized (HiPer) Text Embedding Decomposition

Decomposes the text embedding into two parts: a semantic 'head' (from the target prompt) and a personalized 'tail' (optimized to capture the subject)
Optimizes only the tail embedding tokens on a single image to capture identity, leaving the pre-trained model weights frozen
Combines the target text's head embedding with the optimized personalized tail embedding during inference to mix new semantic context with preserved identity
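
The head/tail split and the inference-time recombination described above can be sketched at the level of shapes. This is a minimal illustration under stated assumptions, not the authors' code: embeddings are plain lists of token vectors, the sequence length of 77 is that of the CLIP text encoder used by Stable Diffusion v1, the head is assumed to be simply the first 77 - N token positions, and the helper names `split_embedding` and `compose_embedding` are hypothetical. A vector width of 4 stands in for the real embedding dimension.

```python
from typing import List, Tuple

Embedding = List[List[float]]  # one vector per token position

SEQ_LEN = 77   # CLIP text sequence length in Stable Diffusion v1
N_TAIL = 5     # number of personalized tail tokens (N=5 in the summary)


def split_embedding(emb: Embedding, n_tail: int = N_TAIL) -> Tuple[Embedding, Embedding]:
    """Split a text embedding into a semantic head and a tail of n_tail tokens."""
    if not 0 < n_tail < len(emb):
        raise ValueError("n_tail must be between 1 and len(emb) - 1")
    return emb[:-n_tail], emb[-n_tail:]


def compose_embedding(target_emb: Embedding, personalized_tail: Embedding) -> Embedding:
    """Keep the target prompt's head and append the optimized tail."""
    head, _ = split_embedding(target_emb, len(personalized_tail))
    return head + personalized_tail


# Toy stand-ins for encoder outputs (4-dim vectors, assumed for illustration).
target_emb = [[0.0] * 4 for _ in range(SEQ_LEN)]
optimized_tail = [[1.0] * 4 for _ in range(N_TAIL)]

composite = compose_embedding(target_emb, optimized_tail)
```

The composite keeps the full sequence length, so it can be passed to the frozen denoiser wherever an ordinary prompt embedding would go.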

Architecture

Overview of the HiPer method: Training phase optimizing the tail embedding, and Inference phase concatenating target head with optimized tail.

Evaluation Highlights

Achieves personalization training in ~3 minutes using a single image, compared to ~15-40 minutes for baselines
Outperforms Imagic (Stable Diffusion version) and Textual Inversion in user studies for semantic alignment (4.52 vs 3.73/3.07) and identity preservation (4.10 vs 3.25/3.57)
Demonstrates simultaneous manipulation of motion, background, and texture (e.g., making a specific dog jump in a painting style) where baselines often fail

Breakthrough Assessment

7/10

Simple yet highly effective insight about embedding decomposition. Drastically reduces compute/time vs DreamBooth/Textual Inversion while offering better single-image editing flexibility.

⚙️ Technical Details

Problem Definition

Setting: Text-guided image editing with identity preservation using a pre-trained text-to-image diffusion model

Inputs: Single source image x_src, source text prompt y_src, target text prompt y_tgt

Outputs: Edited image x_tgt aligning with y_tgt while preserving subject identity from x_src

Pipeline Flow

Text Embedding Decomposition
Optimization (Training)
Inference (Editing)

System Modules

Text Encoder (CLIP) (Input Processing)

Converts text prompts into token embeddings

Model or implementation: CLIP (frozen)

Embedding Decomposition (Input Processing)

Splits embedding into semantic head (informative) and personalized tail (to be optimized)

Model or implementation: N/A (Slicing operation)

U-Net (Stable Diffusion)

Predicts the noise in latent representations conditioned on the composite embedding; during training, its denoising loss supplies the gradient used to optimize the tail

Model or implementation: Stable Diffusion (frozen)

Inference Composer

Combines target text semantic head with optimized personalized tail

Model or implementation: N/A (Concatenation)

Novel Architectural Elements

Decomposition of CLIP embedding space into 'semantic head' (fixed) and 'personalized tail' (optimized)
Composite embedding construction mixing target semantics with source personalization tokens without model fine-tuning

Modeling

Base Model: Stable Diffusion (publicly available pretrained version)

Training Method: Embedding Optimization (Latent Space)

Objective Functions:

Purpose: Minimize reconstruction error of the source image by optimizing only the personalized tail embedding.

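Formally: min_{e_hper} E_{t, epsilon} || epsilon - epsilon_theta(x_t, t, [e_src'; e_hper]) ||^2, where e_src' is the fixed head of the source-prompt embedding, e_hper is the personalized tail, and [· ; ·] denotes concatenation along the token axis.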
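
To make the objective concrete, here is a toy sketch of optimizing only the tail while everything else stays frozen. It is not the authors' implementation: the denoiser is a hypothetical linear function rather than a U-Net, tokens are scalars, the noise schedule is a simple square-root blend, and plain stochastic gradient descent is assumed in place of whatever optimizer the paper uses. Only the learning rate and step count are taken from the summary.

```python
import math
import random

random.seed(0)

X0 = 1.0                               # toy "latent" of the single source image
W_X = 0.5                              # frozen denoiser weight on the noisy latent
W_C = [0.3, -0.2, 0.5, 0.7, -0.4]      # frozen denoiser weights on the tokens
N_TAIL = 2                             # toy tail length (the summary uses N=5)


def add_noise(x0, t, eps):
    """Toy forward process: blend the clean latent with noise at level t."""
    return math.sqrt(1.0 - t) * x0 + math.sqrt(t) * eps


def denoiser(x_t, cond):
    """Frozen stand-in for epsilon_theta, linear in the latent and the tokens.

    The timestep argument is omitted because this toy model ignores it.
    """
    return W_X * x_t + sum(w * c for w, c in zip(W_C, cond))


def loss(head, tail, samples):
    """Mean squared error between true and predicted noise over fixed samples."""
    total = 0.0
    for t, eps in samples:
        pred = denoiser(add_noise(X0, t, eps), head + tail)
        total += (eps - pred) ** 2
    return total / len(samples)


head = [0.2, -0.1, 0.4]    # fixed head of the source embedding (e_src')
tail = [0.0, 0.0]          # personalized tail (e_hper), the only trainable values
head_before = list(head)

eval_samples = [(random.random(), random.gauss(0.0, 1.0)) for _ in range(2000)]
loss_before = loss(head, tail, eval_samples)

LR, STEPS = 5e-3, 1000     # values quoted in the summary
for _ in range(STEPS):
    t, eps = random.random(), random.gauss(0.0, 1.0)
    residual = eps - denoiser(add_noise(X0, t, eps), head + tail)
    # d(residual**2)/d(tail[i]) = -2 * residual * W_C[len(head) + i]
    for i in range(N_TAIL):
        tail[i] += LR * 2.0 * residual * W_C[len(head) + i]

loss_after = loss(head, tail, eval_samples)
```

The point of the sketch is the update rule: gradients flow through the frozen denoiser, but only the tail entries are ever written to, so the head and the model weights are untouched after training.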

Adaptation: Text Embedding Optimization (similar to Textual Inversion but partial/decomposed)

Trainable Parameters: Only the last N=5 tokens of the text embedding (e_hper)

Training Data:

Single source image
Source text prompt describing the image (e.g., 'a standing dog')

Key Hyperparameters:

optimization_steps: 1000
learning_rate: 5e-3
personalized_tokens_N: 5
inference_scale_factor: 0.8

Compute: Training time: ~3 minutes on an NVIDIA GeForce RTX 3090

Comparison to Prior Work

vs. Imagic: HiPer requires no model fine-tuning and uses a simpler single-stage optimization of embedding tails. Imagic interpolates between embeddings, which hurts identity preservation.
vs. DreamBooth: HiPer works with 1 image vs 3-5, no model weight updates (storage efficient), faster training.
vs. Textual Inversion: HiPer optimizes only the uninformative tail rather than a whole new token, and explicitly mixes target semantics via vector concatenation rather than just prompt composition.

Limitations

Struggles with counting/numeracy changes (e.g., 'two baskets' often fails to change number of objects)
Color attributes can sometimes be inconsistent if not explicitly specified in target prompt
Ineffective for complex artificial products compared to natural subjects (animals, humans)
Depends on quality of base Stable Diffusion model (generates unnatural artifacts outside personalization region)

Reproducibility

Code: http://hiper0.github.io/

Code availability is stated via the project page listed above. The method uses a standard Stable Diffusion backbone, and the hyperparameters (N=5, steps=1000, lr=5e-3) are given explicitly. Datasets used: Ted (from Imagic) and LAION.

📊 Experiments & Results

Evaluation Setup

Qualitative and quantitative comparison on text-guided image manipulation tasks (motion, background, style)

Benchmarks:

Custom dataset (Ted from Imagic + LAION) (Image editing)

Metrics:

CLIP Score (Semantic Alignment)
Identity Preservation (L2 distance/Similarity)
User Study (1-5 scale for Identity and Semantic alignment)
Training Time

Statistical methodology: User study with 20 participants via Google Forms
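
The summary does not say how the CLIP Score metric is computed. A common definition, assumed here, is the cosine similarity between the CLIP image embedding of the edited image and the CLIP text embedding of the target prompt. The sketch below shows only that similarity computation; the two vectors are hypothetical 3-dimensional stand-ins, not real CLIP outputs.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for an image embedding and a text embedding.
image_emb = [1.0, 2.0, 2.0]
text_emb = [2.0, 1.0, 2.0]
score = cosine_similarity(image_emb, text_emb)
```

Under this definition a higher score means the edited image sits closer to the target prompt in CLIP space; it says nothing about whether the subject's identity was preserved, which is why the identity metrics are reported separately.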

Key Results

Quantitative comparison shows HiPer outperforms Stable Diffusion-based baselines in user preference and achieves competitive CLIP scores with significantly faster training.

Benchmark	Metric	Baseline	This Paper	Δ
User Study	Semantic Alignment (1-5)	3.731	4.520	+0.789
User Study	Identity Preservation (1-5)	3.251	4.099	+0.848
Training Time	Minutes	14.08	3.0	-11.08
CLIP Score	Text Alignment	0.1955	0.2047	+0.0092
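
As a quick arithmetic check, the Δ column can be recomputed from the baseline and HiPer values in the table. The snippet below uses only numbers from the table; the training-time ratio is a derived figure, not one reported in the summary.

```python
# (metric, baseline, this paper) as listed in the table
rows = [
    ("Semantic Alignment", 3.731, 4.520),
    ("Identity Preservation", 3.251, 4.099),
    ("Training Time (min)", 14.08, 3.0),
    ("CLIP Score", 0.1955, 0.2047),
]

# Delta = this paper minus baseline, rounded to the table's precision.
deltas = {name: round(ours - base, 4) for name, base, ours in rows}

# Baseline training time divided by HiPer training time.
speedup = 14.08 / 3.0

for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
print(f"Training-time ratio: {speedup:.2f}x")
```

The recomputed differences agree with the table, and the training-time row corresponds to roughly a 4.7x reduction.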

Experiment Figures

Cross-attention maps comparing standard embeddings vs HiPer embeddings.

Ablation study on the number of personalized tokens N.

Main Takeaways

HiPer effectively separates semantic content (head) from identity (tail), allowing diverse edits (motion, style) without losing the subject.
Increasing the number of personalized tokens (N) improves identity preservation but reduces editability, because the tail overfits to the source image; N=5 gives the best trade-off.
Cross-attention analysis confirms the 'tail' tokens in standard CLIP embeddings are uninformative, making them ideal candidates for carrying personalized identity info without interfering with the prompt's semantic structure.
Imagic (when run on Stable Diffusion) suffers from poor identity preservation due to embedding interpolation; HiPer's concatenation strategy preserves structure better.

📚 Prerequisite Knowledge

Prerequisites

Latent Diffusion Models (Stable Diffusion)
CLIP text embeddings and tokenization
Textual Inversion / DreamBooth concepts

Key Terms

CLIP: Contrastive Language-Image Pre-training, a model that learns to associate images and text in a shared embedding space

Stable Diffusion: A latent text-to-image diffusion model that generates images from noise conditioned on text embeddings

Textual Inversion: A method to find a new token embedding that represents a specific concept/object, enabling personalization

DreamBooth: A method that fine-tunes all of the diffusion model's weights to bind a unique identifier to a specific subject

DDPM: Denoising Diffusion Probabilistic Models, generative models that learn to reverse a gradual noise-addition process

cross-attention: Mechanism in Transformers where the image generation process attends to the text tokens to guide synthesis