JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation

📝 Paper Summary

Personalized Text-to-Image Generation · Finetuning-free Customization

JeDi enables finetuning-free personalization by training a diffusion model on the joint distribution of multiple images sharing a subject, allowing new images to be generated via inpainting conditioned on reference images.

Core Problem

Existing personalization methods either require resource-intensive finetuning (slow, prone to overfitting) or use encoder-based finetuning-free approaches that suffer from information loss, failing to preserve identity details of uncommon subjects.

Why it matters:

Users want to generate images of their specific possessions in new contexts without waiting for slow training processes
Encoder-based methods often output generic versions of specific objects (e.g., a generic dog instead of a specific pet) due to compression loss
Finetuning-based methods struggle with overfitting when only a few reference images are available

Concrete Example: When given a reference image of a specific, unusual stuffed toy and asked to generate it 'on the beach', encoder-based methods like BLIP-Diffusion might generate a generic teddy bear, losing the unique texture and shape of the original toy.

Key Novelty

Joint-Image Diffusion (JeDi)

Treats personalization as an inpainting task within a joint distribution: the model learns to generate a set of related images simultaneously rather than independent samples
Uses 'Coupled Self-Attention' layers where pixels in one image can attend to pixels in all other images of the same image set, establishing correspondence without explicit encoding
Creates a large-scale synthetic dataset (S3) of same-subject image clusters using LLMs and existing diffusion models to train this joint distribution capability

Architecture

Illustration of the Coupled Self-Attention mechanism compared to standard self-attention.

Evaluation Highlights

Outperforms finetuning-free baselines (ELITE, BLIP-Diffusion) and even finetuning-based methods (DreamBooth, CustomDiffusion) in subject fidelity metrics
Achieves higher CLIP-I and DINO scores than DreamBooth on the DreamBooth dataset, indicating better preservation of subject identity
Generates high-fidelity results using as few as one reference image without any optimization at test time

Breakthrough Assessment

8/10

Significant advance in finetuning-free personalization. By abandoning the encoder bottleneck in favor of joint attention, it addresses the identity preservation issue that plagued prior instant customization methods.

⚙️ Technical Details

Problem Definition

Setting: Given reference images of a subject and a text prompt, generate a new image of that subject following the prompt without updating model weights.

Inputs: A set of reference images x_ref and a target text prompt y

Outputs: A generated image x_gen adhering to prompt y and depicting the subject in x_ref

Pipeline Flow

Data Generation (S3 Dataset Creation)
Joint-Image Training (Coupled Self-Attention U-Net)
Inference (Personalization as Inpainting)

System Modules

S3 Dataset Generator

Create training data of image clusters sharing subjects

Model or implementation: ChatGPT + SDXL + InstructPix2Pix

Joint-Image U-Net (Inference)

Denoise multiple images simultaneously while sharing information across them

Model or implementation: Modified SDXL U-Net with Coupled Self-Attention

Sampler (Inference)

Iterative denoising with image guidance

Model or implementation: DDIM / Standard Diffusion Sampler
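
The inference stage described above (denoise only the new image while the references stay clean) can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' sampler: `personalize`, `eps_model`, and the `abars` schedule are hypothetical names, the update is the standard deterministic DDIM step (eta = 0), and image guidance is omitted.

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """Deterministic DDIM update between cumulative alphas abar_t -> abar_prev."""
    # Estimate the clean image implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Re-noise that estimate to the (lower) noise level abar_prev.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

def personalize(refs, eps_model, abars, rng):
    """refs: (R, C, H, W) clean references. Returns one generated image (C, H, W).

    eps_model(image_set, abar) is a hypothetical joint denoiser that returns a
    noise estimate for every slot of the set; abars runs from noisy to clean.
    """
    x = rng.normal(size=refs.shape[1:])  # the target slot starts as pure noise
    for i in range(len(abars) - 1):
        # References are fed in clean at every step; only the last slot is noisy.
        image_set = np.concatenate([refs, x[None]], axis=0)
        eps_pred = eps_model(image_set, abars[i])[-1]
        x = ddim_step(x, eps_pred, abars[i], abars[i + 1])
    return x

# Toy usage with a stand-in denoiser that always predicts zero noise.
rng = np.random.default_rng(0)
refs = rng.normal(size=(2, 3, 8, 8))
toy_model = lambda image_set, abar: np.zeros_like(image_set)
generated = personalize(refs, toy_model, np.linspace(0.05, 1.0, 10), rng)
```

Only the target slot is updated in the loop, which is what makes the procedure inpainting over the image set rather than ordinary sampling.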

Novel Architectural Elements

Coupled Self-Attention: Reshaping attention inputs to enable cross-image attention within a single batch forward pass, linking features of the generated image directly to reference images.
Input Concatenation Scheme: Modifying U-Net input to accept a concatenated list of noisy images, clean reference images, and masks for the inpainting-style task formulation.
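
The reshaping idea behind Coupled Self-Attention can be illustrated with a small NumPy sketch. It assumes that the images of one set are stored contiguously along the batch axis and that queries, keys, and values of the whole set are merged into one token sequence; the function names, the single-head layout, and the projection matrices `wq`, `wk`, `wv` are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q: (..., Lq, C); k, v: (..., Lk, C)
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def coupled_self_attention(feats, set_size, wq, wk, wv):
    """feats: (B * set_size, L, C) tokens; images of one set are contiguous."""
    bn, L, C = feats.shape
    B = bn // set_size
    q, k, v = feats @ wq, feats @ wk, feats @ wv
    # Merge the tokens of all images in a set into one long sequence, so every
    # token can attend to every token of every image in the same set.
    q = q.reshape(B, set_size * L, C)
    k = k.reshape(B, set_size * L, C)
    v = v.reshape(B, set_size * L, C)
    out = attention(q, k, v)
    # Split the sequence back into per-image tokens.
    return out.reshape(bn, L, C)

rng = np.random.default_rng(0)
B, N, L, C = 2, 3, 4, 8  # 2 sets of 3 images, 4 tokens each, 8 channels
feats = rng.normal(size=(B * N, L, C))
wq, wk, wv = (rng.normal(size=(C, C)) for _ in range(3))
out = coupled_self_attention(feats, N, wq, wk, wv)
```

With `set_size=1` the function reduces to ordinary per-image self-attention, so the coupling comes entirely from the reshape.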

Modeling

Base Model: SDXL (Stable Diffusion XL)

Training Method: Joint-Image Training via epsilon-prediction

Objective Functions:

Purpose: Denoise the joint image set.

Formally: L = E_{x, epsilon, t} [ || epsilon - epsilon_theta(x_t, t, c) ||^2 ], where x is the image set, x_t is its noised version at timestep t, and c is the text conditioning.

Purpose: Train for inpainting capability.

Formally: L_inpainting includes masking reference images with probability 0.5 during training.
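
The two objectives above can be combined in one training-example sketch. Several details here are assumptions made for illustration and are not stated in this summary: that all images except the last act as references when references are used, that the U-Net input stacks noisy image, clean reference, and mask along the channel axis, and that the loss is averaged only over non-reference slots. `make_training_example` and `eps_loss` are hypothetical names.

```python
import numpy as np

def make_training_example(x, abar_t, rng, p_ref=0.5):
    """x: (N, C, H, W) clean image set. Returns (unet_in, eps, is_ref)."""
    n = x.shape[0]
    eps = rng.normal(size=x.shape)
    # Standard forward diffusion: x_t = sqrt(abar) * x + sqrt(1 - abar) * eps.
    x_t = np.sqrt(abar_t) * x + np.sqrt(1.0 - abar_t) * eps
    # Assumption: with probability p_ref, every image but the last is a reference.
    is_ref = np.zeros(n, dtype=bool)
    if rng.random() < p_ref:
        is_ref[:-1] = True
    mask = np.broadcast_to(
        is_ref[:, None, None, None], (n, 1) + x.shape[2:]
    ).astype(x.dtype)
    ref = x * mask  # clean pixels in reference slots, zeros elsewhere
    # Channel-wise concatenation: noisy image, clean reference, mask.
    unet_in = np.concatenate([x_t, ref, mask], axis=1)  # (N, 2C + 1, H, W)
    return unet_in, eps, is_ref

def eps_loss(eps_pred, eps, is_ref):
    # Assumption: only the slots being generated contribute to the loss.
    target = ~is_ref
    return float(np.mean((eps_pred[target] - eps[target]) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4, 8, 8))
unet_in, eps, is_ref = make_training_example(x, abar_t=0.5, rng=rng)
```

Setting `p_ref=0.0` recovers plain joint denoising of the whole set, while `p_ref=1.0` always trains the inpainting case.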

Training Data:

Synthetic Same-Subject (S3) dataset (1.6M sets)
WebVid10M video frames
Objaverse rendered multi-view images
LAION aesthetic (single images)

Key Hyperparameters:

image_set_size_training: Randomly 2, 3, or 4
reference_usage_probability: 0.5
CLIP_filtering_threshold: 0.95

Compute: Not reported in the paper

Comparison to Prior Work

vs. DreamBooth: JeDi requires NO test-time optimization (finetuning-free)
vs. BLIP-Diffusion/ELITE: JeDi uses joint attention instead of compressing images into embeddings, preserving more detail
vs. PhotoMaker [not cited in paper]: PhotoMaker creates a stacked ID embedding; JeDi uses coupled attention directly on spatial features.

Limitations

Requires constructing a large synthetic dataset of same-subject images for training.
Inference cost scales with the number of reference images due to joint processing in attention layers.
Heavily relies on the quality of the synthetic data generator (SDXL) and background augmentation.

Reproducibility

Code: https://research.nvidia.com/labs/dir/jedi

A project page is available at https://research.nvidia.com/labs/dir/jedi. The paper details the data synthesis pipeline, which uses public models (ChatGPT, SDXL). A code release is described as likely, with the project page as the expected location.

📊 Experiments & Results

Evaluation Setup

Personalized generation on the DreamBooth test set.

Benchmarks:

DreamBooth Test Set (Subject-driven generation)

Metrics:

CLIP-T (Text Alignment)
CLIP-I (Identity Preservation)
DINO (Identity Preservation)

Statistical methodology: Not explicitly reported in the paper
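
As a rough illustration of how the identity metrics listed above are usually computed, CLIP-I and DINO are both commonly defined as the mean pairwise cosine similarity between embeddings of generated images and embeddings of real reference images, differing only in the encoder. The sketch below assumes the embeddings have already been extracted; `mean_pairwise_cosine` is a hypothetical helper, and the random arrays stand in for real encoder outputs.

```python
import numpy as np

def mean_pairwise_cosine(gen_emb, ref_emb):
    """gen_emb: (G, D), ref_emb: (R, D). Mean cosine similarity over all G*R pairs."""
    g = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    return float((g @ r.T).mean())

rng = np.random.default_rng(0)
gen_emb = rng.normal(size=(4, 16))  # placeholder for embeddings of generated images
ref_emb = rng.normal(size=(5, 16))  # placeholder for embeddings of reference images
score = mean_pairwise_cosine(gen_emb, ref_emb)
```

CLIP-T follows the same pattern with a text embedding of the prompt in place of the reference-image embeddings.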

Key Results

JeDi outperforms both finetuning-based and finetuning-free baselines on identity preservation metrics (DINO, CLIP-I) while maintaining competitive text alignment.

Benchmark	Metric	Baseline	This Paper	Δ
DreamBooth Test Set	DINO	0.668	0.735	+0.067
DreamBooth Test Set	DINO	0.594	0.735	+0.141
DreamBooth Test Set	CLIP-I	0.767	0.835	+0.068
DreamBooth Test Set	CLIP-T	0.306	0.315	+0.009
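
The Δ column can be re-derived from the Baseline and This Paper columns with a few lines of Python. The scores are copied from the rows above; the relative gain is an extra derived quantity for context and is not a figure reported in this summary.

```python
# (metric, baseline score, JeDi score), in the same order as the table rows.
rows = [
    ("DINO", 0.668, 0.735),
    ("DINO", 0.594, 0.735),
    ("CLIP-I", 0.767, 0.835),
    ("CLIP-T", 0.306, 0.315),
]

def deltas(rows):
    """Return (metric, absolute gain, relative gain in percent) per row."""
    return [
        (metric, round(ours - base, 3), round(100.0 * (ours - base) / base, 1))
        for metric, base, ours in rows
    ]

for metric, absolute, relative in deltas(rows):
    print(f"{metric}: +{absolute:.3f} ({relative:+.1f}% relative)")
```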

Experiment Figures

Visualization of attention maps in the coupled self-attention layers.

Main Takeaways

The encoder-free architecture (Joint-Image Diffusion) preserves fine-grained details significantly better than encoder-based methods.
The method scales effectively with the number of reference images; more references provide more views for the attention mechanism to attend to.
Synthetic data generation using LLMs and SDXL is a viable strategy for training personalization models without collecting real-world same-subject datasets.

📚 Prerequisite Knowledge

Prerequisites

Denoising Diffusion Probabilistic Models (DDPM)
U-Net architecture with Self-Attention and Cross-Attention
Classifier-free guidance
Text-to-Image generation basics (latent diffusion)

Key Terms

Joint-Image Diffusion: A framework where the model learns the joint probability distribution of multiple images sharing a common subject, rather than independent marginal distributions.

Coupled Self-Attention: A modification to attention layers where the key/value pairs are concatenated across all images in an image set, allowing each image to attend to features in every other image of that set.

Inpainting: The process of reconstructing missing parts of an image; here, generating a new personalized image is treated as 'inpainting' the missing part of a joint image set given the reference images.

Classifier-free guidance: A sampling technique that improves alignment with the conditioning signal by extrapolating from the unconditional noise prediction past the conditional one.

S3 Dataset: Synthetic Same-Subject dataset created by the authors, containing clusters of images depicting the same subject in different poses/backgrounds.
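
For the classifier-free guidance term defined above, the usual combination rule can be written in a couple of lines. This is the generic formulation; the guidance scale and the way JeDi applies guidance to reference images are not specified in this summary, so the values below are purely illustrative.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Move from the unconditional prediction toward (and past) the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 3.0])
# Illustrative scale; scale = 1 returns eps_cond, scale > 1 extrapolates beyond it.
guided = classifier_free_guidance(eps_uncond, eps_cond, scale=2.0)
```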