Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm

📝 Paper Summary

Personalized Text-to-Image Generation, Identity Preservation

Infinite-ID decouples identity and text processing in diffusion models by using separate training paths and a mixed attention mechanism to merge them during inference, improving both identity fidelity and semantic consistency.

Core Problem

Existing methods entangle reference image identity features with text prompt features, forcing a trade-off where improving identity fidelity degrades prompt adherence (semantic consistency) and vice versa.

Why it matters:

Methods like PhotoMaker compress image features into text space, weakening identity details
Methods like IP-Adapter inject strong image features directly into the U-Net, often overpowering text prompts and ignoring semantic instructions
Current tuning-free personalization struggles to generate high-fidelity portraits in complex, novel scenes described by text

Concrete Example: When prompted with 'a man in a superman costume' using an image of Elon Musk, prior methods might produce a generic Superman (losing Elon's face) or a photo of Elon in a suit (ignoring the 'superman costume' text). Infinite-ID aims to produce Elon's exact face in the correct costume.

Key Novelty

ID-Semantics Decoupling Paradigm

Trains the model to recognize identity using only image inputs (ignoring text) via a specialized image cross-attention module, preventing text from interfering with identity learning
Reintroduces text during inference using a separate 'Mixed Attention' mechanism that fuses features from the text-driven self-attention and identity-driven cross-attention layers
Uses an AdaIN-mean operation to normalize feature statistics, allowing precise style control without retraining
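The AdaIN-mean idea in the last point can be illustrated with a small sketch. It assumes that "AdaIN-mean" means AdaIN without the variance rescaling, i.e. shifting features so their per-channel mean matches that of a reference (`x - mean(x) + mean(y)`). That reading, the function names, and the toy inputs are assumptions for illustration, not the authors' implementation.

```python
def channel_means(tokens):
    """Per-channel mean over a list of equal-length feature vectors."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def adain_mean(x, y):
    """Shift x so its per-channel mean equals that of y (variance untouched).

    Assumed form: x - mean(x) + mean(y).
    """
    mu_x = channel_means(x)
    mu_y = channel_means(y)
    return [[v - mx + my for v, mx, my in zip(row, mu_x, mu_y)] for row in x]


x = [[1.0, 2.0], [3.0, 4.0]]    # features to adjust (2 tokens, 2 channels)
y = [[10.0, 0.0], [20.0, 0.0]]  # reference features supplying the target mean
print(adain_mean(x, y))         # [[14.0, -1.0], [16.0, 1.0]]
```

Only the mean is moved, so the spread of `x` around its mean is preserved while its overall offset follows the reference.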

Architecture

Figure: The training pipeline of Infinite-ID.

Evaluation Highlights

Achieves the highest ID fidelity among the compared methods (0.83 DINO score), ahead of PhotoMaker (0.76) and IP-Adapter (0.78) on the custom evaluation set
Maintains strong semantic consistency (0.28 CLIP-T score), outperforming IP-Adapter (0.26) while matching the text-focused PhotoMaker
Demonstrates superior qualitative performance in style transfer tasks compared to StyleAligned and InstantStyle

Breakthrough Assessment

7/10

Strong methodological contribution in decoupling ID/text streams to solve the fidelity-consistency trade-off. Results show clear quantitative improvement over popular baselines like IP-Adapter.

⚙️ Technical Details

Problem Definition

Setting: Tuning-free personalized text-to-image generation given a single reference image

Inputs: Reference identity image x_id and text prompt T

Outputs: Generated image x_gen preserving identity of x_id and semantics of T

Pipeline Flow

Face Embeddings Extractor (processes reference image)
Identity-Enhanced Training (learning ID without text)
Mixed Attention Inference (merging separate ID and Text streams)

System Modules

Face Embeddings Extractor

Extract comprehensive identity features from the reference image

Model or implementation: CLIP Image Encoder + Face Recognition Backbone

Image Cross-Attention (Generation)

Inject identity information into the U-Net

Model or implementation: Trainable Cross-Attention Layers (added to SDXL U-Net)

Mixed Attention Mechanism (Generation)

Fuse text and identity features during inference

Model or implementation: Custom Attention Operation

Novel Architectural Elements

Decoupled training strategy: Deactivating text cross-attention during ID training
Mixed Attention Module: Concatenating text self-attention keys/values with identity cross-attention keys/values
AdaIN-mean operation within attention blocks for style alignment
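The key/value concatenation described for the Mixed Attention Module can be sketched in plain Python. This is a minimal single-head illustration on toy vectors: the assumption that queries come from the text-driven stream, along with all function and variable names, is hypothetical and not taken from the paper's code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


def mixed_attention(q_text, k_text, v_text, k_id, v_id):
    """Attend over text-stream and identity-stream keys/values jointly.

    Assumption: queries come from the text-driven stream, and the two
    key/value sets are concatenated along the token axis.
    """
    return attention(q_text, k_text + k_id, v_text + v_id)


q = [[1.0, 0.0]]
k_text, v_text = [[1.0, 0.0]], [[1.0, 0.0]]
k_id, v_id = [[1.0, 0.0]], [[0.0, 1.0]]
print(mixed_attention(q, k_text, v_text, k_id, v_id))  # [[0.5, 0.5]]
```

Because both key sets share one softmax, each query decides per token how much to draw from the text stream versus the identity stream, rather than the two being summed with a fixed weight.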

Modeling

Base Model: Stable Diffusion XL (SDXL)

Training Method: Identity-enhanced training on top of frozen SDXL

Objective Functions:

Purpose: Denoising score matching.

Formally: Standard diffusion loss L_diffusion using identity condition c_id instead of text
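For reference, the standard noise-prediction objective of latent diffusion models, conditioned on the identity embedding rather than text, would typically be written as below. The symbols $z_t$ (noisy latent at timestep $t$), $\epsilon$ (sampled Gaussian noise), and $\epsilon_\theta$ (the U-Net noise predictor) are assumed notation; the summary only names $L_{\text{diffusion}}$ and $c_{id}$.

```latex
L_{\text{diffusion}}
  = \mathbb{E}_{z_0,\; \epsilon \sim \mathcal{N}(0, I),\; t}
    \left[ \left\lVert \epsilon - \epsilon_\theta\!\left(z_t, t, c_{id}\right) \right\rVert_2^2 \right]
```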

Adaptation: Trainable adapters: Face mapper, CLIP mapper, Image Cross-Attention layers

Trainable Parameters: Face/CLIP mappers and Image Cross-Attention modules only (SDXL U-Net frozen)

Key Hyperparameters:

CLIP_embedding_length: 257
Face_embedding_dim: 512
Projected_dim: 1664 (aligned with text features)

Compute: Not reported in the paper

Comparison to Prior Work

vs. PhotoMaker: Decouples ID from text embedding space to prevent ID information loss
vs. IP-Adapter: Trains without text prompts to force stronger ID learning; uses mixed attention for better semantic merging
vs. DreamBooth: Tuning-free (feed-forward) rather than requiring per-subject fine-tuning

Limitations

Reliance on accurate face detection and alignment
Requires pre-trained face recognition backbone
Computational cost of additional attention layers during inference

Reproducibility

Code: Not yet released.

Project page: https://infinite-id.github.io/. The method builds on pre-trained components (SDXL, CLIP, face recognition backbone).

📊 Experiments & Results

Evaluation Setup

Personalized image generation using a test set of identity images

Benchmarks:

Custom Evaluation Set (Identity-preserving generation) [New]

Metrics:

DINO Score (Identity Fidelity)
CLIP-I (Identity Fidelity)
CLIP-T (Text/Semantic Consistency)
Face Similarity (Face Sim)

Statistical methodology: Not explicitly reported in the paper
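Embedding-based metrics of the kind listed above (DINO, CLIP-I, CLIP-T, face similarity) are conventionally computed as the cosine similarity between two embeddings from the relevant encoder. The sketch below shows only that final step on toy vectors. Whether the paper's evaluation follows exactly this recipe is an assumption, and the embeddings here are placeholders, not encoder outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_score(generated, reference):
    """Average similarity of each generated embedding to one reference."""
    return sum(cosine_similarity(g, reference) for g in generated) / len(generated)


reference = [1.0, 0.0]                # placeholder reference embedding
generated = [[1.0, 0.0], [0.0, 1.0]]  # placeholder generated-image embeddings
print(mean_score(generated, reference))  # 0.5
```

For image-image metrics both vectors would come from the same image encoder, while for CLIP-T one would come from the CLIP text encoder and the other from the CLIP image encoder.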

Key Results

Quantitative comparison against state-of-the-art tuning-free personalization methods.

Benchmark	Metric	Baseline	This Paper	Δ
Custom Evaluation Set	DINO (Identity Fidelity)	0.783	0.834	+0.051
Custom Evaluation Set	CLIP-T (Semantic Consistency)	0.264	0.282	+0.018
Custom Evaluation Set	Face Sim	0.686	0.795	+0.109
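The Δ column is the absolute difference between the reported score and the baseline. A quick check of that arithmetic, using only the values from the table (the variable names are illustrative):

```python
# (metric, baseline, this paper) as reported in the table above
rows = [
    ("DINO", 0.783, 0.834),
    ("CLIP-T", 0.264, 0.282),
    ("Face Sim", 0.686, 0.795),
]

# Absolute improvement, rounded to the table's three decimals
deltas = {name: round(ours - base, 3) for name, base, ours in rows}
print(deltas)  # {'DINO': 0.051, 'CLIP-T': 0.018, 'Face Sim': 0.109}

# Relative improvement over the baseline, in percent (derived, not reported)
relative = {name: 100 * (ours - base) / base for name, base, ours in rows}
for name, pct in relative.items():
    print(f"{name}: {pct:.1f}% over baseline")
```

The relative figures are derived here for context only and do not appear in the source.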

Experiment Figures

The inference-time Mixed Attention Mechanism.

Main Takeaways

Infinite-ID achieves a superior balance between ID fidelity and text consistency compared to PhotoMaker (good text, weak ID) and IP-Adapter (good ID, weak text).
The method generalizes well to style transfer tasks, maintaining structure while applying style via the AdaIN-mean operation.
Qualitative results show better preservation of facial details (e.g., gaze, expression) than competitors.

📚 Prerequisite Knowledge

Prerequisites

Diffusion Models (Stable Diffusion / SDXL)
Cross-Attention vs Self-Attention mechanisms
Contrastive Language-Image Pre-training (CLIP)
AdaIN (Adaptive Instance Normalization)

Key Terms

AdaIN: Adaptive Instance Normalization, a technique that shifts and rescales content features so that their mean and variance match those of the reference (style) features

U-Net: The core neural network architecture in Stable Diffusion that predicts noise to denoise images

Cross-Attention: Attention mechanism where the model attends to external conditioning signals (like text or reference images)

Self-Attention: Attention mechanism where the model attends to its own internal spatial features to maintain structural consistency

CLIP: Contrastive Language-Image Pre-training—a model that embeds images and text into a shared space