InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity

📝 Paper Summary

Identity-preserved image generation, Text-to-Image Generation

InfiniteYou injects identity features into DiT-based models like FLUX via a separate residual branch (InfuseNet) rather than modifying attention layers, enhancing identity preservation without compromising generation quality.

Core Problem

Existing identity-preservation methods for DiTs (like FLUX) rely on modifying attention layers via IP-Adapters, which degrades text alignment, aesthetics, and base model generation capabilities.

Why it matters:

Current methods struggle with 'face copy-paste' artifacts where the identity is preserved but the image looks unnatural or poorly aligned with the text prompt
State-of-the-art DiT models like FLUX offer superior generation quality over U-Nets (SDXL), but effective identity-injection modules for them are scarce
Modifying attention layers directly (standard practice) entangles text and identity control, causing conflict and reducing the model's aesthetic quality

Concrete Example: When asking for 'a woman wearing a VR headset' with a specific identity, standard IPA-based methods might paste the face awkwardly or ignore the headset to preserve the face. InfiniteYou generates the headset correctly while keeping the identity natural.

Key Novelty

InfuseNet: A Parallel Residual Identity Branch

Instead of modifying the base model's attention layers (like IP-Adapter), InfuseNet runs as a parallel branch that injects identity features solely through residual connections
Treats identity injection as a control signal (similar to ControlNet) rather than a texture override, disentangling it from the text prompts processed by the base model
Uses a multi-stage training strategy with synthetic Single-Person-Multiple-Sample (SPMS) data to teach the model robust identity preservation across diverse styles

Architecture

The overall framework of InfiniteYou (InfU) showing the InfuseNet parallel branch interacting with the frozen FLUX base model.

Evaluation Highlights

Achieves higher identity similarity (Identity Score) than PuLID-FLUX and InstantX IP-Adapter on benchmark tests
Shows marked qualitative improvements in text-image alignment and aesthetic quality over IP-Adapter methods, which often produce copy-paste artifacts
Successfully disentangles identity from style, allowing flexible recrafting (e.g., changing age, accessories) where baselines fail

Breakthrough Assessment

8/10

Effective adaptation of ControlNet-like residual injection for identity preservation in DiTs (FLUX), solving the quality degradation issues of attention-based injection methods.

⚙️ Technical Details

Problem Definition

Setting: Tuning-free identity-preserved text-to-image generation

Inputs: Text prompt c_text, Reference identity image c_id, Optional control image (e.g., pose keypoints)

Outputs: Generated image x_0 adhering to text prompt and preserving identity c_id

Pipeline Flow

Face Encoder (extracts identity features)
Projection Network (projects features to latent space)
InfuseNet (Parallel DiT branch processing identity/control)
Base DiT Model (FLUX, frozen, processes text/noise)
Residual Injection (InfuseNet adds features to Base Model blocks)
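
The flow above can be sketched with a toy example. This is a minimal illustration in plain Python, not the authors' implementation: the block functions, the vector width, and the rule that maps each InfuseNet block to a group of base blocks are hypothetical stand-ins, chosen only to show how a frozen base model can receive identity information purely as additive residuals.

```python
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def make_block(scale: float) -> Block:
    """Hypothetical stand-in for a transformer block: a simple element-wise map."""
    return lambda h: [scale * x for x in h]


def run_with_residuals(
    hidden: Vector,
    identity: Vector,
    base_blocks: List[Block],
    infuse_blocks: List[Block],
) -> Vector:
    """Run frozen base blocks, adding one side-branch residual per group of blocks.

    Assumption (for illustration): the number of base blocks is a multiple of
    the number of side-branch blocks, and side block j feeds every base block
    in group j.
    """
    if len(base_blocks) % len(infuse_blocks) != 0:
        raise ValueError("base block count must be a multiple of side block count")
    group = len(base_blocks) // len(infuse_blocks)

    # The side branch starts from the hidden state combined with identity features.
    side = add(hidden, identity)
    residuals = []
    for blk in infuse_blocks:
        side = blk(side)
        residuals.append(side)

    # The base blocks themselves are never modified; they only receive additions.
    for i, blk in enumerate(base_blocks):
        hidden = add(blk(hidden), residuals[i // group])
    return hidden


base = [make_block(1.0) for _ in range(4)]
zero_branch = [make_block(0.0) for _ in range(2)]
# With an all-zero side branch the base path is unchanged.
out = run_with_residuals([1.0, 2.0], [0.5, 0.5], base, zero_branch)
```

In this toy setup the base blocks behave identically with or without the side branch attached, which is the property that separates residual injection from approaches that edit the attention layers directly.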

System Modules

Face Identity Encoder (Input Processing)

Extract identity embeddings from the reference face image

Model or implementation: Frozen face recognition model (not named in this summary; an ArcFace-style encoder is a plausible but unconfirmed choice)

Projection Network (Input Processing)

Project identity embeddings into the dimension required by InfuseNet

Model or implementation: MLP / Projection layers

InfuseNet

Process identity and optional spatial controls to generate residual updates for the base model

Model or implementation: DiT-based structure (smaller copy of base model)

Base DiT Model

Denoise the latent image conditioned on text

Model or implementation: FLUX.1-dev (Frozen)

Novel Architectural Elements

InfuseNet: A generalized ControlNet-like branch specifically for non-spatial identity features
Pure residual injection strategy for identity preservation, completely avoiding attention-layer modification (IP-Adapter)

Modeling

Base Model: FLUX.1-dev

Training Method: Multi-stage training (Pretraining + SFT) with Conditional Flow Matching loss

Objective Functions:

Purpose: Regress the predicted velocity field v_theta onto the target velocity (x_1 - x_0) along the path between the two endpoints.

Formally: L_CFM = E[ || v_theta(x_t, t) - (x_1 - x_0) ||^2 ]
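
To make the objective concrete, here is a minimal sketch in plain Python. It assumes the linear (rectified-flow) interpolation x_t = (1 - t) x_0 + t x_1 between the two endpoints, which the formula above does not state explicitly, and it omits the text and identity conditioning that the real model receives. The function names are illustrative only, and a single call is one Monte Carlo sample of the expectation.

```python
import random
from typing import Callable, List

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path between x0 and x1 (assumed linear schedule)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def cfm_loss(
    v_theta: Callable[[Vector, float], Vector],
    x0: Vector,
    x1: Vector,
    t: float,
) -> float:
    """Squared error between the predicted velocity and the target x1 - x0."""
    x_t = interpolate(x0, x1, t)
    target = [b - a for a, b in zip(x0, x1)]
    pred = v_theta(x_t, t)
    return sum((p - g) ** 2 for p, g in zip(pred, target))


# Toy check: a predictor that already outputs the true velocity has zero loss.
rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(4)]
x1 = [rng.gauss(0.0, 1.0) for _ in range(4)]
true_velocity = [b - a for a, b in zip(x0, x1)]
loss = cfm_loss(lambda x_t, t: true_velocity, x0, x1, t=0.3)
```

Because the target x_1 - x_0 does not depend on t under this interpolation, the ideal velocity is constant along each path, which is what makes the paths straight.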

Training Data:

Stage 1 (Pretraining): Real Single-Person-Single-Sample (SPSS) data from human portrait datasets
Stage 2 (SFT): Synthetic Single-Person-Multiple-Sample (SPMS) data generated by the Stage-1 model + off-the-shelf plugins (LoRAs, face swap)
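
The difference between the two data formats can be sketched as a small data-structure example. The file names are made up, and the pairing rule (the reference image serves as its own target in SPSS, while in SPMS a real reference is paired with each synthetic sample of the same person) is an assumption consistent with the format names, not a description of the authors' data pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TrainingPair:
    reference: str  # image that supplies the identity
    target: str     # image the model learns to generate


def spss_pairs(portraits: List[str]) -> List[TrainingPair]:
    """Single-Person-Single-Sample: each portrait is both reference and target."""
    return [TrainingPair(reference=p, target=p) for p in portraits]


def spms_pairs(samples: Dict[str, List[str]]) -> List[TrainingPair]:
    """Single-Person-Multiple-Sample: one real reference, many synthetic targets."""
    return [
        TrainingPair(reference=real, target=synthetic)
        for real, synthetic_images in samples.items()
        for synthetic in synthetic_images
    ]


# Hypothetical file names, for illustration only.
stage1 = spss_pairs(["alice.png", "bob.png"])
stage2 = spms_pairs({"alice.png": ["alice_anime.png", "alice_older.png"]})
```

Under this reading, Stage 2 pairs never let the model succeed by copying the reference pixels, since the target differs from the reference in style or attributes.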

Compute: Not reported in the paper

Comparison to Prior Work

vs. PuLID-FLUX: InfU uses residual injection instead of attention modification, resulting in better text alignment and less 'copy-paste' effect
vs. InstantX IP-Adapter: InfU is tailored specifically to faces through InfuseNet, offering higher identity fidelity
vs. Standard ControlNet: InfU generalizes ControlNet to handle non-spatial identity embeddings alongside spatial controls

Limitations

Relies on a synthetic data generation pipeline that is time-consuming and complex to set up
Performance depends heavily on the quality of the base FLUX model
Does not report specific inference latency or memory overhead compared to base FLUX

Reproducibility

Code: https://github.com/bytedance/InfiniteYou

Code and model weights are publicly available in the repository above. The paper details the multi-stage data generation process but does not specify the exact size of the datasets or the GPU hours used.

📊 Experiments & Results

Evaluation Setup

Identity-preserved generation using specific prompts and identity images

Benchmarks:

Comparison with Baselines (Qualitative and Quantitative Identity Preservation)

Metrics:

Identity Similarity (ID Score)
Text-Image Alignment (CLIP Score)
Aesthetic Quality

Statistical methodology: Not explicitly reported in the paper
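
Identity similarity and CLIP-style text-image alignment, as listed under Metrics, are commonly computed as a cosine similarity between embedding vectors. The sketch below shows that computation in plain Python. It is an assumption that the paper's scores take this form, and the embeddings are placeholder numbers rather than outputs of a real face or CLIP encoder.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder embeddings standing in for face-encoder outputs.
reference_face = [1.0, 0.0, 0.0]
generated_face = [0.0, 1.0, 0.0]
id_score = cosine_similarity(reference_face, generated_face)
```

Orthogonal embeddings, as in this placeholder pair, score zero, while identical directions score one regardless of vector length.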

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Internal Test Set	Identity Score	Not reported in the paper	Not reported in the paper	Not reported in the paper

Experiment Figures

Visual comparison of artifacts. (a) SDXL vs FLUX quality. (b) IP-Adapter vs InfU architecture effects.

Main Takeaways

InfU generates images with higher aesthetic quality and better text alignment than IPA-based methods (PuLID, InstantX)
The residual injection mechanism avoids the 'face copy-paste' look common in baselines, blending the identity more naturally with lighting and style
Multi-stage training with synthetic SPMS data is crucial for improving editability and robustness compared to training only on real SPSS data

📚 Prerequisite Knowledge

Prerequisites

Diffusion Transformers (DiT) architecture
Rectified Flow matching
ControlNet architecture principles
IP-Adapter (Image Prompt Adapter) mechanism

Key Terms

DiT: Diffusion Transformer—a generative model architecture replacing the U-Net backbone with Transformers

FLUX: A state-of-the-art Diffusion Transformer model using rectified flow matching

InfuseNet: The proposed parallel network branch that injects identity features into the base model via residual connections

SPMS: Single-Person-Multiple-Sample—a data format where one real identity is paired with multiple diverse synthetic images of that same identity

IPA: IP-Adapter—a common method for identity injection that modifies the cross-attention layers of the diffusion model

Rectified Flow: A method for training generative models by defining straight paths between noise and data distributions, used by FLUX

SFT: Supervised Fine-Tuning—further training a pre-trained model on high-quality data to improve specific capabilities