InfiniteYou injects identity features into DiT-based models like FLUX via a separate residual branch (InfuseNet) rather than modifying attention layers, enhancing identity preservation without compromising generation quality.
Core Problem
Existing identity-preservation methods for DiTs (like FLUX) rely on modifying attention layers via IP-Adapters, which degrades text alignment, aesthetics, and base model generation capabilities.
Why it matters:
Current methods struggle with 'face copy-paste' artifacts where the identity is preserved but the image looks unnatural or poorly aligned with the text prompt
State-of-the-art DiT models like FLUX offer superior generation quality over U-Nets (SDXL), but effective identity-injection modules for them are scarce
Modifying attention layers directly (standard practice) entangles text and identity control, causing conflict and reducing the model's aesthetic quality
Concrete Example: When asking for 'a woman wearing a VR headset' with a specific identity, standard IPA-based methods might paste the face awkwardly or ignore the headset to preserve the face. InfiniteYou generates the headset correctly while keeping the identity natural.
Key Novelty
InfuseNet: A Parallel Residual Identity Branch
Instead of modifying the base model's attention layers (like IP-Adapter), InfuseNet runs as a parallel branch that injects identity features solely through residual connections
Treats identity injection as a control signal (similar to ControlNet) rather than a texture override, disentangling it from the text prompts processed by the base model
Uses a multi-stage training strategy with synthetic Single-Person-Multiple-Sample (SPMS) data to teach the model robust identity preservation across diverse styles
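The residual-injection idea above can be illustrated with a toy sketch. This is not the InfuseNet implementation: the blocks are scalar stand-ins, the one-residual-per-block wiring is a simplification, and the zero-initialised output projection is borrowed from ControlNet convention as an assumption. All function names are hypothetical. The point is only that the base blocks are never edited, and that a zero projection leaves the base model's output unchanged.

```python
import random


def base_block(hidden, weight):
    """Stand-in for one frozen base-model block: an elementwise affine map."""
    return [weight * h + 0.1 for h in hidden]


def infuse_block(identity, out_proj, width):
    """Stand-in for one identity-branch block: maps identity features to a residual.

    out_proj plays the role of a learnable output projection. Setting it to
    zero (a ControlNet-style convention, assumed here) makes the residual vanish.
    """
    pooled = sum(identity) / len(identity)
    return [out_proj * pooled for _ in range(width)]


def forward(hidden, identity, base_weights, infuse_projs):
    """Run the base blocks, adding one identity residual before each block."""
    for weight, proj in zip(base_weights, infuse_projs):
        residual = infuse_block(identity, proj, len(hidden))
        hidden = [h + r for h, r in zip(hidden, residual)]  # residual injection
        hidden = base_block(hidden, weight)                 # base block is untouched
    return hidden


random.seed(0)
latent = [random.uniform(-1, 1) for _ in range(4)]
identity_features = [0.5, 1.5, 1.0]  # hypothetical identity embedding
base_weights = [0.9, 1.1, 1.0]

# Zero projections: the identity branch contributes nothing.
no_identity = forward(latent, identity_features, base_weights, [0.0, 0.0, 0.0])
# Non-zero projections: identity shifts the hidden states through residuals only.
with_identity = forward(latent, identity_features, base_weights, [0.2, 0.2, 0.2])
```

In this toy setup the text pathway (the base blocks) and the identity pathway (the residuals) stay separate, which is the disentanglement the summary attributes to InfuseNet.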
Architecture
The overall framework of InfiniteYou (InfU) showing the InfuseNet parallel branch interacting with the frozen FLUX base model.
Evaluation Highlights
Achieves higher identity similarity (Identity Score) compared to PuLID-FLUX and InstantX IP-Adapter on benchmark tests
Significant qualitative improvements in text-image alignment and aesthetic quality over IP-Adapter methods, which often degrade into copy-paste artifacts
Successfully disentangles identity from style, allowing flexible recrafting (e.g., changing age, accessories) where baselines fail
Breakthrough Assessment
8/10
Effective adaptation of ControlNet-like residual injection for identity preservation in DiTs (FLUX), solving the quality degradation issues of attention-based injection methods.
Code and model weights are publicly available at https://github.com/bytedance/InfiniteYou. The paper details the multi-stage data generation process but does not specify the exact size of the datasets or GPU hours used.
📊 Experiments & Results
Evaluation Setup
Identity-preserved generation using specific prompts and identity images
Benchmarks:
Comparison with Baselines (Qualitative and Quantitative Identity Preservation)
Metrics:
Identity Similarity (ID Score)
Text-Image Alignment (CLIP Score)
Aesthetic Quality
Statistical methodology: Not explicitly reported in the paper
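The summary does not say how the ID Score is computed. A common choice for identity similarity is the cosine similarity between face embeddings of the reference image and the generated image; the sketch below assumes that definition. The embedding vectors are made-up placeholders, and in practice they would come from a face recognition model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


reference_face = [0.2, 0.9, -0.4]   # hypothetical embedding of the identity image
generated_face = [0.25, 0.8, -0.5]  # hypothetical embedding of the generated image
id_score = cosine_similarity(reference_face, generated_face)
```

CLIP Score is typically computed the same way, with a CLIP text embedding of the prompt and a CLIP image embedding of the output in place of the two face embeddings.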
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
Internal Test Set | Identity Score | Not reported in the paper | Not reported in the paper | Not reported in the paper
Experiment Figures
Visual comparison of artifacts. (a) SDXL vs FLUX quality. (b) IP-Adapter vs InfU architecture effects.
Main Takeaways
InfU generates images with higher aesthetic quality and better text alignment than IPA-based methods (PuLID, InstantX)
The residual injection mechanism avoids the 'face copy-paste' look common in baselines, blending the identity more naturally with lighting and style
Multi-stage training with synthetic SPMS data is crucial for improving editability and robustness compared to training only on real single-person-single-sample (SPSS) data
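The difference between the two data formats can be sketched as a pairing rule. This is an illustrative reading, not the paper's data pipeline: it assumes that with single-sample data each real image acts as both identity source and target, while SPMS pairs one real identity image with several synthetic targets of the same person. The class, function names, and file names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingPair:
    identity_image: str  # image the identity features are extracted from
    target_image: str    # image the model learns to generate
    prompt: str          # text prompt describing the target


def spss_pairs(real_samples):
    """Single-sample pairs: each real image serves as its own target (assumed)."""
    return [TrainingPair(path, path, prompt) for path, prompt in real_samples]


def spms_pairs(real_identity, synthetic_samples):
    """Multiple-sample pairs: one real identity image, many synthetic targets."""
    return [TrainingPair(real_identity, path, prompt)
            for path, prompt in synthetic_samples]


real = [("alice_real.jpg", "a portrait photo of a woman")]
synthetic = [
    ("alice_vr.jpg", "a woman wearing a VR headset"),
    ("alice_oil.jpg", "an oil painting of a woman"),
]

stage1 = spss_pairs(real)                        # reconstruction-style supervision
stage2 = spms_pairs("alice_real.jpg", synthetic)  # identity held fixed, content varied
```

Under this reading, the single-sample pairs can be satisfied by copying the input face, whereas the SPMS pairs require keeping the identity while changing pose, style, or accessories, which is consistent with the editability gain reported above.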
📚 Prerequisite Knowledge
Prerequisites
Diffusion Transformers (DiT) architecture
Rectified Flow matching
ControlNet architecture principles
IP-Adapter (Image Prompt Adapter) mechanism
Key Terms
DiT: Diffusion Transformer—a generative model architecture replacing the U-Net backbone with Transformers
FLUX: A state-of-the-art Diffusion Transformer model using rectified flow matching
InfuseNet: The proposed parallel network branch that injects identity features into the base model via residual connections
SPMS: Single-Person-Multiple-Sample—a data format where one real identity is paired with multiple diverse synthetic images of that same identity
IPA: IP-Adapter—a common method for identity injection that modifies the cross-attention layers of the diffusion model
Rectified Flow: A method for training generative models by defining straight paths between noise and data distributions, used by FLUX
SFT: Supervised Fine-Tuning—further training a pre-trained model on high-quality data to improve specific capabilities
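The Rectified Flow entry above can be made concrete with a small sketch. It assumes the convention in which t = 0 is data and t = 1 is noise, with the straight path x_t = (1 - t) * data + t * noise and the constant velocity target noise - data; other sources flip the direction. The oracle velocity function is a hypothetical stand-in for a trained network.

```python
def interpolate(data, noise, t):
    """Point on the straight path between data (t=0) and noise (t=1)."""
    return [(1.0 - t) * x + t * n for x, n in zip(data, noise)]


def velocity_target(data, noise):
    """Regression target for the velocity field: constant along the path."""
    return [n - x for x, n in zip(data, noise)]


def euler_sample(noise, velocity_fn, steps):
    """Integrate from t=1 (noise) down to t=0 (data) with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


data = [1.0, -2.0, 0.5]
noise = [0.3, 0.7, -1.1]


def oracle(x, t):
    """Hypothetical perfect model: returns the true velocity for this pair."""
    return velocity_target(data, noise)


recovered = euler_sample(noise, oracle, steps=4)
midpoint = interpolate(data, noise, 0.5)
```

Because the path is straight, a perfect velocity model recovers the data sample in very few Euler steps, which is the practical appeal of rectified flow.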