InstantID: Zero-shot Identity-Preserving Generation in Seconds

Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen
InstantX Team, Xiaohongshu Inc, Peking University
arXiv.org (2024)
MM P13N

📝 Paper Summary

Personalized Image Generation · Identity Preservation · Subject-driven Image Generation
InstantID preserves facial identity with high fidelity, without test-time fine-tuning, by injecting face ID embeddings as a strong semantic condition through a decoupled cross-attention adapter and adding weak spatial control through a ControlNet-like module (IdentityNet).
Core Problem
Existing personalized generation methods either require lengthy per-subject fine-tuning (DreamBooth, LoRA) or give up facial fidelity or text editability (IP-Adapter), because CLIP image features provide only a weak identity signal.
Why it matters:
  • Real-world applications like AI portraits and e-commerce require instant results without high storage or training costs per user
  • Current tuning-free methods using CLIP embeddings capture style/composition but fail to preserve intricate facial identity details
  • Fine-tuning methods struggle with single-reference scenarios and cannot be easily deployed for mass usage
Concrete Example: When IP-Adapter is given a single face reference, the generated image may capture the overall look or hair color but lose the specific facial identity. Conversely, LoRA requires training on multiple images, which is slow and storage-heavy.
Key Novelty
InstantID (IdentityNet + Decoupled ID Adapter)
  • Replaces weak CLIP vision embeddings with strong semantic face ID embeddings from a specialized face recognition model (antelopev2) to capture identity details
  • Uses a ControlNet-like module (IdentityNet) conditioned only on facial landmarks (spatial) and ID embeddings (semantic), with no text prompt, which prevents non-ID attributes from leaking into the identity condition
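The decoupled cross-attention named in the summary can be sketched in a few lines. The idea, introduced by IP-Adapter and reused here with ID embeddings, is that the image queries attend to text tokens and to ID tokens in two separate attention passes whose outputs are summed: Z = Attention(Q, K_text, V_text) + λ · Attention(Q, K_id, V_id). The block below is a toy illustration in dependency-free Python, not the authors' implementation: every function name and the `id_scale` parameter (standing in for λ) is hypothetical, the learned projection matrices are omitted, and the inputs are assumed to be already projected into keys and values.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v)

def decoupled_cross_attention(q, text_kv, id_kv, id_scale=1.0):
    """Run text and ID attention separately on the same queries, then sum.

    text_kv and id_kv are (keys, values) pairs; id_scale is a hypothetical
    name for the weight on the ID branch.
    """
    text_out = attention(q, *text_kv)
    id_out = attention(q, *id_kv)
    return [[t + id_scale * i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(text_out, id_out)]

# Toy inputs: 2 image-query tokens, 3 text tokens, 1 ID token, feature dim 2.
q = [[1.0, 0.0], [0.0, 1.0]]
text_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
id_k = [[0.5, 0.5]]
id_v = [[10.0, 20.0]]

out = decoupled_cross_attention(q, (text_k, text_v), (id_k, id_v), id_scale=1.0)
# With a single ID token, the ID branch adds id_v[0] to every query's output.
print(out)
```

Setting `id_scale` to 0 recovers the text-only branch, which suggests how a weight of this kind could trade identity strength against prompt adherence. Keeping the two branches separate also means the text pathway of the base model is left untouched.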
Architecture
Figure 2: The overall pipeline of InstantID, showing the dual-branch injection of identity information.
Evaluation Highlights
  • Achieves fidelity competitive with or superior to LoRA-based methods using only a single reference image and no fine-tuning
  • Preserves text editability better than IP-Adapter-FaceID-Plus, allowing attribute changes (e.g., gender, hair color) while keeping identity fixed
  • Demonstrates compatibility with existing ControlNets (canny, depth) and base models (SD1.5, SDXL) as a plug-and-play module
Breakthrough Assessment
9/10
Highly influential practical breakthrough. Solves the trade-off between fidelity and efficiency, enabling instant high-quality personalization that previously required expensive fine-tuning.