
ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models

Yu-xin Zhang, Weiming Dong, Fan Tang, Nisha Huang, Haibin Huang, Chongyang Ma, Tong-Yee Lee, O. Deussen, Changsheng Xu
Chinese Academy of Sciences, University of Konstanz, Kuaishou Technology
ACM Transactions on Graphics (2023)
MM P13N

πŸ“ Paper Summary

Text-to-Image Generation Personalization (P13N)
ProSpect decomposes the textual conditioning of diffusion models into stage-specific embeddings, exploiting the observation that these models generate layout, content, and style at different stages of denoising, progressing from low-frequency to high-frequency information.
Core Problem
Existing personalization methods invert an image into a single static textual embedding, which fails to disentangle specific visual attributes like material, style, and layout.
Why it matters:
  • Using a single embedding for all diffusion steps limits editability, as changing the prompt often alters the entire image structure rather than just the desired attribute
  • Current methods like Textual Inversion or DreamBooth struggle to transfer a specific style (e.g., 'yarn texture') without bleeding content (e.g., 'cat shape') into the new image
  • Fine-tuning models for every concept is computationally expensive and difficult to combine for multi-attribute generation
Concrete Example: When DreamBooth tries to transfer a 'yarn' material from a cat image to a sphere, it tends to generate cat-like objects because it cannot separate the material from the content. ProSpect can apply the yarn texture while respecting the sphere's original shape.
Key Novelty
Prompt Spectrum Space (P*) and ProSpect Inversion
  • Expands the textual conditioning space from a single vector to a sequence of vectors (Prompt Spectrum), where each vector guides a specific stage of the denoising process
  • Leverages the discovery that diffusion models generate attributes in a frequency-based order: layout (low frequency) → content → material/style (high frequency)
  • Uses a hypernetwork to invert a single image into this spectrum of embeddings, allowing independent swapping of layout, content, or style embeddings during generation
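The stage-wise conditioning and attribute swapping described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`stage_index`, `embedding_for_step`, `swap_stages`), the use of equal-width stages, and the choice of 10 stages over 50 steps are all assumptions made for illustration, and plain lists of floats stand in for real token embeddings.

```python
from typing import List, Sequence

Embedding = List[float]  # stand-in for a real token embedding vector


def stage_index(step: int, num_steps: int, num_stages: int) -> int:
    """Map a denoising step (0 = first, noisiest step) to a stage index.

    Assumption: stages are equal-width slices of the denoising schedule.
    """
    if not 0 <= step < num_steps:
        raise ValueError("step out of range")
    return step * num_stages // num_steps


def embedding_for_step(
    spectrum: Sequence[Embedding], step: int, num_steps: int
) -> Embedding:
    """Pick the embedding that conditions the model at this step."""
    return spectrum[stage_index(step, num_steps, len(spectrum))]


def swap_stages(
    base: Sequence[Embedding], donor: Sequence[Embedding], start: int, end: int
) -> List[Embedding]:
    """Return a copy of `base` with stages [start, end) taken from `donor`."""
    if len(base) != len(donor):
        raise ValueError("spectra must have the same number of stages")
    return list(base[:start]) + list(donor[start:end]) + list(base[end:])


# Hypothetical spectra: 10 stage embeddings per inverted image.
sphere = [[0.0] for _ in range(10)]    # image providing layout and content
yarn_cat = [[1.0] for _ in range(10)]  # image providing material/style

# Assumption: the last three stages carry material/style (high frequency).
mixed = swap_stages(sphere, yarn_cat, start=7, end=10)

early = embedding_for_step(mixed, step=0, num_steps=50)   # from `sphere`
late = embedding_for_step(mixed, step=49, num_steps=50)   # from `yarn_cat`
```

Under these assumptions, early steps are conditioned on the sphere's embeddings, so its layout and shape are kept, while only the late, high-frequency steps see the yarn embeddings. Which stages correspond to which attribute would have to be determined empirically for a given model and schedule.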
Architecture
Figure 2: Comparison of standard Textual Inversion space (P) vs. Prompt Spectrum Space (P*).
Breakthrough Assessment
7/10
Offers a clever insight into the temporal dynamics of diffusion generation (attributes emerge stage by stage, from low to high frequency) and uses it to address the entanglement problem in personalization without full model fine-tuning.