Evaluation Setup
Qualitative and quantitative comparison on text-guided image manipulation tasks (motion, background, style)
Benchmarks:
- Custom dataset (Ted from Imagic + LAION) (Image editing)
Metrics:
- CLIP Score (Semantic Alignment)
- Identity Preservation (L2 distance/Similarity)
- User Study (1-5 scale for Identity and Semantic alignment)
- Training Time
- Statistical methodology: User study with 20 participants via Google Forms
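The two automatic metrics above can be illustrated with a minimal sketch. It assumes that the CLIP Score is a cosine similarity between an image embedding and a text embedding, and that identity preservation is an L2 distance between two feature vectors; the summary does not spell out the exact definitions. The function names and the toy vectors are hypothetical stand-ins for real model outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy embeddings (assumption): real ones would come from an image/text encoder.
image_embedding = [1.0, 0.0]
text_embedding = [1.0, 1.0]
alignment = cosine_similarity(image_embedding, text_embedding)

# Toy features of the source image and the edited image (assumption).
source_features = [0.0, 0.0]
edited_features = [3.0, 4.0]
identity_gap = l2_distance(source_features, edited_features)
```

Under these assumptions, a higher `alignment` means the edit follows the prompt more closely, while a lower `identity_gap` means the subject changed less.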
Key Results
Quantitative comparison shows HiPer outperforms Stable Diffusion-based baselines in user preference and achieves competitive CLIP scores with significantly faster training.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| User Study | Semantic Alignment (1-5) | 3.731 | 4.520 | +0.789 |
| User Study | Identity Preservation (1-5) | 3.251 | 4.099 | +0.848 |
| Training Time | Minutes | 14.08 | 3.0 | -11.08 |
| CLIP Score | Text Alignment | 0.1955 | 0.2047 | +0.0092 |
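As a quick sanity check on the reported numbers, the Δ values can be recomputed as "This Paper" minus "Baseline". The sketch below copies the figures from the results above; the row labels and the helper name `delta` are illustrative only.

```python
# Rows copied from the reported results: (label, baseline, this paper).
rows = [
    ("user_study_semantic_alignment", 3.731, 4.520),
    ("user_study_identity_preservation", 3.251, 4.099),
    ("training_time_minutes", 14.08, 3.0),
    ("clip_score", 0.1955, 0.2047),
]


def delta(baseline, ours, digits=4):
    """Signed difference (ours - baseline), rounded to hide float noise."""
    return round(ours - baseline, digits)


deltas = {label: delta(baseline, ours) for label, baseline, ours in rows}

# Ratio of baseline training time to HiPer training time.
speedup = 14.08 / 3.0

for label, value in deltas.items():
    print(f"{label}: {value:+.4f}")
print(f"training speedup: {speedup:.2f}x")
```

The training-time row is the only one where a negative Δ is the improvement; expressed as a ratio, the baseline takes roughly 4.7 times as long to train.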
Main Takeaways
- HiPer effectively separates semantic content (head) from identity (tail), allowing diverse edits (motion, style) without losing the subject.
- Increasing the number of personalized tokens (N) improves identity preservation but reduces editability (overfitting); N=5 gives the best trade-off.
- Cross-attention analysis confirms the 'tail' tokens in standard CLIP embeddings are uninformative, making them ideal candidates for carrying personalized identity info without interfering with the prompt's semantic structure.
- Imagic (when run on Stable Diffusion) suffers from poor identity preservation due to embedding interpolation; HiPer's concatenation strategy preserves structure better.
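The contrast between concatenation and interpolation in the takeaways above can be sketched on toy data. This is not the authors' implementation: the sequence length (8), embedding width (2), the linear form of the interpolation, and the names `concat_head_tail` and `interpolate` are all assumptions made for illustration. Only N=5 is taken from the text.

```python
def concat_head_tail(prompt_tokens, personalized_tail):
    """Keep the prompt's head tokens; overwrite the last N slots with the tail."""
    n = len(personalized_tail)
    if n > len(prompt_tokens):
        raise ValueError("tail is longer than the token sequence")
    head = prompt_tokens[: len(prompt_tokens) - n]
    return head + personalized_tail


def interpolate(target_tokens, optimized_tokens, eta):
    """Blend every token: eta * target + (1 - eta) * optimized (assumed form)."""
    return [
        [eta * t + (1.0 - eta) * o for t, o in zip(target_vec, optimized_vec)]
        for target_vec, optimized_vec in zip(target_tokens, optimized_tokens)
    ]


# Toy prompt embedding: 8 token slots, 2 dimensions each (assumption).
prompt = [[float(i), 0.0] for i in range(8)]

# N = 5 personalized tail tokens; in practice these would be learned.
tail = [[-1.0, -1.0] for _ in range(5)]

# Concatenation: the 3 head tokens keep the prompt's values exactly.
mixed = concat_head_tail(prompt, tail)

# Interpolation: every token, including the head, is altered by the blend.
optimized = [[0.0, 2.0] for _ in range(8)]
blended = interpolate(prompt, optimized, eta=0.5)
```

In this toy setting, concatenation leaves the head tokens untouched, so the prompt's semantics and the personalized identity occupy disjoint slots. Interpolation changes every slot at once, which is one way to read the identity loss attributed to Imagic above.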