MLLM: Multi-modal Large Language Model, an AI model capable of processing and generating both text and images (e.g., SEED-X, Emu)
DiT: Diffusion Transformer, a diffusion model backbone that uses a Transformer architecture instead of the traditional U-Net, used here as a high-fidelity image detokenizer
Detokenizer: A component that converts discrete or continuous image tokens (produced by the MLLM) back into a high-resolution pixel image
SEED-X: The specific MLLM architecture used as the backbone, which unifies multi-granularity comprehension and generation
ArcFace: A face recognition model whose embeddings are used to compute identity similarity scores between generated faces and reference faces
Personalization: Generating images of a specific subject (e.g., a specific person's face) in different contexts based on text prompts
Instruction Fine-Tuning: Training the model on datasets of (instruction, output) pairs to improve its ability to follow user commands
Chat-History Caching: Storing previous conversation turns (text and images) in memory so the model can attend to them during the current generation step
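
As an illustration of the identity similarity score mentioned in the ArcFace entry: such scores are commonly computed as the cosine similarity between two face embeddings. The sketch below assumes that convention, which this glossary does not confirm. The function name `identity_similarity` and the toy 4-dimensional vectors are hypothetical; real embeddings would come from a face recognition model and have many more dimensions.

```python
import math


def identity_similarity(emb_a, emb_b):
    """Cosine similarity between two face embeddings (sequences of floats).

    Assumption: identity similarity is measured as cosine similarity;
    the glossary itself does not state the exact formula.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings of a reference face and a generated face.
reference = [0.5, 0.1, -0.3, 0.8]
generated = [0.4, 0.2, -0.2, 0.9]
score = identity_similarity(reference, generated)
print(f"identity similarity: {score:.3f}")
```

A score near 1 means the two embeddings point in almost the same direction, which under this assumption is read as the generated face preserving the reference identity; a score near 0 means the embeddings are unrelated.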