Conversational Image Generation: Towards Multi-Round Personalized Generation with Multi-Modal Language Models

📝 Paper Summary

Conversational personalization, Multi-modal generation

This paper enables Multi-modal Large Language Models to perform multi-round personalized image generation by replacing the standard detokenizer with a personalization-enhanced Diffusion Transformer and utilizing a new chat-history caching mechanism.

Core Problem

Existing personalization methods (like DreamBooth or InstantID) operate in single-round settings and lack conversational context, while current MLLMs fail to preserve fine-grained facial identity details due to weak detokenizers.

Why it matters:

Current diffusion models cannot handle multi-turn interactions, forcing users to restart generation tasks from scratch rather than iterating via dialogue
Vanilla MLLMs trained on general data struggle to reconstruct specific human identities, limiting their utility for personalized creative applications
No existing datasets support the development of models that need to reason across interleaved text-image chat history for consistent character generation

Concrete Example: In a chat, a user generates an image of 'Olivia' in Round 1. In Round 2, the user asks for 'a close-up of Olivia' without re-describing her. Standard models fail because they cannot retrieve Olivia's visual features from the Round 1 history to inform the Round 2 generation.

Key Novelty

Conversational MLLM with DiT Detokenizer

Identifies that standard VQGAN-like detokenizers in MLLMs bottleneck identity preservation, and replaces them with a Diffusion Transformer (DiT) specifically fine-tuned on human faces to reconstruct fine details from image tokens
Implements a chat-history caching mechanism that allows the MLLM to attend to past visual outputs and textual descriptions, enabling consistent character generation across multiple dialogue turns without re-prompting

Architecture

Overview of the framework and data pipeline, showing the MLLM processing interleaved text/image tokens and the Visual Decoder reconstructing the image.

Evaluation Highlights

Achieves a 0.293 ArcFace score (identity similarity) in single-round personalization, roughly three times the base SEED-X model's score of 0.094
Human evaluation shows a 73.75% preference for the proposed method's image quality over SEED-X
71.25% human preference win rate for Face Identity preservation compared to the SEED-X baseline

Breakthrough Assessment

8/10

First work to enable true multi-round conversational personalization where an MLLM retrieves visual identity from chat history. Significant architectural improvement (DiT detokenizer) addresses a major MLLM bottleneck.

⚙️ Technical Details

Problem Definition

Setting: Multi-round conversational image generation where the model must generate image I_a based on current text T_t, current visual input I_v, and the full history of past inputs and outputs.

Inputs: Current text prompt T_t, optional current image I_v, and cached chat history {T_k, I_k, Output_k} from previous turns

Outputs: Generated image I_a that aligns with the prompt while maintaining identity consistency with subjects defined in the chat history

Pipeline Flow

Input Processing (Chat History + Current Prompt)
MLLM Reasoning (LLaMA-2)
Visual Detokenization (Personalization-Enhanced DiT)
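
The three stages above, together with the chat-history cache, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `Turn`, `ChatHistoryCache`, and `run_round` are hypothetical names, and `encode`, `mllm`, and `detokenize` are placeholder callables standing in for the Qwen-VL encoder, the LLaMA-2 backbone, and the DiT detokenizer.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

ImageTokens = List[List[float]]  # stand-in for the image tokens of one image
ContextItem = Tuple[str, object]  # ("text", str) or ("image", ImageTokens)


@dataclass
class Turn:
    """One cached round: user text, optional user image, and the model output."""
    text: str
    input_image_tokens: Optional[ImageTokens]
    output_image_tokens: Optional[ImageTokens]


@dataclass
class ChatHistoryCache:
    """Hypothetical cache that keeps every previous round in order."""
    turns: List[Turn] = field(default_factory=list)

    def as_context(self) -> List[ContextItem]:
        """Flatten the history into one interleaved text/image sequence."""
        context: List[ContextItem] = []
        for turn in self.turns:
            context.append(("text", turn.text))
            if turn.input_image_tokens is not None:
                context.append(("image", turn.input_image_tokens))
            if turn.output_image_tokens is not None:
                context.append(("image", turn.output_image_tokens))
        return context


def run_round(
    cache: ChatHistoryCache,
    text: str,
    image: Optional[object],
    encode: Callable[[object], ImageTokens],
    mllm: Callable[[List[ContextItem]], ImageTokens],
    detokenize: Callable[[ImageTokens], object],
) -> object:
    """Run one round: build the context, predict image tokens, decode, cache."""
    input_tokens = encode(image) if image is not None else None
    context = cache.as_context() + [("text", text)]
    if input_tokens is not None:
        context.append(("image", input_tokens))
    predicted_tokens = mllm(context)              # MLLM reasoning stage
    generated = detokenize(predicted_tokens)      # visual detokenization stage
    cache.turns.append(Turn(text, input_tokens, predicted_tokens))
    return generated
```

Because each round appends its own output tokens to the cache, a later prompt that only mentions a name still reaches the model together with the earlier image of that person.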

System Modules

Visual Encoder

Encodes input images into visual embeddings/tokens

Model or implementation: Qwen-VL Image Encoder (frozen)

Multi-Modal LLM

Processes interleaved text and image tokens to predict next tokens (autoregressive generation)

Model or implementation: LLaMA-2 (from SEED-X, fine-tuned)

Visual Detokenizer

Reconstructs the final high-fidelity image from the MLLM's predicted image tokens

Model or implementation: Diffusion Transformer (DiT), fine-tuned on human faces

Novel Architectural Elements

Replacement of SDXL-based reconstruction detokenizer with a fine-tuned Diffusion Transformer (DiT) specifically for MLLM visual decoding
Integration of chat-history caching in personalization tasks to enable retrieval of visual identities from previous turns

Modeling

Base Model: SEED-X (LLaMA-2 backbone + Qwen-VL encoder)

Training Method: Multi-stage Instruction Fine-tuning

Objective Functions:

Purpose: Optimize text token prediction.
Formally: Cross-entropy loss on text tokens

Purpose: Optimize image token prediction.
Formally: Regression loss (MSE) on continuous image embeddings

Purpose: Optimize detokenizer reconstruction.
Formally: Diffusion noise prediction loss on DiT
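
For reference, the three objectives can be written in their usual textbook forms. The notation is an assumption made for illustration and is not taken from the paper: w_i are text tokens, c is the interleaved context, z_n and \hat{z}_n are target and predicted image embeddings (N = 64 per image), x_t is the noised image at timestep t, and \epsilon_\phi is the DiT noise predictor conditioned on the image tokens z. How the terms are weighted or scheduled across training stages is not stated in this summary.

```latex
\begin{aligned}
\mathcal{L}_{\text{text}} &= -\sum_{i} \log p_\theta\!\left(w_i \mid w_{<i},\, c\right) \\
\mathcal{L}_{\text{img}} &= \frac{1}{N} \sum_{n=1}^{N} \left\lVert \hat{z}_n - z_n \right\rVert_2^2 \\
\mathcal{L}_{\text{DiT}} &= \mathbb{E}_{x_0,\, \epsilon,\, t} \left\lVert \epsilon - \epsilon_\phi\!\left(x_t, t, z\right) \right\rVert_2^2
\end{aligned}
```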

Adaptation: Fine-tuning of LLaMA parameters and DiT detokenizer weights

Trainable Parameters: LLaMA backbone (for instruction following) and DiT (for reconstruction)

Training Data:

Single-round: Personalization dataset from [13]
Multi-round: 92,471 samples derived from video clips. Pairs first frame (captioned + named) with last frame (target for personalization).
Full-body: 24,793 synthetic training samples filtered by ArcFace
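
The multi-round pairing described above (first frame captioned and named, last frame as the target) might be assembled roughly as below. This is a sketch only: the `Clip` record, the `build_multi_round_sample` helper, and both prompt templates are hypothetical, since the summary does not give the exact sample format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Clip:
    """Hypothetical record for one preprocessed video clip."""
    frames: List[str]  # frame file paths in temporal order
    caption: str       # caption of the first frame
    name: str          # name assigned to the person in the clip


def build_multi_round_sample(clip: Clip) -> Dict[str, Dict[str, str]]:
    """Pair the named, captioned first frame with the last frame as target."""
    if len(clip.frames) < 2:
        raise ValueError("need at least two frames to form a pair")
    return {
        # Round 1 introduces the person by name alongside the first frame.
        "round_1": {
            "text": f"{clip.caption} This is {clip.name}.",  # assumed template
            "image": clip.frames[0],
        },
        # Round 2 refers to the person by name only; the last frame is the target.
        "round_2": {
            "text": f"Generate a photo of {clip.name}.",     # assumed template
            "target": clip.frames[-1],
        },
    }
```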

Key Hyperparameters:

image_tokens: 64 (pooled from 256)
input_resolution: 512x512 (implied by video filtering)
detokenizer_initialization: Initialized with a text-to-image model [30]
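
To make the `image_tokens` entry concrete, here is one way 256 tokens could be reduced to 64. The summary only says the tokens are "pooled from 256"; treating them as a 16x16 grid and applying non-overlapping 2x2 average pooling is an assumption made for this sketch, and `pool_tokens` is a hypothetical helper.

```python
from typing import List

Vector = List[float]


def pool_tokens(tokens: List[Vector], grid: int = 16, window: int = 2) -> List[Vector]:
    """Average-pool a grid x grid token map with non-overlapping windows."""
    if len(tokens) != grid * grid:
        raise ValueError("token count must equal grid * grid")
    if grid % window != 0:
        raise ValueError("window must divide grid evenly")
    dim = len(tokens[0])
    pooled: List[Vector] = []
    for row in range(0, grid, window):
        for col in range(0, grid, window):
            acc = [0.0] * dim
            # Sum the window x window neighbourhood in row-major order.
            for dr in range(window):
                for dc in range(window):
                    token = tokens[(row + dr) * grid + (col + dc)]
                    for d in range(dim):
                        acc[d] += token[d]
            pooled.append([value / (window * window) for value in acc])
    return pooled
```

With the defaults, 256 input tokens yield (16 / 2)^2 = 64 pooled tokens, matching the count listed above.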

Compute: Not reported in the paper

Comparison to Prior Work

vs. SEED-X: Replaces SDXL detokenizer with DiT; adds multi-round chat history reasoning
vs. Emu2: Achieves better identity preservation in conflicting contexts (e.g., pirate captain vs. female face)
vs. PhotoMaker/InstantID: Enables conversational interactions and multi-round context, whereas these are single-round only
vs. DreamBooth: Does not require per-subject fine-tuning; personalization is handled through context (this comparison is implied rather than cited in the paper)

Limitations

Detokenizer still struggles with perfect reconstruction compared to VQGANs due to limited token count (64 tokens)
Full-body multi-round personalization shows reduced face preservation compared to close-up due to synthetic training data noise
Evaluation relies heavily on human study and specific datasets; no standard public benchmark for multi-round personalization exists
Model struggles with complex prompts that require significant face changes (e.g., 'sticking out tongue') in Stage 2 training

Reproducibility

Code: https://github.com/Haochen-Zhang/Conversational-Personalization

Code is publicly available. The dataset construction method (video clips -> name-based pairs) is described in detail. Specific hyperparameters for learning rate or training duration are not provided in the main text. Relies on pre-trained SEED-X and Qwen-VL checkpoints.

📊 Experiments & Results

Evaluation Setup

Single-round and multi-round personalization tasks. Single-round uses a subset of [13] (400 samples). Multi-round uses a custom video-based dataset.

Benchmarks:

Single-Round Personalization Subset (Subject-driven generation)
Name-Based Multi-Round Personalization Dataset (Multi-turn conversational generation) [New]

Metrics:

ArcFace Score (Identity Preservation)
CLIP Score (Text Alignment)
PSNR (Reconstruction Quality)
Human Evaluation (Quality, Alignment, Face ID)
Statistical methodology: Not explicitly reported in the paper
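
Two of the metrics above reduce to short formulas. The sketch below assumes that the ArcFace score is the cosine similarity between face embeddings of the generated and reference images (the common convention; the embeddings themselves would come from an ArcFace model, which is not shown) and that PSNR uses the standard definition over pixel values with a maximum of 255. Both function names are illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def psnr(reference: Sequence[float], reconstruction: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher is better for both: a cosine similarity closer to 1 means the generated face is closer to the reference identity, and a higher PSNR means a more faithful reconstruction.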

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Detokenizer ablation comparing the proposed DiT with the standard SDXL-based detokenizer on reconstruction.
COCO2014 subset	PSNR (dB)	14.50	10.72	-3.78
Quantitative metrics on single-round personalization showing identity preservation improvements.
Personalization Dataset [13]	ArcFace	0.094	0.293	+0.199
Personalization Dataset [13]	CLIP Score	28.36	28.59	+0.23
Human evaluation results comparing the proposed method against the SEED-X baseline.
Internal Evaluation	Face ID Win Rate (%)	2.5	71.25	+68.75
Internal Evaluation	Image Quality Win Rate (%)	3.75	73.75	+70.00

Experiment Figures

Comparison of reconstruction quality between SDXL detokenizers (Stage 1 & 2) and the proposed DiT detokenizer.

Main Takeaways

DiT detokenizer significantly improves facial detail reconstruction compared to SDXL-based detokenizers which suffer from artifacts or identity loss
Multi-stage fine-tuning is crucial: Stage 1 (reconstruction) -> Stage 2 (masked face prediction) -> Stage 3 (paired identity training) balances editability and identity preservation
The model successfully performs multi-round reasoning: it can generate 'Olivia' in Round 2 by retrieving her appearance from the Round 1 image output based solely on her name
Existing MLLMs like SEED-X act mostly as text-to-image models, often ignoring the visual condition (face) in favor of the text prompt (e.g., generating a male pirate despite a female input face)

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multi-modal Large Language Models (MLLMs) and tokenization
Familiarity with Diffusion Models and Diffusion Transformers (DiT)
Basic knowledge of personalization techniques (DreamBooth, LoRA, ID embedding)

Key Terms

MLLM: Multi-modal Large Language Model—an AI model capable of processing and generating both text and images (e.g., SEED-X, Emu)

DiT: Diffusion Transformer—a diffusion model backbone that uses Transformer architecture instead of the traditional U-Net, used here as a high-fidelity image detokenizer

Detokenizer: A component that converts discrete or continuous image tokens (produced by the MLLM) back into a high-resolution pixel image

SEED-X: The specific MLLM architecture used as the backbone, which unifies multi-granularity comprehension and generation

ArcFace: A face recognition model used as a metric to calculate identity similarity scores between generated faces and reference faces

Personalization: Generating images of a specific subject (e.g., a specific person's face) in different contexts based on text prompts

Instruction Fine-Tuning: Training the model on datasets of (instruction, output) pairs to improve its ability to follow user commands

Chat-History Caching: Storing previous conversation turns (text and images) in memory so the model can attend to them during the current generation step