unified MLLM: A multimodal large language model architecture (such as Chameleon or Lumina-mGPT) that handles understanding and generation of both images and text within a single transformer model
X-CoT: Cross-modal Chain-of-Thought, a reasoning strategy in which the model generates intermediate text and images (e.g., a cropped image of the subject) as a plan before producing the final output
GRPO: Group Relative Policy Optimization, a reinforcement learning algorithm that samples a group of outputs per prompt and scores each one relative to the group's average reward, which removes the need for a separate value function
DreamSim: A perceptual metric used to evaluate visual similarity between images, focusing on high-level semantic features and layout rather than just pixel alignment
PickScore: A metric that predicts human preference for text-image alignment, used here as a reward signal for prompt consistency
cold-start: The initial supervised fine-tuning phase using a synthetic dataset to teach the model the basic format and reasoning pattern before RL optimization
subject fidelity: The degree to which the generated image preserves the identity and key visual attributes of the specific subject from the reference image
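
As a rough illustration of the group-relative idea in the GRPO entry, the sketch below normalizes each sample's reward against the mean and standard deviation of its own group. The function name, the `eps` constant, the choice of population standard deviation, and the reward values are illustrative assumptions, not details taken from the source.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one scalar reward per sample drawn for the same prompt.
    No learned value function is involved: the group mean is the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four images sampled from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive a positive advantage and are reinforced, and those below it receive a negative one. If every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.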