CoT: Chain-of-Thought—a reasoning technique where the model generates intermediate reasoning steps before the final answer
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs generated from the same input, eliminating the need for a separate value function
ULM: Unified Large Multi-modal Model—a model capable of both understanding and generating text and images within a single transformer framework
Semantic-level CoT: Textual reasoning generated prior to the image, planning the scene layout and object details (e.g., 'I should draw a cat on the left...')
Token-level CoT: The sequential generation of discrete image tokens (patches), viewed as a reasoning chain in which each patch is conditioned on the patches generated before it
BiCoT-GRPO: The proposed RL method that jointly optimizes both Semantic-level and Token-level CoT within one training step
VQGAN: Vector Quantized Generative Adversarial Network—an autoencoder that compresses images into discrete tokens
KL divergence: A measure of how much one probability distribution differs from another (asymmetric, so not a true distance), used as a penalty to prevent the RL-tuned model from drifting too far from the original reference model
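
To make the GRPO and KL divergence entries concrete, here is a minimal sketch in plain Python. It assumes the commonly used GRPO normalization (each reward minus the group mean, divided by the group standard deviation), which the glossary itself does not spell out. The function names, the `eps` constant, and the example rewards are hypothetical and chosen only for illustration.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Score each output relative to the other outputs in its group.

    `rewards` holds one scalar reward per output sampled from the same
    input. No learned value function is involved: the group mean acts
    as the baseline. Dividing by the standard deviation is an assumed
    (though common) normalization.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    # eps avoids division by zero when every reward in the group is equal
    return [(r - mean) / (std + eps) for r in rewards]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support.

    Terms with p_i == 0 contribute nothing; q_i is assumed positive
    wherever p_i is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Four hypothetical outputs for one prompt, scored by a reward function
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

# Hypothetical next-token distributions of the tuned and reference models
policy = [0.9, 0.1]
reference = [0.5, 0.5]
penalty = kl_divergence(policy, reference)
```

In this sketch the best-scoring output receives a positive advantage, the worst a negative one, and outputs at the group mean receive zero. Swapping the arguments of `kl_divergence` gives a different value, which is why the order of the policy and the reference matters when the penalty is defined.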