Test-Time Scaling (TTS): Allocating additional computational resources (e.g., more tokens, more generation rounds) during inference to improve model performance.
Budget Forcing: A technique to control inference cost by forcing the model to continue generating reasoning steps or refinement rounds until a specific limit (budget) is reached.
Bagel: The underlying unified multimodal model architecture used in this paper, capable of processing and generating both text and images.
Nested CFG: An inference strategy that applies classifier-free guidance (CFG) in two nested stages: text guidance first, then image guidance on top of the text-guided result, so that prompt adherence and visual consistency can be controlled separately.
Subgoal Decomposition: Breaking a complex instruction into sequential planning steps (e.g., fixing object A first, then object B).
OneIG-Bench: A benchmark for evaluating instruction-following capability in image generation.
LPIPS: Learned Perceptual Image Patch Similarity, a metric that measures the perceptual difference between two images (lower values mean the images are more similar).
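
To make the Nested CFG and Budget Forcing entries concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation. The function names (guide, nested_cfg, run_with_budget), the three-prediction setup (full conditioning, text dropped, image dropped), and the exact order of extrapolation are all assumptions chosen to match the glossary wording "first text guidance, then image guidance on top". Predictions are plain lists of floats so the block runs without any dependencies.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

State = TypeVar("State")


def guide(base: Sequence[float], target: Sequence[float], scale: float) -> List[float]:
    """Standard CFG extrapolation: move from `base` toward `target` by `scale`."""
    return [b + scale * (t - b) for b, t in zip(base, target)]


def nested_cfg(
    full: Sequence[float],      # prediction with text and image conditioning
    no_text: Sequence[float],   # prediction with the text condition dropped
    no_image: Sequence[float],  # prediction with the image condition dropped
    text_scale: float,
    image_scale: float,
) -> List[float]:
    """Hypothetical nested CFG: text guidance first, image guidance on top."""
    # Stage 1 (assumed): text guidance controls prompt adherence.
    text_guided = guide(no_text, full, text_scale)
    # Stage 2 (assumed): image guidance is applied to the text-guided result.
    return guide(no_image, text_guided, image_scale)


def run_with_budget(
    refine: Callable[[State], Tuple[State, bool]],
    state: State,
    budget: int,
) -> Tuple[State, int]:
    """Hypothetical budget forcing: run exactly `budget` refinement rounds.

    `refine` returns the new state and a flag saying the model would like to
    stop. The flag is deliberately ignored until the budget is spent.
    """
    rounds = 0
    while rounds < budget:
        state, _wants_to_stop = refine(state)
        rounds += 1
    return state, rounds
```

With both scales set to 1.0, nested_cfg returns the fully conditioned prediction unchanged, which is a quick sanity check on the formula. In run_with_budget, a real system would replace the refine callable with a model call that produces a reasoning step or an edited image.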