CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation

📝 Paper Summary

Text-to-Image Generation · Multimodal Reasoning · Chain-of-Thought (CoT)

CoCo improves structured image generation by first generating executable code to render a precise draft image, which then guides a unified multimodal model to produce a high-fidelity final output.

Core Problem

Existing text-to-image models rely on abstract natural language planning, which lacks the precision required for complex spatial layouts, scientific diagrams, and dense textual content.

Why it matters:

Natural language is too abstract to strictly define coordinate systems, geometric constraints, or exact text placement needed for charts and plots
Current models frequently produce hallucinations or illegible text when creating scientific figures (e.g., mathematical plots) due to a lack of explicit visual grounding

Concrete Example: When prompted to generate a '2D plot of y=x^2', standard models often produce incorrect curves or gibberish axis labels. CoCo generates Python code to plot the exact function, renders a correct draft, and refines it into a polished image.
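
To make the example concrete, here is a minimal sketch of the kind of program a code generator could emit for the 'y=x^2' prompt. It is not taken from the paper: the function name `render_parabola_svg`, its parameters, and the choice to write plain SVG text instead of calling a plotting library are all assumptions made to keep the sketch dependency-free. The point is only that the curve and the label positions are computed, not guessed.

```python
# Hypothetical stand-in for a generated drafting program (not from the paper).
# It emits plain SVG text so that it runs with the standard library alone.

def render_parabola_svg(width=400, height=300, x_min=-3.0, x_max=3.0, samples=61):
    """Return an SVG string that plots y = x^2 with two axes and labels."""
    margin = 40
    y_max = max(x_min ** 2, x_max ** 2)

    def to_px(x, y):
        # Map data coordinates to pixel coordinates (the SVG y axis points down).
        px = margin + (x - x_min) / (x_max - x_min) * (width - 2 * margin)
        py = height - margin - (y / y_max) * (height - 2 * margin)
        return px, py

    xs = [x_min + i * (x_max - x_min) / (samples - 1) for i in range(samples)]
    points = " ".join("{:.1f},{:.1f}".format(*to_px(x, x * x)) for x in xs)

    x0, y0 = to_px(0.0, 0.0)  # pixel position of the origin
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">',
        f'<line x1="{margin}" y1="{y0:.1f}" x2="{width - margin}" y2="{y0:.1f}" stroke="black"/>',
        f'<line x1="{x0:.1f}" y1="{margin}" x2="{x0:.1f}" y2="{height - margin}" stroke="black"/>',
        f'<polyline points="{points}" fill="none" stroke="blue"/>',
        f'<text x="{width - margin}" y="{y0 + 15:.1f}">x</text>',
        f'<text x="{x0 + 5:.1f}" y="{margin}">y = x^2</text>',
        "</svg>",
    ]
    return "\n".join(parts)


draft_svg = render_parabola_svg()
```

Writing `draft_svg` to a file and opening it in a browser shows the draft. In the pipeline described here, a rendered draft of this kind would then be passed to the refinement stage.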

Key Novelty

Code-as-CoT (Executable Reasoning)

Replaces abstract textual 'thoughts' with executable Python code, which deterministically encodes spatial layouts and structural constraints
Uses the code execution result (a 'Draft Image') as an explicit visual scaffold, allowing the model to 'see' its plan before refining it into a high-fidelity image

Architecture

The overall inference pipeline of CoCo, showing the progression from text to code to draft to final image.

Evaluation Highlights

+68.83% improvement on StructT2IBench compared to the Bagel baseline (direct generation)
+54.8% improvement on OneIG-Bench compared to Bagel, showing better generalization on multilingual and stylized tasks
Outperforms text-based Chain-of-Thought approaches by 64.48% on StructT2IBench, validating that code is a superior reasoning medium for structure

Breakthrough Assessment

8/10

Significant conceptual shift from text-based planning to executable code-based planning for generation. Addresses a critical weakness (spatial/structural precision) in current text-to-image models.

⚙️ Technical Details

Problem Definition

Setting: Structured Text-to-Image Generation

Inputs: Text prompt p describing a complex or structured visual scene

Outputs: High-fidelity image I_final adhering to the structural constraints of p

Pipeline Flow

Group 1: Code Generation (Text → Code)
Group 2: Draft Rendering (Code → Draft Image)
Group 3: Draft-Guided Refinement (Draft Image + Text → Final Image)
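
The three groups above can be sketched as a simple composition of stages. This is a control-flow illustration only: the class name `CoCoStylePipeline`, its three callable fields, and the toy lambdas are hypothetical placeholders, not the paper's API, and no model is involved.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CoCoStylePipeline:
    # All three callables are hypothetical placeholders, not the paper's API.
    generate_code: Callable[[str], str]      # Group 1: prompt -> code
    render_draft: Callable[[str], bytes]     # Group 2: code -> draft image bytes
    refine: Callable[[str, bytes], bytes]    # Group 3: (prompt, draft) -> final image bytes

    def run(self, prompt: str) -> bytes:
        code = self.generate_code(prompt)
        draft = self.render_draft(code)
        # The refiner sees both the original prompt and the rendered draft.
        return self.refine(prompt, draft)


# Toy stand-ins so the control flow can be exercised without any model.
pipeline = CoCoStylePipeline(
    generate_code=lambda p: f"# code for: {p}",
    render_draft=lambda c: c.encode("utf-8"),
    refine=lambda p, d: b"final:" + d,
)
result = pipeline.run("2D plot of y=x^2")
```

Keeping the stages as separate callables mirrors the description above, where the draft is an explicit intermediate artifact and not a hidden state.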

System Modules

Code Generator (Group 1)

Translate the user text prompt into executable Python code specifying layout and structure

Model or implementation: Bagel (Unified MLLM)

Sandbox Executor (Group 2)

Execute the generated code in a safe environment to render a deterministic draft

Model or implementation: Python Interpreter (Sandboxed)

Draft Encoder (Group 3)

Encode the draft image to preserve both high-level semantics and low-level details

Model or implementation: ViT Encoder + VAE Encoder

Image Refiner (Group 3)

Generate the final high-fidelity image conditioned on the prompt and draft encodings

Model or implementation: Bagel (Unified MLLM with Rectified Flow)
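
As an illustration of the Sandbox Executor step, the sketch below runs generated code in a separate Python process with a timeout and a scratch working directory. The summary does not describe how the paper's sandbox is built, so this is an assumption: the helper name `run_generated_code` is hypothetical, and a child process with a timeout is only a weak form of isolation, not a real security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(code: str, timeout_s: float = 10.0):
    """Run `code` in a child Python process inside a temporary directory.

    Returns (ok, stdout, stderr). This is an illustrative assumption about
    how a draft could be executed; it does not restrict file or network access.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "draft.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, "", "timed out"
        return proc.returncode == 0, proc.stdout, proc.stderr


ok, out, err = run_generated_code("print('draft rendered')")
bad_ok, _, bad_err = run_generated_code("raise ValueError('broken code')")
```

Returning a success flag together with the captured error text lets a caller detect the failure case where generated code does not execute.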

Novel Architectural Elements

Two-stage inference pipeline where the intermediate representation is executable code rather than text or latent vectors
Dual encoding of the intermediate draft image using both ViT (semantic) and VAE (pixel) encoders to guide the final generation

Modeling

Base Model: Bagel (Unified MLLM)

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Train the model to generate correct code.

Formally: Token-level cross-entropy loss on code tokens

Purpose: Train the model to refine drafts into final images.

Formally: Mean Squared Error (MSE) on VAE tokens (via Rectified Flow matching)
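
The two objectives can be written out in their standard forms as follows. The notation is an assumption, not the paper's: $c_i$ are code tokens, $p$ is the prompt, $d$ stands for the draft encodings, $x_1$ are the VAE latents of the final image, $x_0$ is Gaussian noise, and $v_\theta$ is the predicted velocity. The time convention (noise at $t=0$, data at $t=1$) is one common choice, and the summary does not state how the two losses are weighted.

```latex
% Token-level cross-entropy on code tokens (standard autoregressive form)
\mathcal{L}_{\text{code}}
  = -\sum_{i=1}^{N} \log p_\theta\!\left(c_i \mid c_{<i},\, p\right)

% Rectified Flow matching: MSE between predicted and straight-path velocity
x_t = (1 - t)\, x_0 + t\, x_1,
\qquad x_0 \sim \mathcal{N}(0, I),\quad t \sim \mathcal{U}[0, 1]

\mathcal{L}_{\text{img}}
  = \mathbb{E}_{x_0,\, x_1,\, t}
    \left\| v_\theta\!\left(x_t,\, t \mid p,\, d\right) - \left(x_1 - x_0\right) \right\|_2^2
```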

Training Data:

CoCo-10K Dataset (10,000+ samples)
Editing Dataset: Pairs of charts (Original -> Corrected) from StructVisuals
Synthesis Dataset: Samples of the form (Text -> Code -> Draft -> Final) generated via Gemini-3-Pro (code) and Nano Banana (refinement)

Key Hyperparameters:

loss_type: Cross Entropy + MSE

Compute: Not reported in the paper

Comparison to Prior Work

vs. Image-Gen-CoT: CoCo uses constructive planning (code) rather than discriminative filtering (reward model)
vs. Bagel-CoT: CoCo uses executable code which is precise and verifiable, whereas natural language plans are abstract and prone to spatial errors
vs. IRG: CoCo constructs a controllable draft *prior* to synthesis via code, rather than refining post-hoc via visual feedback

Limitations

Relies on the model's ability to write valid, executable code; code generation failures break the pipeline
Two-stage process (code generation + image refinement) likely increases inference latency compared to direct generation
The approach is specifically optimized for structured/schematic images and may be less beneficial for purely artistic or abstract prompts

Reproducibility

Code: https://github.com/micky-li-hd/CoCo

The code is publicly available at the repository above. The paper introduces and describes the CoCo-10K dataset used for training.

📊 Experiments & Results

Evaluation Setup

Evaluation on structured, text-intensive, and complex concept generation tasks

Benchmarks:

StructT2IBench: structured image synthesis (charts, tables, math figures)
OneIG-Bench: multilingual text, stylized generation, compositional scenarios
LongText-Bench: rendering of extended textual content in images

Metrics:

Relative improvement (%) over baselines
Likely QA-based accuracy or VLM-based evaluation scores (implied by the choice of benchmarks; the exact metric names are not listed)
Statistical methodology: Not explicitly reported in the paper
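
For readers checking the arithmetic behind a relative-improvement figure, a minimal sketch follows. It assumes the reported percentages are relative changes over the baseline score, as the metric name suggests. The function name and the example scores are hypothetical and are not benchmark results.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Percentage change of `new` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# Hypothetical scores chosen only to illustrate the arithmetic.
gain = relative_improvement(baseline=40.0, new=50.0)
```

Under this definition, a rise from 40 to 50 is a 25% relative improvement even though the absolute gain is 10 points, so the two readings should not be mixed when comparing benchmarks.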

Experiment Figures

Qualitative comparison and visualization of CoCo's outputs.

Main Takeaways

Code is a more effective reasoning medium than natural language for structured image generation, as evidenced by large gains over text-CoT baselines (+64.48% on StructT2IBench).
The two-stage 'Draft-then-Refine' paradigm significantly outperforms direct generation, with improvements of +68.83% on StructT2IBench, +54.8% on OneIG-Bench, and +41.23% on LongText-Bench.
The method generalizes well to dense text rendering and multilingual tasks (LongText-Bench, OneIG-Bench), likely because code can explicitly specify text content and positions.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Unified Multimodal Models (UMMs) combining ViT and LLM components
Familiarity with Chain-of-Thought (CoT) reasoning
Basic knowledge of image generation via VAEs (Variational Autoencoders) and diffusion/flow matching

Key Terms

CoT: Chain-of-Thought—a reasoning technique where models generate intermediate steps (reasoning paths) before producing a final answer

UMM: Unified Multimodal Model—a single architecture capable of both understanding (encoding images) and generating (decoding images) alongside text processing

Bagel: The Unified Multimodal Model used as the backbone in this paper, which uses a Mixture-of-Experts design to handle text and image tokens

VAE: Variational Autoencoder—a neural network that compresses images into a latent space (tokens) and reconstructs them

ViT: Vision Transformer—a model architecture that processes images as sequences of patches (tokens), primarily for understanding tasks

Rectified Flow: A generative model formulation that learns to transport a simple noise distribution to a data distribution along straight paths, used here for image decoding
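
The straight-path idea in the Rectified Flow entry can be illustrated on toy vectors. The sketch below is a generic illustration, not the paper's decoder: the function names are hypothetical, the two-element lists stand in for image latents, and the exact straight-path velocity is used in place of a trained network.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Constant velocity of the straight path; the regression target in training."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise, data = [0.5, -1.0], [2.0, 3.0]   # toy 2-D "latents"
v_star = target_velocity(noise, data)   # stands in for a trained network
sample = euler_sample(noise, lambda x, t: v_star)
midpoint = interpolate(noise, data, 0.5)
```

Because the path is a straight line with constant velocity, a handful of Euler steps lands on the data point in this toy case. A learned velocity field is only approximately straight, so real samplers use more steps.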