Evaluation Setup
The method is evaluated on structured, text-intensive, and complex-concept generation tasks.
Benchmarks:
- StructT2IBench: structured image synthesis (charts, tables, math figures)
- OneIG-Bench: multilingual text, stylized generation, compositional scenarios
- LongText-Bench: rendering of extended textual content in images
Metrics:
- Relative improvement (%) over baselines
- Likely QA-based accuracy or VLM-based evaluation scores (inferred from the choice of benchmarks; the exact metric names are not listed in the available excerpt)
- Statistical methodology: Not explicitly reported in the paper
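The relative-improvement figures can be read with a small sketch like the one below. It assumes the conventional definition, (score - baseline) / baseline, which this summary does not confirm as the paper's formula. The function name and the example scores are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, score: float) -> float:
    """Percentage gain of `score` over `baseline`, relative to the baseline.

    Assumed definition: (score - baseline) / baseline * 100.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (score - baseline) / baseline * 100.0


# Hypothetical scores for illustration only, not results from the paper.
gain = relative_improvement(baseline=0.50, score=0.75)
print(f"{gain:+.2f}%")  # prints "+50.00%"
```

Under this definition, a relative gain depends on how low the baseline is: the same absolute gain yields a larger percentage on a weaker baseline, so percentages are not directly comparable across benchmarks.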
Main Takeaways
- Code is a more effective reasoning medium than natural language for structured image generation, as evidenced by large gains over text-CoT baselines (+64.48% on StructT2IBench).
- The two-stage 'Draft-then-Refine' paradigm substantially outperforms direct generation, with relative improvements of +68.83% on StructT2IBench, +54.8% on OneIG-Bench, and +41.23% on LongText-Bench.
- The method generalizes well to dense text rendering and multilingual tasks (LongText-Bench, OneIG-Bench), likely because code can explicitly specify text content and positions.
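To make the last point concrete, here is a minimal sketch of how a code draft can pin down text content and positions exactly. It is only an illustration: this summary does not state which code format the method uses, so the choice of SVG, the `TextBox` structure, and the `render_draft_svg` helper are all assumptions.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class TextBox:
    """One string with an explicit position (hypothetical structure)."""
    content: str
    x: int
    y: int
    font_size: int = 16


def render_draft_svg(boxes, width=400, height=200):
    """Emit an SVG draft that places each string at its stated coordinates."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">'
    ]
    for box in boxes:
        # escape() protects characters such as & and < inside the text.
        parts.append(
            f'<text x="{box.x}" y="{box.y}" font-size="{box.font_size}">'
            f'{escape(box.content)}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# Illustrative content only; a refinement stage would take such a draft as input.
draft = render_draft_svg([
    TextBox("Quarterly Revenue", x=20, y=30, font_size=20),
    TextBox("Q1: 12 & Q2: 15", x=20, y=60),
])
```

Unlike a natural-language description, every string and coordinate in such a draft is stated literally, which is one plausible reading of why code helps on dense-text benchmarks.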