Weikang Shi, Aldrich Yu, Rongyao Fang, Houxing Ren, Ke Wang, Aojun Zhou, Changyao Tian, Xinyu Fu, Yuxuan Hu, Zimu Lu, Linjiang Huang, Si Liu, Rui Liu, Hongsheng Li
Multimedia Laboratory (MMLab), The Chinese University of Hong Kong,
Huawei Research,
Beihang University
arXiv.org
(2025)
MM, Reasoning, Pretraining, Benchmark
📝 Paper Summary
Visual Chain-of-Thought (VCoT), Multimodal Mathematical Reasoning, Diagram Generation and Editing
MathCanvas enables unified Large Multimodal Models to perform intrinsic visual reasoning by training them to generate and strategically interleave diagrammatic edits within their textual chain-of-thought.
Core Problem
Existing LMMs cannot generate precise mathematical diagrams or strategically decide when and how to use them; the visuals they produce are often flawed and act as decoration rather than as reasoning aids.
Why it matters:
Geometry and function analysis intrinsically require visual aids for human-like problem solving; text-only reasoning is insufficient.
Prior Visual CoT methods rely on rigid external tools (e.g., code interpreters) that lack flexibility, while intrinsic methods have failed to produce high-fidelity diagrams needed for complex deduction.
Concrete Example: In a geometry problem requiring an auxiliary line, one baseline model (Nano-Banana) generates a visual that is a 'flawed decoration' and fails to reveal the key insight. Another baseline (BAGEL-Zebra-CoT) draws a geometrically incorrect figure, which makes it useless for deduction.
Decouples training into two phases: first mastering the 'hand' (drawing/editing diagrams via MathCanvas-Edit/Imagen), then mastering the 'mind' (strategic interleaving via MathCanvas-Instruct).
Treats visual aids as dynamic, editable reasoning steps rather than static final outputs, allowing the model to 'think' visually by iteratively refining diagrams.
Architecture
The two-stage training recipe for BAGEL-Canvas: Stage I for Visual Manipulation and Stage II for Strategic Visual-Aided Reasoning.
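The two-stage recipe can be written down as a small configuration table. The sketch below is only an illustration assembled from this summary: the `Stage` class, the `RECIPE` tuple, the `losses_for` helper, and the loss labels `rectified_flow` and `next_token` are hypothetical names, not identifiers from the paper or its code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Stage:
    """One training stage: its goal, datasets, and active losses (hypothetical schema)."""
    name: str
    goal: str
    datasets: Tuple[str, ...]
    losses: Tuple[str, ...]


# Dataset and loss assignments follow the summary's description of each stage.
RECIPE = (
    Stage(
        name="Stage I",
        goal="Visual Manipulation",
        datasets=("MathCanvas-Edit", "MathCanvas-Imagen"),
        losses=("rectified_flow",),
    ),
    Stage(
        name="Stage II",
        goal="Strategic Visual-Aided Reasoning",
        datasets=("MathCanvas-Instruct",),
        losses=("rectified_flow", "next_token"),
    ),
)


def losses_for(stage_name: str) -> Tuple[str, ...]:
    """Return the loss labels active in the named stage."""
    for stage in RECIPE:
        if stage.name == stage_name:
            return stage.losses
    raise KeyError(stage_name)


for stage in RECIPE:
    print(f"{stage.name}: {stage.goal} | data={stage.datasets} | losses={stage.losses}")
```

Keeping the diagram loss active in both stages reflects the summary's statement that the Rectified-Flow Loss applies to Stage I and Stage II, while next-token prediction is listed for Stage II only.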
Evaluation Highlights
Achieves an 86% relative improvement over strong LMM baselines on the proposed MathCanvas-Bench test set.
Demonstrates generalization to other public math benchmarks (qualitative claim based on abstract).
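For readers unfamiliar with the term, a relative improvement is conventionally computed as below. This assumes the standard definition; the underlying baseline and model scores are not given in this summary, so the symbols are left abstract.

```latex
\[
\text{relative improvement}
  = \frac{s_{\text{new}} - s_{\text{base}}}{s_{\text{base}}},
\qquad
\text{so } 86\% \text{ corresponds to } s_{\text{new}} = 1.86\, s_{\text{base}}.
\]
```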
Breakthrough Assessment
8/10
Proposes a comprehensive framework and massive datasets (15M+ pairs) that address a fundamental gap in multimodal reasoning—intrinsic diagram generation. The reported 86% relative improvement suggests a significant leap over existing tool-based or text-centric approaches.
⚙️ Technical Details
Problem Definition
Setting: Multimodal mathematical problem solving with interleaved visual-textual output
Inputs: A math problem P consisting of text T and optionally an initial image I
Outputs: A solution sequence S containing interleaved text tokens and generated visual tokens (diagrams), concluding with the final answer
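The input and output described above can be pictured as two small record types. This is a minimal sketch under stated assumptions: `ImageRef`, `Problem`, and `Solution` are hypothetical names chosen for illustration, and an image is represented by a plain URI string rather than pixels or visual tokens.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class ImageRef:
    """Hypothetical handle to an image; stands in for pixels or visual tokens."""
    uri: str


@dataclass
class Problem:
    """A math problem P: text T plus an optional initial image I."""
    text: str
    image: Optional[ImageRef] = None


@dataclass
class Solution:
    """A solution sequence S: interleaved text and diagram steps, then the final answer."""
    steps: List[Union[str, ImageRef]] = field(default_factory=list)
    final_answer: str = ""

    def num_diagrams(self) -> int:
        """Count how many steps are diagrams rather than text."""
        return sum(isinstance(step, ImageRef) for step in self.steps)


# Toy instance for illustration only; not taken from the paper's data.
problem = Problem(text="Toy geometry question with no initial figure.")
solution = Solution(
    steps=["Add an auxiliary line.", ImageRef("step1.png"), "Read off the result."],
    final_answer="toy answer",
)
print(problem.image is None, solution.num_diagrams())
```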
Encodes input text and initial images into latent representations
Model or implementation: Transformer-based Encoder (part of BAGEL architecture)
Generation Expert
Generates text reasoning steps and predicts when to trigger diagram generation
Model or implementation: Transformer-based Decoder (part of BAGEL architecture)
Visual Generator
Synthesizes mathematical diagrams when triggered
Model or implementation: Generation Expert (Dual role) with Rectified-Flow Head
Novel Architectural Elements
Integration of diagram generation directly into the reasoning loop via a <|vision_start|> token trigger, treating image generation as a latent thought step
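The trigger mechanism can be sketched as a decoding loop that hands control to a diagram generator whenever the text decoder emits `<|vision_start|>`. This is an illustrative sketch, not the BAGEL-Canvas implementation: `next_token` and `draw` are hypothetical callables standing in for the model, `<|eos|>` is an assumed end marker, and `Diagram` is a placeholder for generated visual content.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

VISION_START = "<|vision_start|>"  # trigger token named in the summary
EOS = "<|eos|>"                    # assumed end-of-sequence marker (hypothetical)


@dataclass
class Diagram:
    """Placeholder for a generated diagram."""
    description: str


Step = Union[str, Diagram]


def interleaved_decode(
    next_token: Callable[[List[Step]], str],
    draw: Callable[[List[Step]], Diagram],
    prompt: List[Step],
    max_steps: int = 64,
) -> List[Step]:
    """Generate text tokens, switching to diagram generation on the trigger token."""
    context: List[Step] = list(prompt)
    output: List[Step] = []
    for _ in range(max_steps):
        token = next_token(context)
        if token == EOS:
            break
        # On the trigger, a diagram replaces the token; both feed later steps.
        step: Step = draw(context) if token == VISION_START else token
        context.append(step)
        output.append(step)
    return output


def scripted_policy(script: List[str]) -> Callable[[List[Step]], str]:
    """Toy stand-in for the text decoder: replays a fixed token script."""
    tokens = iter(script)
    return lambda context: next(tokens, EOS)


result = interleaved_decode(
    next_token=scripted_policy(["Draw", "AD.", VISION_START, "So", "x=3.", EOS]),
    draw=lambda context: Diagram(f"diagram after {len(context)} steps"),
    prompt=["Problem"],
)
print(result)
```

Appending each diagram to the context is the point of the design: later text tokens are conditioned on the drawn figure, so the diagram acts as a reasoning step rather than a final output.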
Modeling
Base Model: BAGEL (Unified LMM)
Training Method: Two-stage Supervised Fine-Tuning (SFT)
Objective Functions:
Purpose: Train the model to generate high-fidelity diagrams.
Formally: Rectified-Flow Loss (Stage I & II)
Purpose: Train the model to decide when to draw and how to reason.
Formally: Autoregressive Next-Token Prediction Loss (Stage II)
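The summary names the two objectives without writing them out. For orientation, their commonly used forms are given below; this is the standard textbook formulation, and the paper's exact parameterization, time sampling, and loss weighting may differ. The symbols are assumptions for illustration: $x_0$ is Gaussian noise, $x_1$ is the target diagram latent, $c$ is the conditioning context, and $y_1, \dots, y_N$ are the text tokens.

```latex
% Rectified-flow (velocity-matching) loss on a straight path from noise to data
\[
x_t = (1 - t)\,x_0 + t\,x_1, \qquad
\mathcal{L}_{\text{RF}}
  = \mathbb{E}_{t,\,x_0,\,x_1}
    \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|_2^2
\]

% Autoregressive next-token prediction loss over the text tokens
\[
\mathcal{L}_{\text{NTP}}
  = -\sum_{i=1}^{N} \log p_\theta\!\left(y_i \mid y_{<i},\, c\right)
\]
```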
Project page: https://mathcanvas.github.io/. The paper describes the data construction pipeline in detail (using AlphaGeometry, GPT-4, etc.) but does not explicitly state whether the full 15M+ dataset or the model weights are released. Evaluation prompts are in Appendix C.
📊 Experiments & Results
Evaluation Setup
Generative mathematical problem solving requiring interleaved text and diagrams
Benchmarks:
MathCanvas-Bench (Multimodal Math Problem Solving) [New]
Metrics:
Complete Accuracy (Binary)
Weighted Scoring (Partial credit for sub-questions)
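The two metrics can be sketched as follows. This is a hedged illustration, not the benchmark's scoring code: the function names are hypothetical, and the per-sub-question weights are passed in by the caller because the summary does not state how the benchmark assigns them.

```python
from typing import Sequence


def complete_accuracy(per_problem: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems in which every sub-question is answered correctly."""
    if not per_problem:
        raise ValueError("no problems to score")
    solved = sum(1 for subs in per_problem if subs and all(subs))
    return solved / len(per_problem)


def weighted_score(correct: Sequence[bool], weights: Sequence[float]) -> float:
    """Partial credit for one problem: weight of correct sub-questions over total weight."""
    if len(correct) != len(weights) or not weights:
        raise ValueError("need one weight per sub-question")
    earned = sum(w for ok, w in zip(correct, weights) if ok)
    return earned / sum(weights)


# Toy data for illustration only; the weights are arbitrary assumptions.
results = [[True, True], [True, False]]
print(complete_accuracy(results))                 # 1 of 2 problems fully correct
print(weighted_score([True, False], [1.0, 3.0]))  # earns weight 1 out of 4
```

Complete accuracy is all-or-nothing per problem, while the weighted score rewards a model that answers some sub-questions of a multi-part problem correctly.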
Statistical methodology: Not explicitly reported in the paper
Experiment Figures
The data construction pipeline for the pretraining corpus (MathCanvas-Edit and MathCanvas-Imagen).
Main Takeaways
MathCanvas achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench, validating the efficacy of intrinsic VCoT.
The two-stage training strategy allows the model to first master visual execution (how to draw) and then visual strategy (when to draw), mirroring human learning.
Visual aids generated by the model are not just decorative but functional, enabling the model to solve geometry problems that text-only baselines fail on.
📚 Prerequisite Knowledge
Prerequisites
Understanding of Large Multimodal Models (LMMs)
Chain-of-Thought (CoT) prompting
Basics of generative image models (e.g., VAE, Diffusion/Flow)
Key Terms
VCoT: Visual Chain-of-Thought—a reasoning process where models generate visual aids (like diagrams) alongside text to solve problems
LMM: Large Multimodal Model—a single model capable of processing and generating both text and images
Rectified-Flow Loss: A training objective for flow-based generative image models that regresses a velocity field along straight-line paths between noise and data; used here for the diagram generation component
Intrinsic VCoT: Visual reasoning where the model natively generates images as part of its thought process, rather than calling external tools like Python plotters
AlphaGeometry: A neuro-symbolic system for geometry proof solving, used here to mine valid geometry problems and edit trajectories
Interleaved Reasoning: A generation mode where text and images are produced in a mixed sequence, allowing diagrams to appear exactly when needed in the logical flow