MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

📝 Paper Summary

Multimodal Chain-of-Thought (CoT) Mathematical Reasoning

MINT-CoT improves mathematical reasoning in multimodal models by dynamically selecting and interleaving fine-grained visual tokens directly into textual reasoning steps using a specialized Interleave Token.

Core Problem

Existing multimodal math reasoning methods rely on coarse bounding boxes or insufficiently trained vision encoders, failing to capture the complex, interconnected fine-grained details necessary for solving math problems.

Why it matters:

Box-based methods (like cropping) often include irrelevant background noise or miss connected geometric elements, confusing the model
Standard vision encoders (CLIP, SigLIP) are trained on natural images and struggle with out-of-distribution mathematical diagrams
Current approaches often depend on external tools or separate detection models, increasing computational cost and complexity

Concrete Example: In a geometry problem where visual information is highly interconnected (e.g., intersecting lines), box-based methods crop a rectangular region that includes distracting elements. MINT-CoT, instead, selects only the specific visual tokens representing 'line segment AB' or 'angle DOC' relevant to the current reasoning step.

Key Novelty

Mathematical Interleaved Token (MINT) Selection

Introduces a special 'Interleave Token' that acts as a bridge; when generated, it compares its state with all visual tokens to find the most relevant ones
Instead of predicting bounding boxes, the model directly selects soft visual tokens based on similarity scores, allowing for arbitrary shapes (lines, curves) rather than just rectangles
Uses a progressive training pipeline moving from text-only reasoning to supervised interleaved training, and finally reinforcement learning to refine token selection

Architecture

Overview of the MINT-CoT framework, illustrating how the Interleave Token selects visual tokens during the generation process.

Evaluation Highlights

+34.08% improvement on MathVista for MINT-CoT-7B over the baseline model
+28.78% improvement on GeoQA compared to the baseline model
+23.2% improvement on MMStar compared to the baseline model

Breakthrough Assessment

8/10

Large reported gains over the baseline (+23.2% to +34.08%) on three benchmarks. The shift from bounding boxes to direct token selection for math reasoning is a methodologically sound and effective innovation.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Chain-of-Thought (CoT) reasoning for mathematics

Inputs: Image I and mathematical question/instruction T

Outputs: Sequence of interleaved textual steps s and selected visual tokens v, ending in a final answer

Pipeline Flow

Vision Encoder (extracts features)
LLM Backbone (processes text + visual features)
Interleave Token Mechanism (selects specific visual tokens during generation)
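
The flow above can be illustrated with a toy sequence builder. This is a minimal sketch under stated assumptions, not the authors' implementation: the `<interleave>` marker string, the `interleave_sequence` helper, and the `select_fn` callback are hypothetical names, and keeping the marker in the stream with the selected visual tokens placed right after it is an assumption. The real model operates on hidden states, not strings.

```python
from typing import Callable, List, Tuple, Union

# Hypothetical marker string; the actual special-token name is not given in this summary.
INTERLEAVE = "<interleave>"

# A sequence item is either a text token or a ("visual", index) reference.
Item = Union[str, Tuple[str, int]]


def interleave_sequence(text_tokens: List[str],
                        select_fn: Callable[[int], List[int]]) -> List[Item]:
    """Insert selected visual token references after each interleave marker.

    select_fn(k) returns the visual token indices chosen for the k-th marker.
    """
    out: List[Item] = []
    k = 0
    for tok in text_tokens:
        out.append(tok)
        if tok == INTERLEAVE:
            # Assumption: selected visual tokens follow the marker directly.
            out.extend(("visual", idx) for idx in select_fn(k))
            k += 1
    return out


steps = ["Consider", "segment", "AB", INTERLEAVE,
         "so", "angle", "DOC", INTERLEAVE, "Answer:", "40"]
picks = {0: [3, 4], 1: [10]}  # made-up selections for illustration
sequence = interleave_sequence(steps, lambda k: picks[k])
```

Here the first marker pulls in visual tokens 3 and 4 and the second pulls in token 10, so each reasoning step is followed by only the image regions it refers to.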

System Modules

Vision Encoder

Extract visual features from the input image

Model or implementation: Not explicitly specified in text (likely SigLIP or CLIP based on context)

Interleave Token Projector (Retrieval & Selection)

Project the hidden state of the Interleave Token to the same space as visual tokens for comparison

Model or implementation: Linear projection (P_post_intlv)

Token Selector (Retrieval & Selection)

Select visual tokens based on cosine similarity with the Interleave Token embedding

Model or implementation: Cosine similarity + Thresholding

LLM Backbone

Generate reasoning text and Interleave Tokens autoregressively

Model or implementation: 7B parameter model (MINT-CoT-7B)

Novel Architectural Elements

Interleave Token mechanism: A dedicated token that triggers a dynamic, similarity-based retrieval of fine-grained visual tokens from the encoder's output, bypassing bounding box proposals.

Modeling

Base Model: 7B-parameter backbone (not named here); the trained model is MINT-CoT-7B

Training Method: Three-stage pipeline: Text-only CoT -> Interleaved CoT SFT -> Interleaved CoT RL

Objective Functions:

Purpose: Train the model to generate correct text tokens.

Formally: Standard Cross-Entropy Loss on text tokens.

Purpose: Train the model to select the correct visual tokens when an Interleave Token is generated.

Formally: Binary Cross-Entropy Loss on cosine similarity scores between Interleave Token and all visual tokens (labels derived from grid annotations).

Purpose: Optimize the reasoning policy using reinforcement learning based on answer correctness.

Formally: GRPO (Group Relative Policy Optimization) loss comparing advantages of generated chains within a group.
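
As an illustration of the second objective, the sketch below computes a binary cross-entropy over per-token similarity scores. How the paper maps a cosine score to a probability is not stated here; passing the raw score through a sigmoid is an assumption, and `bce_on_similarity` is a hypothetical helper name.

```python
import math
from typing import Sequence


def bce_on_similarity(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy over visual tokens.

    scores: one similarity score per visual token.
    labels: 1 if the token should be selected, else 0.
    Assumption: each score is treated as a logit and squashed with a sigmoid.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    total = 0.0
    for s, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-s))
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(scores)


scores = [0.9, -0.8, 0.7]                          # toy similarity scores
loss_aligned = bce_on_similarity(scores, [1, 0, 1])     # labels agree with scores
loss_misaligned = bce_on_similarity(scores, [0, 1, 0])  # labels disagree with scores
```

The loss is lower when high-similarity tokens are the ones labeled as relevant, which is the pressure that teaches the Interleave Token to point at the right image regions.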

Training Data:

54K samples derived from Mulberry-260K
Pipeline: (1) Grid-index images, (2) OCR to map text to indices, (3) Extract keywords via GPT-4o, (4) Align keywords to visual regions via GPT-4o
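
To show how grid annotations could become supervision for token selection, here is a minimal sketch that turns a set of annotated grid cells into a binary label vector. The zero-based row-major indexing, the one-cell-per-visual-token alignment, and the helper names `cell_index` and `grid_labels` are all assumptions made for illustration.

```python
from typing import Iterable, List


def cell_index(row: int, col: int, cols: int) -> int:
    """Row-major index of a grid cell (zero-based; an assumed convention)."""
    return row * cols + col


def grid_labels(rows: int, cols: int, selected_cells: Iterable[int]) -> List[int]:
    """Binary label per grid cell: 1 if the cell was annotated as relevant."""
    n = rows * cols
    chosen = set(selected_cells)
    out_of_range = sorted(c for c in chosen if not 0 <= c < n)
    if out_of_range:
        raise ValueError(f"cell indices out of range: {out_of_range}")
    return [1 if i in chosen else 0 for i in range(n)]


# Toy 2x3 grid where an annotator marked cells (0, 1) and (1, 1).
marked = [cell_index(0, 1, cols=3), cell_index(1, 1, cols=3)]
labels = grid_labels(rows=2, cols=3, selected_cells=marked)
```

Under the one-cell-per-token assumption, `labels` can be used directly as the target vector for the visual token selection loss.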

Key Hyperparameters:

group_size_G: Used in GRPO (value not explicitly stated in text, typically 4-16)
threshold_theta: Predefined threshold for token selection (value not explicitly stated)
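
For context on group_size_G, the sketch below shows the group-relative advantage commonly used in GRPO: each of the G sampled chains is scored against the mean and standard deviation of rewards within its own group. This is a generic sketch rather than the paper's training code; the use of the population standard deviation, the `eps` stabilizer, and the 0/1 correctness reward are assumptions.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of G sampled chains.

    advantage_i = (reward_i - mean(rewards)) / (std(rewards) + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group with G = 4: reward 1.0 for a correct final answer, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Chains with correct answers receive positive advantages and incorrect ones negative, and a group in which every chain gets the same reward yields all-zero advantages, so no separate value function is needed.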

Compute: Not reported in the paper

Comparison to Prior Work

vs. Visual-CoT: Selects soft tokens of arbitrary shapes rather than rigid bounding boxes
vs. ICoT: Explicitly trains a selection mechanism (Interleave Token) rather than relying on raw attention maps
vs. Visual SKETCHPAD: Does not require external drawing tools or API calls
vs. MVoT: Applicable to general math reasoning, not just spatial planning/generation

Limitations

Reliance on the quality of the automated data generation pipeline (GPT-4o annotations)
Vision encoder limitations: if the underlying encoder (e.g., CLIP/SigLIP) fundamentally misses features, the selection mechanism cannot recover them
Interleaving visual tokens lengthens the sequence, which raises computational cost compared to text-only CoT

Reproducibility

Code: https://github.com/xinyan-cxy/MINT-CoT

Code and data are available in the repository above. The dataset construction pipeline is described in detail, but specific hyperparameters (learning rates, batch sizes) are not given in the main text.

📊 Experiments & Results

Evaluation Setup

Multimodal mathematical reasoning across various domains (geometry, general math)

Benchmarks:

MathVista (Visual mathematical reasoning)
GeoQA (Geometry problem solving)
MMStar (Multimodal capability evaluation)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MathVista	Accuracy	Not explicitly reported in the paper	Not explicitly reported in the paper	+32.59
MathVista	Accuracy	Not explicitly reported in the paper	Not explicitly reported in the paper	+34.08
GeoQA	Accuracy	Not explicitly reported in the paper	Not explicitly reported in the paper	+28.78
MMStar	Accuracy	Not explicitly reported in the paper	Not explicitly reported in the paper	+23.2

Experiment Figures

Comparison between Box-shaped Selection (Visual-CoT) and MINT-CoT (Token Selection).

The data generation pipeline for the MINT-CoT dataset.

Main Takeaways

MINT-CoT outperforms the baseline on all three benchmarks (MathVista, GeoQA, MMStar), with reported gains of +23.2% to +34.08%, demonstrating the effectiveness of interleaved visual tokens.
The method is particularly effective for geometry problems where fine-grained visual perception is crucial.
The three-stage training strategy (Text CoT -> Interleaved SFT -> RL) is essential for achieving these results.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multimodal Large Language Models (MLLMs)
Chain-of-Thought (CoT) prompting
Basic Transformer architecture (visual tokens, hidden states)
Reinforcement Learning (specifically GRPO or PPO variants)

Key Terms

CoT: Chain-of-Thought—a prompting method where models generate intermediate reasoning steps before the final answer

Interleave Token: A special token introduced by this paper that triggers the selection of relevant visual tokens from the image encoder to be inserted into the text stream

SFT: Supervised Fine-Tuning—training a model on a labeled dataset to learn a specific behavior or format

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that optimizes policies by comparing a group of outputs against each other rather than using a separate value function

OCR: Optical Character Recognition, which extracts text from images; used here to map text in the diagrams to grid indices

Visual tokens: The embedding vectors of image patches produced by a vision encoder (like ViT), one per patch

Grid-indexed images: Images overlaid with a grid where each cell has a unique index, used here to create ground-truth labels for visual token selection

RL: Reinforcement Learning—training models by rewarding desired behaviors (correct answers) and punishing undesired ones