GRIT: Teaching MLLMs to Think with Images

📝 Paper Summary

Multimodal Reasoning, Visual Grounding, Reinforcement Learning for MLLMs

GRIT teaches multimodal models to interleave bounding box coordinates within textual reasoning chains using reinforcement learning and format-based rewards, requiring no human-annotated reasoning paths.

Core Problem

Current multimodal models generate reasoning chains in pure text, disconnected from specific image regions, and training grounded reasoning typically requires expensive, scarce datasets with step-by-step bounding box annotations.

Why it matters:

Pure text reasoning in MLLMs (Multimodal Large Language Models) often hallucinates or fails to ground logic in visual evidence
Existing solutions require dense, hard-to-obtain supervision (human annotations linking text steps to boxes)
Models struggle to maintain context across multiple images when every reasoning step requires pixel-level inputs

Concrete Example: When asked a complex visual question, a standard MLLM might say 'The man is holding a cup' without checking the corresponding image region. GRIT produces '<think> The man [box_coords] is holding a cup [box_coords] </think>', forcing the model to explicitly locate the objects it references while reasoning.

Key Novelty

Grounded Reasoning with Images and Texts (GRIT)

Defines a 'grounded reasoning paradigm' where the model outputs interleaved text and bounding box coordinates within reasoning tags (<think>...</think>)
Uses a format-aware Reinforcement Learning algorithm (GRPO-GR) that rewards the *structure* of reasoning (presence of valid boxes and tags) and the *final answer*, rather than supervising the specific content of the intermediate steps
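To make the interleaved output format concrete, here is a minimal sketch of how box coordinates could be pulled out of a reasoning chain. The serialization `[x_min, y_min, x_max, y_max]` with integer pixel values is an assumption for illustration (the summary only shows `[box_coords]`), and the helper names `extract_boxes` and `boxes_in_reasoning` are hypothetical.

```python
import re

# Assumed serialization: four comma-separated non-negative integers in brackets.
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_boxes(text):
    """Return well-formed (x_min, y_min, x_max, y_max) tuples found in text."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        x_min, y_min, x_max, y_max = (int(g) for g in match.groups())
        # Keep only boxes with positive width and height.
        if x_min < x_max and y_min < y_max:
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


def boxes_in_reasoning(output):
    """Collect boxes that appear inside <think>...</think> spans only."""
    boxes = []
    for span in THINK_PATTERN.findall(output):
        boxes.extend(extract_boxes(span))
    return boxes


sample = "<think> The man [10, 20, 110, 220] is holding a cup [60, 90, 85, 120] </think> Yes"
print(boxes_in_reasoning(sample))
# → [(10, 20, 110, 220), (60, 90, 85, 120)]
```

Because the boxes live in the token stream itself, plain string parsing is enough to recover them; no detection head or mask input is involved.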

Architecture

Figure: The Grounded Reasoning Paradigm. The model generates a reasoning chain that interleaves natural language with bounding box coordinates, followed by a final answer.

Evaluation Highlights

Achieves grounded reasoning capability using only 20 image-question-answer triplets (from VSR and TallyQA datasets)
Successfully trains state-of-the-art MLLMs (Qwen 2.5-VL and InternVL 3) to unify reasoning and grounding without dense supervision

Breakthrough Assessment

8/10

The method unlocks a complex capability (interleaved visual-textual thinking) with extreme data efficiency (20 samples) via pure RL, removing the bottleneck of expensive reasoning annotations.

⚙️ Technical Details

Problem Definition

Setting: Visual Question Answering with explicit reasoning generation

Inputs: Image I and textual question q

Outputs: Reasoning chain c (containing text T and bounding boxes B) followed by final answer a

Pipeline Flow

Input Processing (Image + Question)
Policy Generation (Reasoning Chain with Text + BBoxes)
Final Answer Generation
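The three pipeline stages above can be sketched as a thin wrapper around the policy. This is an illustrative outline only: `canned_policy` is a hypothetical stand-in for a real Qwen 2.5-VL or InternVL 3 call, and treating everything after the last closing reasoning tag as the final answer is an assumption, not something the summary specifies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GroundedOutput:
    reasoning: str  # chain c: text interleaved with box coordinates
    answer: str     # final answer a


def split_output(raw: str) -> GroundedOutput:
    """Split raw policy text into reasoning chain and final answer.

    Assumption: the answer is whatever follows the last closing tag.
    """
    end = 0
    for tag in ("</think>", "</rethink>"):
        pos = raw.rfind(tag)
        if pos != -1:
            end = max(end, pos + len(tag))
    return GroundedOutput(reasoning=raw[:end].strip(), answer=raw[end:].strip())


def run_pipeline(policy: Callable[[str, str], str], image_path: str, question: str) -> GroundedOutput:
    """Input processing -> policy generation -> final answer extraction."""
    raw = policy(image_path, question)
    return split_output(raw)


def canned_policy(image_path: str, question: str) -> str:
    # Hypothetical stub; a real policy would run the multimodal model here.
    return "<think> The cat [12, 30, 90, 140] is left of the dog [120, 25, 210, 150]. </think> Yes"


result = run_pipeline(canned_policy, "example.jpg", "Is the cat left of the dog?")
print(result.answer)
# → Yes
```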

System Modules

Multimodal Policy

Generate interleaved reasoning chain and final answer

Model or implementation: Qwen 2.5-VL or InternVL 3

Novel Architectural Elements

Integration of bounding box coordinate generation directly into the textual reasoning stream without auxiliary detection heads or pixel masking inputs

Modeling

Base Model: Qwen 2.5-VL and InternVL 3

Training Method: GRPO-GR (Reinforcement Learning)

Objective Functions:

Purpose: Maximize expected reward relative to a group of sampled outputs, constrained by KL divergence.

Formally: J_GRPO(θ) = E[min(ratio * A_i, clip(ratio, 1-ε, 1+ε) * A_i) - β * D_KL], where ratio is the probability ratio between the current and old policies, A_i is the group-relative advantage of sample i, ε is the clipping range, and β weights the KL penalty against a reference policy.

Purpose: Reward proper formatting of reasoning tags.

Formally: s_st = 0.5 * I(correct think pair) + 0.5 * I(correct rethink pair)

Purpose: Reward presence of valid bounding boxes.

Formally: s_bf = 0.5 * I(num_bboxes >= 1)

Purpose: Reward answer accuracy using a model judge and text overlap.

Formally: r_ans = s_GPT (binary correctness from GPT-4o) + 0.1 * s_BLEU (sentence similarity)

Purpose: (Optional) Reward correct counting logic for counting tasks.

Formally: r_count = 0.5 * I(num_generated_boxes == ground_truth_count)
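The reward terms listed above can be sketched in a few lines of Python. Several details here are assumptions made for illustration: a "correct pair" is read as exactly one opening and one closing tag in order, a box is assumed to be serialized as `[x1, y1, x2, y2]` with integers, and the judge verdict and BLEU score are passed in as plain values rather than computed. How the terms are combined into a single reward is not stated in this summary, so the sketch keeps them separate.

```python
import re

# Assumed box serialization: [x1, y1, x2, y2] with integer coordinates.
BOX_PATTERN = re.compile(r"\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]")


def has_one_pair(text, tag):
    """True when exactly one <tag> and one </tag> occur, in that order."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    return (
        text.count(open_tag) == 1
        and text.count(close_tag) == 1
        and text.index(open_tag) < text.index(close_tag)
    )


def structure_reward(text):
    """s_st: 0.5 for a correct think pair plus 0.5 for a correct rethink pair."""
    return 0.5 * has_one_pair(text, "think") + 0.5 * has_one_pair(text, "rethink")


def box_format_reward(text):
    """s_bf: 0.5 if at least one syntactically valid box is present."""
    return 0.5 * (len(BOX_PATTERN.findall(text)) >= 1)


def count_reward(text, ground_truth_count):
    """r_count: 0.5 if the number of generated boxes equals the true count."""
    return 0.5 * (len(BOX_PATTERN.findall(text)) == ground_truth_count)


def answer_reward(judge_correct, bleu):
    """r_ans: binary judge verdict plus 0.1 times a BLEU-style similarity."""
    return float(judge_correct) + 0.1 * bleu


sample = (
    "<think> The cup [60, 90, 85, 120] is on the table. </think> "
    "<rethink> So there is one cup. </rethink> 1"
)
print(structure_reward(sample), box_format_reward(sample), count_reward(sample, 1))
# → 1.0 0.5 0.5
```

Note that none of these checks look at where a box points, only at whether it parses, which matches the limitation that box semantics are optimized only indirectly through the answer reward.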

Adaptation: Full model update (implied by RL on policy parameters)

Training Data:

20 image-question-answer triplets drawn from VSR (Visual Spatial Reasoning) and TallyQA datasets

Key Hyperparameters:

delta: 10^-8 (for numerical stability in advantage)
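As a rough illustration of where delta enters, the sketch below normalizes rewards within one sampled group and then evaluates the clipped term of J_GRPO for a single sample. It assumes the common GRPO form A_i = (r_i - mean) / (std + delta) with a population standard deviation; the clipping range `epsilon=0.2` is a placeholder value, not one reported in this summary, and the KL penalty is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, delta=1e-8):
    """Normalize each reward against its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    # delta keeps the division finite when every sample earns the same reward.
    return [(r - mu) / (sigma + delta) for r in rewards]


def clipped_term(ratio, advantage, epsilon=0.2):
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one sample."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


rewards = [2.0, 1.0, 1.0, 0.0]  # illustrative rewards for a group of four outputs
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])

# Identical rewards give zero advantage instead of a division by zero.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

With all rewards equal, the standard deviation is zero and delta alone keeps the denominator positive, so such a group contributes no gradient signal.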

Comparison to Prior Work

vs. VLM-R1: GRIT produces interpretable interleaved reasoning traces, not just final boxes
vs. VisCoT: GRIT requires no dense annotations (uses only Q-A pairs), whereas VisCoT relies on detailed reasoning+box labels
vs. DeepSeek-R1: GRIT extends the 'aha moment' of RL reasoning to include visual grounding actions (bounding boxes) explicitly

Limitations

Relies on the base model's inherent capacity to understand coordinates; does not introduce new visual encoders
Format rewards do not verify the semantic correctness of the bounding boxes, only their syntax and presence (indirectly optimized via answer accuracy)
Experiments that scale up the training data reveal challenges in generalizability

Reproducibility

Prompt templates for the GPT-4o judge and the system prompts are in Appendix D. No specific GitHub URL is provided in the source excerpt.

📊 Experiments & Results

Evaluation Setup

Visual Question Answering and Referring Expression Comprehension

Benchmarks:

VSR (Visual Spatial Reasoning)
TallyQA (Visual Counting)

Metrics:

Answer Accuracy
Grounding/Format Compliance

Statistical methodology: Not explicitly reported in the paper

Experiment Figures

The GRPO-GR Reinforcement Learning Framework

Main Takeaways

Data Efficiency: The method enables MLLMs to acquire grounded reasoning capabilities with as few as 20 training samples.
Unification: Trained models successfully merge originally disconnected abilities (grounding and reasoning) into a single coherent output stream.
Self-Reinforcement: The generation of bounding boxes is observed to boost the accuracy of subsequent reasoning steps, suggesting the model uses its own grounding to 'focus'.
High Correlation: Qualitative analysis shows a strong link between the text generated and the image regions referenced, despite lacking explicit supervision for this alignment.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) basics (Policy Optimization)
Multimodal Large Language Models (MLLMs)
Chain-of-Thought (CoT) prompting

Key Terms

MLLM: Multimodal Large Language Model—AI models capable of processing and generating both text and image data

GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies based on group-relative rewards to reduce variance

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer

VQA: Visual Question Answering—the task of answering natural language questions about an image

Bounding Box: A rectangular box defined by coordinates (usually x_min, y_min, x_max, y_max) that outlines a specific object in an image

REC: Referring Expression Comprehension—the task of localizing a specific image region described by a text query

BLEU: Bilingual Evaluation Understudy—a metric for evaluating text quality by measuring n-gram overlap with a reference text

RL: Reinforcement Learning—training models by rewarding desired behaviors rather than providing explicit correct answers for every step
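The n-gram overlap idea behind the BLEU entry above can be illustrated with a simplified score. This is a sketch of clipped n-gram precision only, not full BLEU (it has no brevity penalty and does not combine several n-gram orders), and the function name `clipped_ngram_precision` is hypothetical.

```python
from collections import Counter


def clipped_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total


candidate = "the cup is on the table"
reference = "the cup is on a table"
print(round(clipped_ngram_precision(candidate, reference, n=1), 3))
# → 0.833
print(round(clipped_ngram_precision(candidate, reference, n=2), 3))
# → 0.6
```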