Learning to Think Fast and Slow for Visual Language Models

📝 Paper Summary

Visual Language Models (VLMs), Visual Reasoning

DualMindVLM is a visual language model trained to automatically switch between concise responses (System 1) and detailed reasoning chains (System 2) based on problem difficulty, optimizing both accuracy and token efficiency.

Core Problem

Current reasoning-oriented VLMs are trained to always output long, step-by-step reasoning chains (System 2), even for simple perceptual tasks where such detail is unnecessary.

Why it matters:

Excessive token generation increases computational costs and latency for end-users
Existing models lack the human-like ability to dynamically allocate cognitive resources based on task difficulty
Forcing complex reasoning on simple tasks (overthinking) creates redundancy without improving accuracy

Concrete Example: When asked to recognize a simple emoji, a standard reasoning model (such as one trained with GRPO) generates a long chain of thought analyzing pixel details before answering, whereas a human would recognize it instantly. The proposed model instead outputs 'Short Thinking:' followed directly by the answer.

Key Novelty

Two-stage RL framework for automatic thinking-mode switching

Auto-labeling stage: Uses the base model's natural response length to classify training data as requiring 'fast' or 'slow' thinking without external supervision
Dual-mode RL stage: Trains the model to output a specific mode prefix ('Short Thinking' or 'Long Thinking') and switch strategies, using hybrid sampling where half the training rollouts are forced into the correct mode and half are free-form

Architecture

The two-stage training pipeline: Thinking Mode Auto-Labeling and Learning Dual-Mode Thinking.

Evaluation Highlights

+7.4 percentage points of accuracy on MathVista (Testmini) over the base Qwen2.5-VL-7B model (68.2 to 75.6)
Reduces token usage by ~40% on average compared to the best-performing reasoning baselines while maintaining competitive accuracy
Outperforms state-of-the-art reasoning models on 4 out of 6 benchmarks (MathVista, MMStar, ScienceQA, AI2D)

Breakthrough Assessment

8/10

Elegantly solves the 'overthinking' problem in reasoning models using a self-supervised labeling approach. Achieves SOTA accuracy with significantly lower compute, a critical step for practical deployment.

⚙️ Technical Details

Problem Definition

Setting: Visual Question Answering (VQA) with adaptive reasoning length

Inputs: Image I and Query Q

Outputs: A thinking mode prefix p ('Short Thinking:' or 'Long Thinking:') followed by the answer y

Pipeline Flow

Input (Image + Question)
Thinking Mode Selection (Model predicts 'Short Thinking:' or 'Long Thinking:')
Answer Generation (Concise answer or Chain-of-Thought based on prefix)
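
The prefix-then-answer output format can be illustrated with a small parser. This is a minimal sketch, not the authors' code: the function name `split_mode_prefix` and the internal mode names `"fast"` and `"slow"` are assumptions made for illustration; only the two prefix strings come from the summary.

```python
from typing import Optional, Tuple

# The two prefix strings come from the summary; the mode names are illustrative.
PREFIXES = {"Short Thinking:": "fast", "Long Thinking:": "slow"}


def split_mode_prefix(response: str) -> Tuple[Optional[str], str]:
    """Split a model response into (thinking mode, remaining text).

    Returns (None, text) when the response starts with neither prefix.
    """
    text = response.lstrip()
    for prefix, mode in PREFIXES.items():
        if text.startswith(prefix):
            return mode, text[len(prefix):].strip()
    return None, text


mode, body = split_mode_prefix("Short Thinking: The emoji is a smiling face.")
print(mode, "|", body)
```

A parser of this kind is what a format reward or an evaluation script would need in order to tell which mode the model selected for a given input.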

System Modules

Base VLM

Unified model for both mode selection and answer generation

Model or implementation: Qwen2.5-VL-7B

Novel Architectural Elements

Prefix-conditional generation mechanism where specific tokens ('Short Thinking:', 'Long Thinking:') act as control signals for the subsequent reasoning depth
Self-supervised data splitting pipeline that categorizes tasks by inherent difficulty (proxy: output length) rather than external labels

Modeling

Base Model: Qwen2.5-VL-7B

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Reward correct answers.
Formally: r_a = 1 if the answer is correct, else 0

Purpose: Enforce correct thinking mode format.
Formally: r_f = 1 if the generated prefix matches the label, else 0

Purpose: Optimize policy with KL constraint.
Formally: GRPO objective with KL penalty term -β * D_KL(π_θ || π_ref)
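
The two binary rewards and the group-relative normalization used by GRPO can be sketched as follows. Several details here are assumptions rather than facts from the paper: answers are compared by case-insensitive exact match, the two rewards are simply summed, and the population standard deviation is used for normalization. The KL penalty and the policy update itself are omitted.

```python
from statistics import mean, pstdev
from typing import List, Optional


def accuracy_reward(predicted: str, reference: str) -> float:
    """r_a: 1 if the answer is correct, else 0 (exact match is an assumption)."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def format_reward(generated_mode: Optional[str], labeled_mode: str) -> float:
    """r_f: 1 if the generated thinking-mode prefix matches the label, else 0."""
    return 1.0 if generated_mode == labeled_mode else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one question labeled "slow" with answer "12".
rollouts = [("slow", "12"), ("slow", "10"), ("fast", "12"), ("fast", "7")]
# Summing r_a and r_f is an assumption; the summary does not state the weighting.
rewards = [accuracy_reward(ans, "12") + format_reward(m, "slow") for m, ans in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards)
print([round(a, 2) for a in advantages])
```

Because advantages are computed relative to the group mean, a rollout is pushed up only when it scores better than its siblings for the same question, which is why no separate value network is needed.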

Training Data:

37,506 visual question-answer pairs total
18,778 slow-thinking samples (generated length > 200 tokens)
18,728 fast-thinking samples (generated length < 100 tokens)
Samples with 100-200 token length discarded to ensure separation
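
The length-based split above amounts to a two-threshold rule. The sketch below uses the 100 and 200 token thresholds from the summary; the function and constant names are hypothetical, and it assumes the token count of the base model's response has already been computed.

```python
from typing import Optional

SLOW_MIN_TOKENS = 200  # responses longer than this are labeled slow-thinking
FAST_MAX_TOKENS = 100  # responses shorter than this are labeled fast-thinking


def label_thinking_mode(response_tokens: int) -> Optional[str]:
    """Label a sample by the base model's response length.

    Returns "slow", "fast", or None for the ambiguous 100-200 token band,
    which is discarded to keep the two classes well separated.
    """
    if response_tokens > SLOW_MIN_TOKENS:
        return "slow"
    if response_tokens < FAST_MAX_TOKENS:
        return "fast"
    return None


lengths = [35, 150, 420]
labels = [label_thinking_mode(n) for n in lengths]
print(labels)  # ['fast', None, 'slow']
```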

Key Hyperparameters:

learning_rate: 1e-6
rollout_batch_size: 256
kl_coefficient_beta: 1e-3
max_generation_length: 2048
num_generations_per_sample_n: 8
free_form_generations_m: 4 (implied from 'half' strategy)
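
With n = 8 generations per sample and m = 4 free-form generations, hybrid sampling could look roughly like the sketch below. The `generate(prompt, response_prefix)` callable is a hypothetical sampler interface, and `fake_generate` is a stub that stands in for a real VLM so the example runs; neither is taken from the paper.

```python
from typing import Callable, List

# Prefix strings are taken from the summary; the mapping keys are illustrative.
MODE_PREFIX = {"fast": "Short Thinking:", "slow": "Long Thinking:"}


def hybrid_group_sample(
    generate: Callable[[str, str], str],
    prompt: str,
    labeled_mode: str,
    n: int = 8,
    m: int = 4,
) -> List[str]:
    """Return n rollouts: m free-form plus (n - m) forced into the labeled mode.

    `generate(prompt, response_prefix)` is a hypothetical sampler that returns
    the text the model produces after `response_prefix`.
    """
    forced_prefix = MODE_PREFIX[labeled_mode]
    free_form = [generate(prompt, "") for _ in range(m)]
    forced = [
        forced_prefix + generate(prompt, forced_prefix) for _ in range(n - m)
    ]
    return free_form + forced


def fake_generate(prompt: str, response_prefix: str) -> str:
    """Stand-in for a real VLM sampler, used only to make the sketch runnable."""
    if response_prefix:
        return " (continuation in the forced mode)"
    return "Short Thinking: (model chose its own mode)"


rollouts = hybrid_group_sample(fake_generate, "What is 2 + 3?", "slow")
for r in rollouts:
    print(r)
```

Forcing part of the group into the labeled mode guarantees that every group contains examples of the intended behavior, while the free-form part lets the policy be rewarded for choosing the mode on its own.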

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. R1-VL/OpenVLThinker: DualMindVLM dynamically shortens responses for easy tasks, whereas R1-VL/OpenVLThinker tend to over-reason on all tasks
vs. Chain-of-Draft: DualMindVLM learns to switch modes automatically via RL on visual tasks, whereas Chain-of-Draft focuses on concise intermediate steps in language [not cited in paper as direct baseline, but related work]
vs. AdaCoT [not cited in paper]: AdaCoT uses RL to penalize length in LLMs; DualMindVLM uses explicit prefix-conditional RL for VLMs to switch modes

Limitations

Mode selection bias: The model may over-rely on fast thinking for chart-related tasks due to training data correlations, occasionally missing necessary reasoning steps
Data scale sensitivity: Increasing training data does not consistently improve performance on scientific/perceptual tasks (ScienceQA, AI2D, etc.) unlike math tasks
Dependence on base model capability: The auto-labeling relies on the base model's initial ability to solve problems, potentially limiting the upper bound of 'slow' reasoning quality

Reproducibility

Code and models are promised to be publicly available but no URL is currently provided. Hyperparameters are detailed. Dataset composition and sources are listed.

📊 Experiments & Results

Evaluation Setup

Multimodal benchmarks covering math, science, and general visual understanding

Benchmarks:

MathVista (Mathematical reasoning, Testmini split)
MathVision (Mathematical reasoning, Test split)
MMStar (General visual understanding)
MMBench (EN) (General visual understanding)
ScienceQA (Scientific QA)
AI2D (Scientific diagram understanding)
HumbleBench (Hallucination evaluation)

Metrics:

Accuracy
Average Token Length
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ

DualMindVLM consistently outperforms the base model Qwen2.5-VL while using fewer tokens.
MathVista	Accuracy	68.2	75.6	+7.4
MathVision	Accuracy	25.1	30.2	+5.1
ScienceQA	Accuracy	93.0	96.2	+3.2

DualMindVLM achieves state-of-the-art or competitive performance against reasoning-specialized VLMs.
MathVista	Accuracy	73.2	75.6	+2.4
MMStar	Accuracy	65.4	67.0	+1.6
MathVista	Token Length	759	184	-575

Ablation studies confirm the necessity of auto-labeling and thinking-mode labels.
MathVista	Accuracy	72.6	75.6	+3.0
MathVision	Accuracy	28.5	30.2	+1.7
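
The Δ column reports absolute differences. For token length it can help to convert the difference into a relative change, as in the short sketch below. It uses only the MathVista token-length row from the table (759 vs. 184); the helper name `relative_change` is illustrative.

```python
def relative_change(baseline: float, ours: float) -> float:
    """Relative change of `ours` with respect to `baseline` (negative = reduction)."""
    return (ours - baseline) / baseline


mathvista_tokens = relative_change(759, 184)
print(f"{mathvista_tokens:.1%}")  # prints -75.8%
```

This single-benchmark reduction is larger than the roughly 40% average reported across benchmarks, so the MathVista row should not be read as the typical saving.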

Experiment Figures

Comparison of token usage and reasoning between Base, GRPO (System 2 only), and DualMindVLM.

Cumulative accuracy vs. token budget on MMStar benchmark.

Main Takeaways

Effectiveness of Dual-Mode: Automatically switching between fast and slow thinking yields SOTA performance while significantly reducing token costs compared to pure System 2 models.
Efficiency: Reduces token usage by ~40% on average compared to reasoning baselines; specifically, fast thinking output remains below 50 tokens while slow thinking scales with difficulty.
Importance of Auto-Labeling: Without explicit thinking-mode labels derived from response length, RL training collapses into the easier/faster mode, degrading reasoning performance.
Hallucination Reduction: The dual-mode approach significantly outperforms competitors on HumbleBench, suggesting better grounding and fewer hallucinations.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF) concepts
Visual Language Models architecture (Qwen-VL)
Chain-of-Thought (CoT) prompting

Key Terms

System 1: Fast, automatic, and intuitive thinking (e.g., recognizing an object instantly)

System 2: Slow, deliberate, and analytical thinking (e.g., solving a math problem step-by-step)

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that updates a policy based on the relative performance of a group of generated outputs for the same input

VLM: Visual Language Model—an AI model capable of processing and understanding both image and text inputs

Rollout: A complete sequence of text generated by the model during the sampling phase of reinforcement learning

Thinking Mode Auto-Labeling: The process of categorizing training data as 'fast' or 'slow' based on the length of answers generated by the pre-trained model

Hybrid Group Response Sampling: A training strategy where half of the model's outputs are forced to use a specific thinking prefix (fast/slow) and the other half are generated freely, to help the model learn the association

KL penalty: Kullback-Leibler divergence penalty—a regularization term used in RL to prevent the trained model from deviating too drastically from the reference model

Hallucination: When a model generates plausible-sounding but factually incorrect or non-existent information