SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement

📝 Paper Summary

Visual Reasoning · Vision-Language Models (VLMs) · Reinforcement Fine-Tuning (RFT)

ThinkLite-VL achieves state-of-the-art visual reasoning by selecting a small subset of high-difficulty training samples using MCTS iteration counts as a proxy for hardness, then applying reinforcement fine-tuning without distillation.

Core Problem

Current VLM reasoning improvements rely on cumbersome pipelines involving knowledge distillation and large datasets, often failing to leverage self-improvement effectively due to poor sample selection.

Why it matters:

Distillation pipelines are computationally expensive and limit models to the teacher's capacity.
Existing RFT methods for VLMs struggle because they train on samples that are either too easy (trivial) or too hard (unsolvable), leading to inefficient learning.
Reliably quantifying 'sample difficulty' for multimodal tasks is non-trivial and has not been addressed in a scalable way.

Concrete Example: A VLM might easily solve a simple chart question but fail repeatedly on a complex geometry proof. Training on the easy chart wastes compute, while training on the geometry proof without intermediate feedback fails. ThinkLite-VL identifies the geometry proof as 'challenging but solvable' (high MCTS iterations) and prioritizes it for RFT.

Key Novelty

ThinkLite-VL (MCTS-Guided Sample Selection for RFT)

Repurposes Monte Carlo Tree Search (MCTS) from an inference tool to a training data filter, using the number of reasoning iterations required to solve a problem as a 'difficulty score'.
Demonstrates that Reinforcement Fine-Tuning (RFT) on a small, curated subset of 'appropriately challenging' samples (high iteration count) is far more effective than training on larger, random datasets.
Eliminates the need for Supervised Fine-Tuning (SFT) or knowledge distillation, achieving self-improvement purely through difficulty-aware RFT.

Architecture

Conceptual pipeline: Data Pool -> MCTS Difficulty Evaluation -> Filtered High-Difficulty Subset -> RFT Training -> ThinkLite-VL.

Evaluation Highlights

ThinkLite-VL-7B achieves 75.1% on MathVista, a new SoTA for 7B models, surpassing GPT-4o (63.8%) and Qwen2.5-VL-72B (71.9%) on this benchmark.
ThinkLite-VL-72B achieves 79.7% on MathVista, improving 4.42 points on average over the open-source SoTA.
Achieves these results using only 11k samples for the 7B model and 7.5k for the 72B model, an order of magnitude less data than typical instruction tuning sets.

Breakthrough Assessment

9/10

Achieves SoTA on major benchmarks (MathVista) with significantly less data and no distillation, challenging the prevailing paradigm that VLMs require massive SFT or teacher distillation for reasoning.

⚙️ Technical Details

Problem Definition

Setting: Multimodal reasoning tasks requiring sequential thought generation (Chain-of-Thought)

Inputs: Image I and text prompt x

Outputs: Reasoning chain followed by final answer

Pipeline Flow

Data Collection (70k samples from math, charts, science)
MCTS Difficulty Estimation (Run MCTS with base model to find iterations K needed to solve)
Filtering (Select samples where K > 5, plus samples still unsolved after 50 iterations)
Reinforcement Fine-Tuning (Train base model on filtered subset using GRPO)
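The filtering step can be sketched in a few lines. This is only an illustration of the rule stated under Training Data (keep a sample if K > 5, or if it is still unsolved after 50 iterations), not the authors' code; the names `keep_for_rft` and `select_subset` and the sample ids are hypothetical.

```python
from typing import Dict, List, Optional

MIN_ITERS = 5  # samples solved within this many iterations count as easy


def keep_for_rft(iters_to_solve: Optional[int]) -> bool:
    """Return True if a sample belongs in the RFT subset.

    iters_to_solve is the MCTS iteration count K at which the base model
    first reached a correct answer, or None if it never did within the
    50-iteration search budget.
    """
    if iters_to_solve is None:
        return True  # unsolved within the budget: treated as hard
    return iters_to_solve > MIN_ITERS


def select_subset(difficulty: Dict[str, Optional[int]]) -> List[str]:
    """Return the ids of samples that pass the difficulty filter."""
    return [sid for sid, k in difficulty.items() if keep_for_rft(k)]


# Hypothetical ids and iteration counts, for illustration only.
pool = {"chart_001": 1, "geo_017": 12, "geo_042": None, "sci_008": 5}
print(select_subset(pool))  # ['geo_017', 'geo_042']
```

A sample solved in exactly 5 iterations is dropped here because the rule is a strict inequality.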

System Modules

Policy Model (Base VLM) (MCTS & Training)

Generates reasoning steps and final answers; serves as both the MCTS policy and the model being trained

Model or implementation: Qwen2.5-VL-7B-Instruct / Qwen2.5-VL-72B-Instruct

MCTS Engine

Explores reasoning paths to determine sample difficulty (iteration count K)

Model or implementation: Algorithmic search wrapper around Base VLM
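As a rough sketch of how an iteration count K could be read off a search, the loop below counts iterations until the first verified-correct answer. The callback `run_iteration` is a hypothetical stand-in for one full MCTS iteration (selection, expansion, simulation, verification) around the base VLM; the search internals are not specified in this summary and are not modeled here.

```python
from typing import Callable, Optional


def iterations_to_solve(
    run_iteration: Callable[[int], bool],
    max_iters: int = 50,
) -> Optional[int]:
    """Count search iterations until the first correct answer.

    run_iteration(i) is a hypothetical callback that performs the i-th
    search iteration and returns True if that iteration reached a
    verified-correct final answer.
    """
    for i in range(1, max_iters + 1):
        if run_iteration(i):
            return i  # difficulty score K
    return None  # unsolved within the budget


# Toy stand-ins for a real search: one problem is solved on the 12th
# iteration, the other is never solved.
print(iterations_to_solve(lambda i: i >= 12))  # 12
print(iterations_to_solve(lambda i: False))    # None
```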

Critic / Verifier (MCTS & Training)

Verifies correctness of generated answers against ground truth

Model or implementation: Rule-based comparison or LLM-based judge (Qwen2.5-VL-7B-Instruct)
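A rule-based comparison of this kind might look like the sketch below. The normalization rules (case folding, stripping whitespace and a trailing period, a small numeric tolerance) are assumptions chosen for illustration; the summary does not state the actual matching rules, and the LLM-based judge is not modeled.

```python
def normalize(answer: str) -> str:
    """Canonicalize an answer string before comparison (assumed rules)."""
    return answer.strip().lower().rstrip(".").replace(" ", "")


def is_correct(predicted: str, truth: str, tol: float = 1e-6) -> bool:
    """Compare a predicted final answer against the ground truth.

    Tries an exact match on normalized strings first, then falls back to
    a numeric comparison when both sides parse as floats.
    """
    p, t = normalize(predicted), normalize(truth)
    if p == t:
        return True
    try:
        return abs(float(p) - float(t)) <= tol
    except ValueError:
        return False  # at least one side is not a number


print(is_correct(" B. ", "b"))   # True
print(is_correct("3.50", "3.5"))  # True
print(is_correct("12", "13"))     # False
```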

Novel Architectural Elements

Use of MCTS iteration count as a training-data filter metric applied before RFT (novel application of search)
Pipeline completely bypasses SFT and distillation, relying solely on difficulty-filtered self-improvement via RFT

Modeling

Base Model: Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-72B-Instruct

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize expected reward of generated reasoning chains while staying close to the reference policy.

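Formally: $J_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( r_i A_i,\ \mathrm{clip}(r_i, 1-\epsilon, 1+\epsilon)\, A_i \right) - \beta\, D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}) \right]$, where $r_i$ is the probability ratio between the current and old policy for the $i$-th of $G$ sampled responses and $A_i$ is its group-normalized advantage.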
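The objective above can be made concrete with a small numeric sketch. It works at the level of whole responses for readability, whereas practical GRPO implementations apply the ratio per token. The clip range `epsilon=0.2` and the use of the population standard deviation are assumptions, not values from the paper; `beta=0.04` is the 7B setting listed under Key Hyperparameters.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], eps_std: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of G sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption
    return [(r - mean) / (std + eps_std) for r in rewards]


def clipped_term(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one response."""
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)


def grpo_surrogate(
    ratios: List[float],
    advantages: List[float],
    kl: float,
    beta: float = 0.04,
    epsilon: float = 0.2,
) -> float:
    """Group-averaged clipped term minus the KL penalty, for one prompt."""
    group_size = len(ratios)
    policy_term = sum(
        clipped_term(r, a, epsilon) for r, a in zip(ratios, advantages)
    ) / group_size
    return policy_term - beta * kl


# Two correct and two incorrect responses with binary rewards.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
value = grpo_surrogate([1.1, 0.9, 1.0, 1.3], advantages, kl=0.02)
```

With binary rewards, a group in which every response is correct (or every response is wrong) yields all-zero advantages and therefore no policy gradient signal, which is one way to see why samples that are too easy or too hard contribute little.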

Adaptation: Full fine-tuning (implied by RFT description)

Trainable Parameters: Full model weights (implied)

Training Data:

Initial pool: 70k samples from Geometry3K, GeoQA, Geos, FigureQA, ScienceQA, OK-VQA, IconQA, TabMWP
Filtered subset (7B): 11k samples (K > 5 or unsolved after 50 iters)
Filtered subset (72B): 7.5k samples
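To put the subset sizes in proportion, the short calculation below computes the share of the initial pool that survives filtering. The inputs are the rounded figures quoted above (70k, 11k, 7.5k), so the percentages are approximate.

```python
POOL_SIZE = 70_000
SUBSET_SIZES = {"7B": 11_000, "72B": 7_500}


def retained_fraction(subset_size: int, pool_size: int = POOL_SIZE) -> float:
    """Share of the initial pool kept after difficulty filtering."""
    return subset_size / pool_size


for name, size in SUBSET_SIZES.items():
    print(f"{name}: {retained_fraction(size):.1%} of the pool")
# 7B: 15.7% of the pool
# 72B: 10.7% of the pool
```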

Key Hyperparameters:

kl_coefficient_beta: 0.04 (7B) / 0.01 (72B)
learning_rate: 1e-6 (7B) / 5e-7 (72B)
batch_size: 512 (global batch size)
group_size_G: 8
learning_rate_scheduler: cosine (10% warm-up)
max_prompt_length: 2048
max_completion_length: 2048
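The settings above could be collected in a small config object, with the learning-rate schedule written out explicitly. This is a sketch: the `RFTConfig` class and `lr_at` function are hypothetical names, and the exact schedule shape (linear warm-up, then cosine decay to zero) is an assumption, since the summary only says "cosine (10% warm-up)".

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RFTConfig:
    kl_coefficient_beta: float
    learning_rate: float
    batch_size: int = 512            # global batch size
    group_size_G: int = 8            # responses sampled per prompt
    warmup_fraction: float = 0.10    # 10% warm-up
    max_prompt_length: int = 2048
    max_completion_length: int = 2048


CONFIG_7B = RFTConfig(kl_coefficient_beta=0.04, learning_rate=1e-6)
CONFIG_72B = RFTConfig(kl_coefficient_beta=0.01, learning_rate=5e-7)


def lr_at(step: int, total_steps: int, cfg: RFTConfig) -> float:
    """Learning rate at a 0-indexed step (assumed schedule shape)."""
    warmup_steps = int(cfg.warmup_fraction * total_steps)
    if step < warmup_steps:
        # Linear ramp that reaches the peak on the last warm-up step.
        return cfg.learning_rate * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return cfg.learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For a 100-step run with the 7B settings, `lr_at(0, 100, CONFIG_7B)` is one tenth of the peak rate, and the peak of 1e-6 is reached at the end of warm-up.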

Compute: Not reported in the paper

Comparison to Prior Work

vs. Qwen2.5-VL: ThinkLite-VL applies RFT on a difficulty-curated subset, significantly boosting reasoning.
vs. RLEF-V/MM-Eureka: ThinkLite-VL uses MCTS iterations for filtering rather than zero-shot accuracy, capturing fine-grained difficulty.
vs. SFT+RL pipelines (e.g., DeepSeek-VL-RL): ThinkLite-VL skips SFT entirely, showing RFT alone is sufficient if data is selected correctly.

Limitations

Relies on ground truth answers for the reward signal, limiting applicability to open-ended tasks without clear correct answers.
Computational cost of the MCTS filtering step (running search on 70k samples) is likely high, though not explicitly quantified.
Performance gains are heavily dependent on the quality of the base model's initial reasoning capability (needs to be good enough to self-correct).

Reproducibility

Code: https://github.com/si0wang/ThinkLite-VL

Code and models are publicly available. MCTS prompts and Critic prompts are provided in Appendix A. Training hyperparameters are explicitly listed. Dataset sources are standard open-source benchmarks.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on diverse visual reasoning benchmarks.

Benchmarks:

MathVista (Mathematical reasoning in visual contexts)
MathVerse (Geometric and mathematical reasoning)
ScienceQA (Multimodal science questions)
AI2D (Science diagram understanding)
HallusionBench (Visual hallucination detection)
MMBench (General multimodal understanding)
MMStar (Vision-indispensable multimodal evaluation)
MMVet (Integrated multimodal capabilities)

Metrics:

Accuracy (%)
Score (0-100)
Statistical methodology: Not explicitly reported in the paper

Key Results

ThinkLite-VL models significantly outperform their base models and other open-source/proprietary baselines on the challenging MathVista benchmark.

Benchmark	Metric	Baseline	This Paper	Δ
MathVista (ThinkLite-VL-7B)	Accuracy (%)	70.2	75.1	+4.9
MathVista (ThinkLite-VL-72B)	Accuracy (%)	71.9	79.7	+7.8

ThinkLite-VL-7B improves or matches the baseline across a range of general visual reasoning benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
MathVerse	Score	57.8	69.1	+11.3
ScienceQA	Accuracy (%)	95.5	95.5	0.0
MMBench	Accuracy (%)	82.3	83.6	+1.3

Ablation studies confirm the effectiveness of MCTS-based selection over random selection.

Benchmark	Metric	Baseline	This Paper	Δ
Average (8 benchmarks)	Average Score	60.89	64.18	+3.29

Experiment Figures

Radar chart comparing ThinkLite-VL-7B against GPT-4o, Qwen2.5-VL-72B, and other baselines on MathVista and other benchmarks.

Histogram of sample difficulty (MCTS iteration count) for the 7B model.

Main Takeaways

Difficulty matters: Training on samples identified as 'hard' (high MCTS iterations) yields significantly better results than random sampling or using the full dataset.
SFT is not mandatory: High-performance reasoning can be unlocked via RFT alone if the data is high-quality and appropriately challenging.
The method scales: Improvements are observed in both 7B and 72B parameter regimes.
Smaller, curated data is more efficient: 11k samples outperformed larger random subsets, validating the 'less is more' hypothesis for RFT.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) concepts (policy optimization, rewards)
Vision-Language Models (VLMs)
Monte Carlo Tree Search (MCTS)

Key Terms

RFT: Reinforcement Fine-Tuning—training a model using reinforcement learning (like PPO or GRPO) to maximize a reward signal, typically after initial pre-training

MCTS: Monte Carlo Tree Search—a search algorithm that navigates a decision tree by simulating future outcomes, used here to estimate how hard a problem is

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that normalizes rewards within a group of sampled outputs to reduce variance without a separate value network

SFT: Supervised Fine-Tuning—training on labeled examples (input, target) using standard cross-entropy loss

CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer

VLM: Vision-Language Model—a model capable of processing both image and text inputs