Reinforced Visual Perception with Tools

📝 Paper Summary

Visual Reasoning | Tool-Augmented Multimodal LLMs | Reinforcement Learning for Reasoning

ReVPT improves the visual perception of multimodal language models by training them to reason with and use visual tools (like depth estimation and object detection) using Group Relative Policy Optimization (GRPO).

Core Problem

Supervised fine-tuning (SFT) for visual tool use is limited because it relies on expensive, pre-defined trajectories that don't incentivize the model to explore alternative tools or adapt to new visual environments.

Why it matters:

Visual reasoning requires complex perception (depth, edges) that standard VLM embeddings often miss
SFT models struggle to generalize because they memorize fixed tool sequences rather than learning the logic of *when* to use a tool
Existing tool-use approaches rely heavily on expensive GPT-4-generated traces that require aggressive filtering

Concrete Example: When asked to identify an object's distance, a standard VLM might guess based on 2D features. An SFT-trained tool model might call a depth tool but fail to interpret the color-coded map correctly if the map differs from its training data. ReVPT allows the model to explore different interpretations of the tool output during training to find the correct answer.

Key Novelty

Reinforced Visual Perception with Tools (ReVPT)

Replaces static SFT trajectories with a reinforcement learning phase (GRPO) where the model explores different tool combinations to solve visual queries
Uses a 'Cold Start' phase with synthetic data to teach basic tool syntax, followed by RL optimization driven by simple binary rewards (correctness and format)
Integrates a suite of four perceptual tools (detection, zoom, edge, depth), with each tool call treated as a reasoning step within the generation process

Architecture

The overall architecture of the ReVPT framework.

Evaluation Highlights

ReVPT-7B outperforms the base Qwen2.5-VL-7B-Instruct by 9.82 percentage points on the perception-heavy CV-Bench benchmark (66.25 to 76.07)
Outperforms the commercial models GPT-4.1 and Gemini-2.0-Flash on the challenging BLINK-Depth and BLINK-Relation subsets
ReVPT-3B improves on its instruct baseline by 8.65 percentage points on CV-Bench (60.65 to 69.30), showing that the gains hold across model sizes

Breakthrough Assessment

8/10

Strong application of recent RL reasoning advances (GRPO) to the visual tool-use domain. Demonstrates significant gains over SFT baselines, addressing a key bottleneck in multimodal agent reliability.

⚙️ Technical Details

Problem Definition

Setting: Visual Question Answering (VQA) where the agent can invoke external vision tools before generating the final answer

Inputs: Image I_in and textual query q

Outputs: Reasoning trace containing tool calls, observations, and final answer o

Pipeline Flow

Multimodal Query Processing (Qwen2.5-VL)
Reasoning & Tool Selection (Policy)
Tool Execution (External Vision Models)
Observation Integration & Final Answer Generation
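
The four pipeline stages above can be read as a loop in which the policy alternates between generating a reasoning step and receiving a tool observation. The following is a minimal sketch of that loop, not the authors' implementation: the `<tool>` and `<answer>` tag syntax, the `run_episode` function, and the toy policy and tool are all hypothetical names introduced for illustration, since this summary does not specify the actual call format.

```python
import re
from typing import Callable, Dict, List

# Hypothetical tool-call syntax; the real ReVPT format is not given in this summary.
TOOL_CALL = re.compile(r"<tool>(\w+)</tool>")


def run_episode(policy: Callable[[List[str]], str],
                tools: Dict[str, Callable[[str], str]],
                image: str, query: str, max_turns: int = 4) -> List[str]:
    """Alternate policy generation and tool execution until the policy stops calling tools."""
    context = [f"image: {image}", f"query: {query}"]
    for _ in range(max_turns):
        step = policy(context)          # reasoning and tool selection
        context.append(step)
        match = TOOL_CALL.search(step)
        if match is None:
            break                       # no tool call: treat the step as the final answer
        name = match.group(1)
        tool = tools.get(name)
        observation = tool(image) if tool else f"unknown tool: {name}"
        context.append(f"observation: {observation}")  # observation integration
    return context


# Toy stand-ins for the policy model and a depth tool (assumptions for the demo).
def toy_policy(context: List[str]) -> str:
    if not any(line.startswith("observation:") for line in context):
        return "I need depth information. <tool>depth</tool>"
    return "<answer>the chair</answer>"


toy_tools = {"depth": lambda image: "depth map: chair is nearest"}

trace = run_episode(toy_policy, toy_tools, "room.jpg", "Which object is closest?")
for line in trace:
    print(line)
```

In a real system the policy would be the Qwen2.5-VL model and the tools would wrap external vision models; the `max_turns` cap is an assumed safeguard against unbounded tool calling.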

System Modules

Reasoning Agent

Analyzes query, decides to call tools or answer, and generates reasoning chain

Model or implementation: Qwen2.5-VL-3B-Instruct or Qwen2.5-VL-7B-Instruct

Visual Tools Suite

Executes specific visual tasks to augment model perception

Model or implementation: Various (GroundingDINO, DepthAnythingV2, etc.)

Reward Function

Evaluates generated responses during RL training

Model or implementation: Rule-based
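
A rule-based reward with binary correctness and format terms could look roughly like the sketch below. It is an illustration under stated assumptions: the `<answer>` tag, the exact-match comparison, and the unweighted sum of the two terms are hypothetical choices, not details confirmed by this summary.

```python
import re

# Hypothetical answer tag; the summary only says the format is checked by rules.
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(response: str) -> float:
    """Return 1.0 if the response contains exactly one answer block, else 0.0."""
    return 1.0 if len(ANSWER.findall(response)) == 1 else 0.0


def correctness_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the single extracted answer matches the ground truth, else 0.0."""
    matches = ANSWER.findall(response)
    if len(matches) != 1:
        return 0.0
    return 1.0 if matches[0].strip().lower() == ground_truth.strip().lower() else 0.0


def total_reward(response: str, ground_truth: str) -> float:
    """Sum the two binary terms (equal weighting is an assumption)."""
    return correctness_reward(response, ground_truth) + format_reward(response)


print(total_reward("The lamp is farther. <answer>B</answer>", "b"))
print(total_reward("B", "B"))
```

Because both terms depend only on string rules and a ground-truth label, no learned reward model is needed, which also explains the limitation to tasks with verifiable answers.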

Modeling

Base Model: Qwen2.5-VL-3B-Instruct and Qwen2.5-VL-7B-Instruct

Training Method: Group Relative Policy Optimization (GRPO)
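
GRPO scores each sampled response against the other responses drawn for the same prompt, which removes the need for a separate value network. A minimal sketch of that group-relative normalization follows; the function name, the use of the population standard deviation, and the small `eps` stabilizer are assumptions made for illustration (implementations differ on these details).

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four responses sampled for the same (image, query) pair.
advantages = group_relative_advantages([2.0, 1.0, 0.0, 1.0])
print(advantages)
```

Responses that beat the group average receive a positive advantage and are reinforced; when every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.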

Objective Functions:

Purpose: Optimize policy based on relative advantage of group samples.

Formally: Maximizing clipped surrogate objective sum over group samples, minus KL divergence penalty.
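
For reference, the verbal description above corresponds to the commonly cited form of the GRPO objective, written here at the sequence level for readability. This is a sketch of the standard formulation, not necessarily the exact variant used in the paper: the symbols G (group size), ε (clip range), β (KL weight), R_i (reward of response o_i), and π_ref (reference policy) are notation introduced here, and the per-token averaging used in practice is omitted.

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}_{(I_{in},\,q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}}
\left[
  \frac{1}{G} \sum_{i=1}^{G}
  \min\!\Big( \rho_i \hat{A}_i,\
  \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i \Big)
\right]
- \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right),
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid I_{in}, q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid I_{in}, q)},
\quad
\hat{A}_i = \frac{R_i - \operatorname{mean}(R_1, \dots, R_G)}{\operatorname{std}(R_1, \dots, R_G)}
```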

Training Data:

Cold-start: 1.5k samples synthesized by GPT-4.1 (filtered for correctness)
RL Training: 20k questions from SAT and TACO datasets (selected where base model failed)
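
Selecting RL questions "where the base model failed" amounts to a simple filter over a labeled pool. The sketch below illustrates the idea only: the record fields (`image`, `question`, `answer`), the exact-match comparison, and the `toy_base_model` stand-in are hypothetical, and the paper's actual selection procedure may differ.

```python
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase and trim so trivial formatting differences do not count as failures."""
    return text.strip().lower()


def select_failed_questions(samples: List[Dict[str, str]],
                            base_model: Callable[[str, str], str]) -> List[Dict[str, str]]:
    """Keep only the samples that the base model answers incorrectly."""
    return [
        s for s in samples
        if normalize(base_model(s["image"], s["question"])) != normalize(s["answer"])
    ]


samples = [
    {"image": "a.jpg", "question": "Which is closer, the cup or the lamp?", "answer": "cup"},
    {"image": "b.jpg", "question": "How many chairs are there?", "answer": "3"},
]


def toy_base_model(image: str, question: str) -> str:
    # Stand-in for a real model call: right on a.jpg, wrong on b.jpg.
    return "Cup" if image == "a.jpg" else "2"


hard = select_failed_questions(samples, toy_base_model)
print([s["image"] for s in hard])
```

Filtering this way concentrates the RL budget on questions where there is room to improve.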

Key Hyperparameters:

learning_rate: 1e-5 (Cold Start)
batch_size: 64 (Cold Start)
rl_steps: 200
group_size: Not explicitly reported in the paper (GRPO by definition samples a group of responses per prompt)

Compute: 8x NVIDIA A800 GPUs

Comparison to Prior Work

vs. TACO: ReVPT uses RL (GRPO) for dynamic exploration instead of SFT alone, and focuses on 4 core perception tools rather than 15 general tools
vs. VisualSketchPad: ReVPT trains open-weights models via RL rather than prompting proprietary models
vs. Qwen-SAT-SFT: ReVPT uses RL to explore tool usage, showing better generalization than SFT on the same data

Limitations

Visual tools can sometimes hinder performance if they produce erroneous outputs (e.g., misclassification)
The model's general (non-visual) capability can degrade slightly due to specialized training on perception tasks
The approach relies on ground truth answers for reward calculation, limiting applicability to open-ended tasks without verifiable answers

Reproducibility

Code: https://github.com/ls-kelvin/REVPT

Code and datasets are publicly available at the repository above. Cold-start data generation uses GPT-4.1. Training relies on the LLaMA-Factory and Verl frameworks.

📊 Experiments & Results

Evaluation Setup

Multimodal evaluation across perception and reasoning benchmarks

Benchmarks:

CV-Bench (Visual Perception)
BLINK (Visual Perception, Hard)
MMVP (Visual Perception)
MMStar (General Multimodal)

Metrics:

Accuracy
Statistical methodology: Average of three runs reported

Key Results

ReVPT consistently outperforms the base instruct models, particularly on perception-heavy benchmarks like CV-Bench.

Benchmark	Metric	Baseline	This Paper	Δ
CV-Bench (ReVPT-7B vs. Qwen2.5-VL-7B-Instruct)	Accuracy	66.25	76.07	+9.82
CV-Bench (ReVPT-3B vs. Qwen2.5-VL-3B-Instruct)	Accuracy	60.65	69.30	+8.65
BLINK (Relation subset)	Accuracy	55.83	60.83	+5.00
MMVP	Accuracy	63.33	70.33	+7.00

Experiment Figures

Tool usage frequency analysis across different benchmarks for Cold-Start vs. RL phase models.

Main Takeaways

Reinforcement Learning (RL) significantly boosts visual tool usage compared to Supervised Fine-Tuning (SFT) or text-based RL alone.
The 'Cold Start' phase is essential; without it, models struggle to learn tool syntax effectively.
Object Detection is the most impactful tool in the suite; removing it causes significant performance drops.
Perception improves substantially, but at some cost to general capability; including general data (TACO) during training mitigates this trade-off.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (specifically Policy Gradients)
Multimodal Large Language Models (MLLMs)
Visual Perception tasks (Depth, Detection, Segmentation)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that evaluates a group of outputs for the same prompt and updates the policy based on their relative advantage, avoiding the need for a separate value network

Cold Start: An initial phase of Supervised Fine-Tuning (SFT) using high-quality synthetic data to teach the model the basic syntax and mechanics of tool usage before RL begins

ReVPT: Reinforced Visual Perception with Tools—the proposed framework combining Cold Start SFT and GRPO for visual agents

SFT: Supervised Fine-Tuning—training a model to mimic a specific dataset of examples

MLLM: Multimodal Large Language Model—an AI model capable of processing and generating both text and images

VQA: Visual Question Answering—the task of answering natural language questions about an image