Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

📝 Paper Summary

Active Perception / Active Vision
Reinforcement Learning for MLLMs

Active-o3 uses Group Relative Policy Optimization (GRPO) to train multimodal LLMs to crop and zoom into images, turning them from passive observers into agents that actively search for visual information.

Core Problem

Current MLLMs are passive consumers of static global images, leading to poor performance on small, dense, or ambiguous objects because they cannot actively adjust their view to gather more detail.

Why it matters:

Passive static views limit the resolution and information available for fine-grained tasks like reading text on distant traffic lights
Existing zoom-in heuristics (as in GPT-o3) suffer from inefficient region proposals and inaccurate localization
Embodied agents require active information-seeking behaviors to succeed in complex, cluttered real-world environments

Concrete Example: In a zero-shot reasoning scenario on the V* benchmark, a standard model (Qwen2.5-VL) fails to identify the number on a traffic light because the object is too small in the global view. In contrast, Active-o3 actively zooms in on the relevant region to correctly read the number.

Key Novelty

Two-Stage Active Perception Policy via GRPO

Decomposes the agent into a Sensing Model (decides where to look/crop) and a Task Model (executes the task/answers questions)
Applies Group Relative Policy Optimization (GRPO) to train the sensing behaviors, using rewards that combine task success with heuristic constraints (e.g., format validity)
Reformulates active perception on static images as a sequential decision process where the agent updates its 'sensor state' (viewpoint/crop) to maximize information gain

Architecture

Overview of the Active-o3 framework, showing the interaction between the Sensing Model and the Task Model within an environment loop.

Evaluation Highlights

Demonstrates zero-shot success on the V* benchmark (visual search) where baseline Qwen2.5-VL fails (qualitative result from Figure 1)
Significantly enhances active perception capabilities compared to Qwen2.5-VL-CoT across downstream tasks (claimed in abstract)
Consistently improves performance on small object detection and dense object grounding under fixed computational budgets

Breakthrough Assessment

8/10

First framework to explicitly train MLLMs for active perception using RL (GRPO), moving beyond passive processing or simple heuristics. Addresses a fundamental limitation of current static-view MLLMs.

⚙️ Technical Details

Problem Definition

Setting: Active perception over static 2D images modeled as a sequential decision process

Inputs: Instruction I, initial global observation o_init (low-resolution view of the input image)

Outputs: Sequence of sensing actions (crops) followed by a final task action (answer/detection)

Pipeline Flow

Initialization: Global low-res view + Instruction
Sensing Loop: Sensing Model selects region -> Environment crops region -> Update Observation
Execution: Task Model processes accumulated observations -> Outputs Final Answer
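
The flow above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `run_active_perception`, `sensing_model`, `sensor`, and `task_model` are hypothetical names, the models are plain callables, and boxes are assumed to be `(x0, y0, x1, y1)` pixel tuples.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # assumed box format: (x0, y0, x1, y1)
Image = List[List[int]]           # toy grayscale image as rows of pixel values


def run_active_perception(
    image: Image,
    instruction: str,
    sensing_model: Callable[[str, Sequence[Image]], Box],
    sensor: Callable[[Image, Box], Image],
    task_model: Callable[[str, Sequence[Image]], str],
    num_steps: int = 1,
) -> str:
    """Sensing loop followed by task execution (hypothetical interface)."""
    # Initialization: start from the global view (a real system would downsample it).
    observations: List[Image] = [image]
    for _ in range(num_steps):
        box = sensing_model(instruction, observations)  # Sensing Model selects a region
        observations.append(sensor(image, box))         # environment crops, observation is updated
    # Execution: the Task Model sees every accumulated observation.
    return task_model(instruction, observations)


# Toy stand-ins so the loop can be exercised end to end.
def toy_sensor(image: Image, box: Box) -> Image:
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


toy_image = [[0, 0, 0, 0], [0, 0, 7, 0], [0, 0, 0, 0]]
answer = run_active_perception(
    toy_image,
    "What is the non-zero value?",
    sensing_model=lambda instruction, obs: (2, 1, 3, 2),
    sensor=toy_sensor,
    task_model=lambda instruction, obs: str(max(max(row) for row in obs[-1])),
)
```

Passing the three components as callables mirrors the separation between sensing, environment, and task execution, so any one of them can be swapped without touching the loop.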

System Modules

Sensing Model (Perception Control)

Decides how to control perception parameters (e.g., defining a bounding box to zoom in on)

Model or implementation: MLLM (Active-o3, optimized via GRPO)

Environment / Sensor (Perception Control)

Executes the sensing action by cropping and resizing the image

Model or implementation: Deterministic function
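
A deterministic crop-and-resize step could look like the sketch below. It assumes a toy image stored as nested lists, `(x0, y0, x1, y1)` pixel boxes, and nearest-neighbor resampling; the function name `crop_and_resize` and the clamping behavior are illustrative choices, not details taken from the paper.

```python
from typing import List, Tuple

Image = List[List[int]]           # toy grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]   # assumed box format: (x0, y0, x1, y1)


def crop_and_resize(image: Image, box: Box, out_w: int, out_h: int) -> Image:
    """Crop `box` from `image`, then resize to out_w x out_h (nearest neighbor)."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    # Clamp the box to the image so a slightly out-of-range proposal still yields a view.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    if x1 <= x0 or y1 <= y0:
        raise ValueError("box has no overlap with the image")
    crop_w, crop_h = x1 - x0, y1 - y0
    # Map every output pixel back to its nearest source pixel inside the crop.
    return [
        [image[y0 + (j * crop_h) // out_h][x0 + (i * crop_w) // out_w] for i in range(out_w)]
        for j in range(out_h)
    ]


zoomed = crop_and_resize([[1, 2], [3, 4]], (0, 0, 2, 2), out_w=4, out_h=4)
```

Because the function has no randomness or learned parameters, the same box always produces the same observation, which keeps the environment side of the loop reproducible.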

Task Model

Performs the actual task (e.g., identification, answering) based on observations

Model or implementation: MLLM (Fixed/Frozen in 2D setting)

Novel Architectural Elements

Explicit separation of Sensing Model (policy for information gathering) and Task Model (policy for problem solving)
Closed-loop active vision pipeline applied to static images via dynamic cropping/resizing
Dual-form reward structure integrating task success and heuristic sensing constraints
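
One way to picture a reward that combines task success with a format-validity heuristic is sketched below. The exact reward terms and weights are not given in this summary, so everything here is an assumption: the sensing output is assumed to be a JSON list `[x0, y0, x1, y1]`, and `parse_box`, `dual_reward`, and `format_weight` are hypothetical names.

```python
import json
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]


def parse_box(response: str) -> Optional[Box]:
    """Parse a JSON list [x0, y0, x1, y1]; return None if the response is malformed."""
    try:
        values = json.loads(response)
    except json.JSONDecodeError:
        return None
    if not (isinstance(values, list) and len(values) == 4):
        return None
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return None
    x0, y0, x1, y1 = values
    if x0 >= x1 or y0 >= y1:          # degenerate or inverted box
        return None
    return (float(x0), float(y0), float(x1), float(y1))


def dual_reward(response: str, task_score: float, format_weight: float = 0.5) -> float:
    """Heuristic format reward plus task reward (assumed form, not the paper's)."""
    if parse_box(response) is None:
        return 0.0                     # invalid format: no reward at all
    return format_weight + task_score  # valid format bonus + downstream task success


valid = dual_reward("[10, 20, 50, 60]", task_score=1.0)
invalid = dual_reward("zoom in on the light", task_score=1.0)
```

Gating the task term on a valid format is a design choice made for this sketch: it prevents an unparseable response from collecting task reward by accident.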

Modeling

Base Model: Qwen2.5-VL (inferred from the baseline comparisons, specifically Qwen2.5-VL-CoT)

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

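Purpose: Maximize task success while minimizing sensing costs.

Formally: Maximize E[R(s_t, a_env) - lambda * C(a_cam)], where R is the task reward for the environment action a_env, C is the cost of the sensing action a_cam, and lambda weights the cost against the reward.

Purpose: Guide the model via GRPO without a critic.

Formally: Estimate each output's advantage by normalizing its reward with the mean and standard deviation of rewards across a group of sampled outputs.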
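
The two objectives can be illustrated numerically. The sketch below assumes scalar rewards per sampled output, a hypothetical cost weight `lam`, and the population standard deviation for normalization; the function names and default values are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def net_reward(task_reward: float, sensing_cost: float, lam: float = 0.1) -> float:
    """Reward minus weighted sensing cost: R - lambda * C (lam is an assumed value)."""
    return task_reward - lam * sensing_cost


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Critic-free advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)            # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled sensing outputs for the same prompt: two succeed, two fail.
group = [net_reward(1.0, 0.0), net_reward(0.0, 0.0), net_reward(0.0, 0.0), net_reward(1.0, 0.0)]
advantages = group_relative_advantages(group)
```

Outputs that beat the group average receive positive advantages and the rest receive negative ones. When every output in a group earns the same reward, all advantages are zero and that group contributes no learning signal.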

Compute: Not reported in the paper

Comparison to Prior Work

vs. GPT-o3: Active-o3 uses learned RL policies (GRPO) for region selection rather than heuristic zoom-in strategies, improving efficiency and accuracy.
vs. Qwen2.5-VL-CoT: Active-o3 modifies the visual input (via cropping and resizing) to gather information, whereas the baseline reasons over the static input at its original resolution.

Limitations

No statistical significance tests reported in the provided text.
2D static image setting simplifies the full embodied active perception problem (e.g., no motion parallax).
Requires high-resolution source images to benefit from the zoom-in mechanism.

Reproducibility

Code: https://github.com/aim-uofa/Active-o3

The code is publicly released at the repository above. The paper defines the problem formalization and reward structure conceptually.

📊 Experiments & Results

Evaluation Setup

Active perception on 2D static images where the agent crops regions to improve task performance

Benchmarks:

V* (V-Star) (Visual search and reasoning)
Open-world grounding (Small and dense object grounding)
Remote Sensing / Autonomous Driving (Domain-specific small object detection)

Metrics:

Task Success Rate
Search Efficiency
Localization Accuracy

Statistical methodology: Not explicitly reported in the paper
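
This summary does not state how localization accuracy is computed. A common choice for box localization is intersection over union (IoU), shown here purely as an assumed illustration with `(x0, y0, x1, y1)` boxes; the paper's actual metric may differ.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]   # assumed box format: (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)   # zero when boxes do not overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))   # intersection 1, union 7
```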

Experiment Figures

A comparison on the V* benchmark between Active-o3 and Qwen2.5-VL.

Main Takeaways

Active-o3 enables MLLMs to solve tasks involving small objects (e.g., traffic light numbers) that the baseline fails on with a static global view, where the object occupies too few pixels to be read.
The GRPO-based training effectively teaches the model 'where to look' without requiring explicit bounding box supervision, relying instead on task success rewards.
The framework generalizes across general open-world tasks and specific domains like remote sensing and autonomous driving.

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (MLLMs)
Reinforcement Learning (RL)
Active Vision / Active Perception

Key Terms

GRPO: Group Relative Policy Optimization, an RL algorithm that samples a group of outputs per prompt and estimates each output's advantage from its reward relative to the group's mean (normalized by the group's standard deviation), removing the need for a separate critic model

Active Perception: The process where an agent actively controls its sensors (e.g., moving a camera, zooming in) to gather information needed to solve a task

Sensing Model: The module responsible for selecting perception parameters, such as which region of an image to crop and inspect next

Task Model: The module responsible for processing observations to produce the final answer or interaction

V*: A benchmark dataset designed to test detailed visual search and reasoning capabilities, often involving small or hard-to-find objects

CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer