Active Perception / Active Vision
Reinforcement Learning for MLLMs
Active-o3 utilizes Group Relative Policy Optimization (GRPO) to train Multimodal LLMs to actively crop and zoom into images, transforming them from passive observers into agents that actively search for visual information.
Core Problem
Current MLLMs are passive consumers of static global images, leading to poor performance on small, dense, or ambiguous objects because they cannot actively adjust their view to gather more detail.
Why it matters:
Passive static views limit the resolution and information available for fine-grained tasks like reading text on distant traffic lights
Existing 'zoom-in' heuristics (like in GPT-o3) suffer from inefficient region proposals and inaccurate localization
Embodied agents require active information-seeking behaviors to succeed in complex, cluttered real-world environments
Concrete Example: In a zero-shot reasoning scenario on the V* benchmark, a standard model (Qwen2.5-VL) fails to identify the number on a traffic light because the object is too small in the global view. In contrast, Active-o3 zooms in on the relevant region and reads the number correctly.
Key Novelty
Two-Stage Active Perception Policy via GRPO
Decomposes the agent into a Sensing Model (decides where to look/crop) and a Task Model (executes the task/answers questions)
Applies Group Relative Policy Optimization (GRPO) to train the sensing behaviors, using rewards that combine task success with heuristic constraints (e.g., format validity)
Reformulates active perception on static images as a sequential decision process where the agent updates its 'sensor state' (viewpoint/crop) to maximize information gain
Architecture
The Active-o3 framework illustrating the interaction between the Sensing Model and the Task Model within an environment loop.
Evaluation Highlights
Demonstrates zero-shot success on the V* benchmark (visual search) where baseline Qwen2.5-VL fails (qualitative result from Figure 1)
Significantly enhances active perception capabilities compared to Qwen-VL2.5-CoT across downstream tasks (claimed in abstract)
Consistently improves performance on small object detection and dense object grounding under fixed computational budgets
Breakthrough Assessment
8/10
First framework to explicitly train MLLMs for active perception using RL (GRPO), moving beyond passive processing or simple heuristics. Addresses a fundamental limitation of current static-view MLLMs.
⚙️ Technical Details
Problem Definition
Setting: Active perception over static 2D images modeled as a sequential decision process
Inputs: Instruction I, initial global observation o_init (a low-resolution view of the input image)
Outputs: Sequence of sensing actions (crops) followed by a final task action (answer/detection)
Pipeline Flow
Initialization: Global low-res view + Instruction
Sensing Loop: Sensing Model selects region -> Environment crops region -> Update Observation
Execution: Task Model processes accumulated observations -> Outputs Final Answer
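The three stages above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_active_perception`, `sensing_model`, `task_model`, `crop_fn`, `global_view_fn`, and `max_steps` are hypothetical names, and both models are passed in as plain callables.

```python
from typing import Any, Callable, List, Tuple

# Assumed box convention: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def run_active_perception(
    image: Any,
    instruction: str,
    sensing_model: Callable[[str, List[Any]], Box],
    task_model: Callable[[str, List[Any]], str],
    crop_fn: Callable[[Any, Box], Any],
    global_view_fn: Callable[[Any], Any],
    max_steps: int = 3,
) -> str:
    """Hypothetical sketch of Initialization -> Sensing Loop -> Execution."""
    # Initialization: start from a global low-resolution view of the image.
    observations = [global_view_fn(image)]

    # Sensing loop: the sensing model selects a region, the environment crops
    # it, and the new view is appended to the observation history.
    for _ in range(max_steps):
        box = sensing_model(instruction, observations)
        observations.append(crop_fn(image, box))

    # Execution: the task model answers from all accumulated observations.
    return task_model(instruction, observations)
```

A fixed `max_steps` is one simple way to express a sensing budget; a learned stopping decision would be an alternative design.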
System Modules
Sensing Model (Perception Control)
Decides how to control perception parameters (e.g., defining a bounding box to zoom in on)
Model or implementation: MLLM (Active-o3, optimized via GRPO)
Environment / Sensor (Perception Control)
Executes the sensing action by cropping and resizing the image
Model or implementation: Deterministic function
Task Model
Performs the actual task (e.g., identification, answering) based on observations
Model or implementation: MLLM (Fixed/Frozen in 2D setting)
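The Environment / Sensor module is described as a deterministic crop-and-resize function. The sketch below shows one way such a function could look on a toy grayscale image stored as nested lists. The name `crop_and_resize`, the box convention, the clamping behavior, and the nearest-neighbor resampling are all assumptions made for illustration; the summary does not specify them.

```python
from typing import List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def crop_and_resize(image: Image, box: Box, out_w: int, out_h: int) -> Image:
    """Crop `box` from `image` and resize it to out_w x out_h (nearest neighbor)."""
    height, width = len(image), len(image[0])
    # Clamp the box to the image bounds so any proposed box is safe to apply.
    left = max(0, min(box[0], width - 1))
    top = max(0, min(box[1], height - 1))
    right = max(left + 1, min(box[2], width))
    bottom = max(top + 1, min(box[3], height))
    crop_w, crop_h = right - left, bottom - top

    # Nearest-neighbor sampling: map each output pixel back to a source pixel.
    return [
        [
            image[top + (y * crop_h) // out_h][left + (x * crop_w) // out_w]
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]
```

Because the function has no randomness or learned parameters, the same sensing action on the same image always yields the same observation, which keeps all of the learning in the Sensing Model.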
Novel Architectural Elements
Explicit separation of Sensing Model (policy for information gathering) and Task Model (policy for problem solving)
Closed-loop active vision pipeline applied to static images via dynamic cropping/resizing
Dual-form reward structure integrating task success and heuristic sensing constraints
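To make the dual-form reward concrete, the sketch below combines a task-success term with a format-validity check on the sensing output. Everything specific here is an assumption for illustration: the `<box>...</box>` output format, the function names `parse_box` and `combined_reward`, and the weights `w_task` and `w_format` are hypothetical and not taken from the paper.

```python
import re
from typing import Optional, Tuple

# Assumed output format for a sensing action, e.g. "<box>10,20,110,220</box>".
_BOX_PATTERN = re.compile(
    r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>"
)


def parse_box(response: str) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) if the response has a well-formed box."""
    match = _BOX_PATTERN.search(response)
    if match is None:
        return None
    left, top, right, bottom = (int(g) for g in match.groups())
    # A box with non-positive width or height is treated as invalid.
    if right <= left or bottom <= top:
        return None
    return left, top, right, bottom


def combined_reward(
    response: str,
    task_success: float,
    w_task: float = 1.0,    # hypothetical weight
    w_format: float = 0.5,  # hypothetical weight
) -> float:
    """Weighted sum of a task-success term and a format-validity term."""
    format_ok = 1.0 if parse_box(response) is not None else 0.0
    # The task term only counts when the action could actually be executed.
    return w_task * task_success * format_ok + w_format * format_ok
```

Gating the task term on format validity is one possible design choice; an additive reward without gating would be equally consistent with the description above.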
Modeling
Base Model: Qwen2.5-VL (Inferred from baseline comparisons, specifically Qwen-VL2.5-CoT)
Training Method: Group Relative Policy Optimization (GRPO)
Objective Functions:
Purpose: Maximize task success while minimizing sensing costs.
Purpose: Guide the model via GRPO without a critic.
Formally: Estimate the advantage of each sampled output by subtracting the mean reward of its group and dividing by the group's standard deviation.
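The group-relative advantage estimate can be illustrated in a few lines. GRPO scores each sampled output against the other outputs drawn for the same prompt, normalizing its reward by the group's mean and standard deviation, so no critic network is needed. The function name `group_relative_advantages`, the `eps` stabilizer, and the use of the population standard deviation are assumptions of this sketch; implementations differ on those details.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group (assumed choice)
    # eps avoids division by zero when every output in the group scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When all outputs in a group receive the same reward, every advantage is zero and that group contributes no learning signal, which is why reward designs that separate good from bad samples matter for this method.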
Compute: Not reported in the paper
Comparison to Prior Work
vs. GPT-o3: Active-o3 uses learned RL policies (GRPO) for region selection rather than heuristic zoom-in strategies, improving efficiency and accuracy.
vs. Qwen-VL2.5-CoT: Active-o3 actively modifies the visual input (via cropping) to gather information, whereas Qwen-VL2.5-CoT reasons over the static input at its original resolution.
Limitations
No statistical significance tests reported in the provided text.
2D static image setting simplifies the full embodied active perception problem (e.g., no motion parallax).
Requires high-resolution source images to benefit from the zoom-in mechanism.
Code is publicly released at https://github.com/aim-uofa/Active-o3. The paper defines the problem formalization and reward structure conceptually.
📊 Experiments & Results
Evaluation Setup
Active perception on 2D static images where the agent crops regions to improve task performance
Benchmarks:
V* (V-Star) (Visual search and reasoning)
Open-world grounding (Small and dense object grounding)
Remote Sensing / Autonomous Driving (Domain-specific small object detection)
Metrics:
Task Success Rate
Search Efficiency
Localization Accuracy
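The summary does not say how Localization Accuracy is computed. A common way to score a predicted box against a reference box is Intersection-over-Union (IoU), sketched below purely as an assumption about what such a metric could look like; the function name `iou` and the box convention are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    # Overlap along each axis, clipped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```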
Statistical methodology: Not explicitly reported in the paper
Experiment Figures
A comparison on the V* benchmark between Active-o3 and Qwen2.5-VL.
Main Takeaways
Active-o3 enables MLLMs to solve tasks involving small objects (e.g., traffic light numbers) that a static global view cannot resolve because of its limited resolution.
The GRPO-based training effectively teaches the model 'where to look' without requiring explicit bounding box supervision, relying instead on task success rewards.
The framework generalizes across general open-world tasks and specific domains like remote sensing and autonomous driving.
📚 Prerequisite Knowledge
Prerequisites
Multimodal Large Language Models (MLLMs)
Reinforcement Learning (RL)
Active Vision / Active Perception
Key Terms
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs against their average reward, removing the need for a separate critic model
Active Perception: The process where an agent actively controls its sensors (e.g., moving a camera, zooming in) to gather information needed to solve a task
Sensing Model: The module responsible for selecting perception parameters, such as which region of an image to crop and inspect next
Task Model: The module responsible for processing observations to produce the final answer or interaction
V*: A benchmark dataset designed to test detailed visual search and reasoning capabilities, often involving small or hard-to-find objects
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer