← Back to Paper List

Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, Chunhua Shen
Zhejiang University, China, Ant Group, China
arXiv (2025)
MM RL Agent Benchmark

📝 Paper Summary

Active Perception / Active Vision Reinforcement Learning for MLLMs
Active-o3 utilizes Group Relative Policy Optimization (GRPO) to train Multimodal LLMs to actively crop and zoom into images, transforming them from passive observers into agents that actively search for visual information.
Core Problem
Current MLLMs are passive consumers of static global images, leading to poor performance on small, dense, or ambiguous objects because they cannot actively adjust their view to gather more detail.
Why it matters:
  • Passive static views limit the resolution and information available for fine-grained tasks like reading text on distant traffic lights
  • Existing 'zoom-in' heuristics (like in GPT-o3) suffer from inefficient region proposals and inaccurate localization
  • Embodied agents require active information-seeking behaviors to succeed in complex, cluttered real-world environments
Concrete Example: In a zero-shot reasoning scenario on the V* benchmark, a standard model (Qwen2.5-VL) fails to identify the number on a traffic light because the object is too small in the global view. In contrast, Active-o3 actively zooms in on the relevant region to correctly read the number.
Key Novelty
Two-Stage Active Perception Policy via GRPO
  • Decomposes the agent into a Sensing Model (decides where to look/crop) and a Task Model (executes the task/answers questions)
  • Applies Group Relative Policy Optimization (GRPO) to train the sensing behaviors, using rewards that combine task success with heuristic constraints (e.g., format validity)
  • Reformulates active perception on static images as a sequential decision process where the agent updates its 'sensor state' (viewpoint/crop) to maximize information gain
Architecture
Architecture Figure Figure 2
The Active-o3 framework illustrating the interaction between the Sensing Model and the Task Model within an environment loop.
Evaluation Highlights
  • Demonstrates zero-shot success on the V* benchmark (visual search) where baseline Qwen2.5-VL fails (qualitative result from Figure 1)
  • Significantly enhances active perception capabilities compared to Qwen-VL2.5-CoT across downstream tasks (claimed in abstract)
  • Consistently improves performance on small object detection and dense object grounding under fixed computational budgets
Breakthrough Assessment
8/10
First framework to explicitly train MLLMs for active perception using RL (GRPO), moving beyond passive processing or simple heuristics. Addresses a fundamental limitation of current static-view MLLMs.
×