Pixel-space reasoning: A reasoning paradigm where the model executes operations (zoom, select-frame) on the visual input itself, rather than just generating text
RaPR: Rate of Pixel-space Reasoning—the frequency with which the model triggers visual operations during reasoning
GRPO: Group Relative Policy Optimization, a reinforcement learning algorithm (introduced by DeepSeek) that samples a group of responses per prompt and scores each one by its reward relative to the rest of the group, so no separate value model is needed
CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer
SFT: Supervised Fine-Tuning—training the model on labeled demonstrations (expert trajectories) before RL
VLM: Vision-Language Model—an AI model capable of processing and reasoning about both images/video and text
RL: Reinforcement Learning—training an agent to take actions that maximize a cumulative reward
Curiosity-driven reward: An intrinsic reward signal that encourages the model to explore actions (such as visual operations) that it might otherwise avoid because they are harder to execute well
Visual Operations: Executable functions like 'zoom-in' or 'select-frame' that return visual tokens to the model
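
To make the GRPO entry above concrete, here is a minimal sketch of the group-relative advantage computation, assuming the commonly described form in which each reward is normalized by the mean and standard deviation of its group. The function name `group_relative_advantages`, the `eps` stabilizer, and the sample rewards are illustrative assumptions, not details from the source.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Assumption: advantage_i = (r_i - mean(group)) / (std(group) + eps).
    `eps` is a hypothetical stabilizer for groups with identical rewards.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for a single prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive a positive advantage and the rest a negative one. When every response in the group earns the same reward, all advantages are zero and that group contributes no learning signal.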