System 1: Fast, automatic, and intuitive thinking (e.g., recognizing an object instantly)
System 2: Slow, deliberate, and analytical thinking (e.g., solving a math problem step-by-step)
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that updates a policy based on the relative performance of a group of generated outputs for the same input
VLM: Visual Language Model—an AI model capable of processing and understanding both image and text inputs
Rollout: A complete sequence of text generated by the model during the sampling phase of reinforcement learning
Thinking Mode Auto-Labeling: The process of categorizing training data as 'fast' or 'slow' based on the length of answers generated by the pre-trained model
Hybrid Group Response Sampling: A training strategy in which half of the outputs sampled for an input are forced to begin with a specific thinking prefix (fast or slow) and the other half are generated freely, so that the model learns to associate each prefix with its thinking mode
KL penalty: Kullback-Leibler divergence penalty—a regularization term used in RL to prevent the trained model from deviating too drastically from the reference model
Hallucination: Output that sounds plausible but is factually incorrect or describes things that do not exist (e.g., objects that are not in the image)
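
As a rough illustration of three of the terms above (GRPO, KL penalty, and Thinking Mode Auto-Labeling), the following minimal Python sketch may help. It assumes the commonly used GRPO formulation in which each rollout's reward is normalized by the mean and standard deviation of its group, and the `exp(d) - d - 1` KL estimator often paired with it. The function names, the use of the population standard deviation, the `eps` constant, and the `threshold` parameter are hypothetical choices made for this example; the source does not specify them.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to the other rollouts for the same input.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_penalty(logp_policy, logp_ref):
    """Per-token KL estimate between the policy and the reference model.

    Uses exp(d) - d - 1 with d = logp_ref - logp_policy, which is
    non-negative and equals zero when the two log-probabilities match.
    """
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1.0


def auto_label(answer_lengths, threshold):
    """Label a sample 'slow' if the pre-trained model's answer is longer
    than `threshold` tokens, else 'fast'. The threshold is hypothetical."""
    return ["slow" if n > threshold else "fast" for n in answer_lengths]


if __name__ == "__main__":
    # Four rollouts for one input; two were rewarded.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    print(kl_penalty(-1.0, -2.0))
    print(auto_label([12, 340], threshold=100))
```

Because the advantages are centered on the group mean, a rollout is only reinforced if it beats the other rollouts for the same input, which is why no separate value model is needed in this formulation.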