GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that scores each sampled output relative to the other outputs in its group rather than against a learned value function
CoT: Chain-of-Thought—a prompting or training method where the model generates intermediate reasoning steps before producing a final answer
SFT: Supervised Fine-Tuning—training a pre-trained model on a specific labeled dataset to adapt it to a downstream task
MSLM: Multimodal Small Language Model—a compact multimodal model (typically <7B parameters) optimized for efficiency
SAM: Surface-to-Air Missile—a type of military installation targeted for detection in this paper
PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm, of which GRPO is a variant
ViT: Vision Transformer—a model architecture that processes images as sequences of patches, used here as the visual encoder
SAMData: The custom dataset introduced in this paper, containing expert-verified satellite imagery of missile sites and civilian areas
C0/C1/C2: SAMData categories, where C0 is easy military sites (detected by the teacher), C1 is hard military sites (missed by the teacher), and C2 is civilian areas (negatives)
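
As an illustration of the GRPO entry, the group-relative idea can be sketched as follows. This is a minimal sketch of the commonly described formulation (each reward normalized by its group's mean and standard deviation), not necessarily the exact variant used in the paper. The function name, the `eps` term, and the reward values are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one scalar reward per output sampled for the same
    prompt. `eps` is an assumed small constant that avoids division by
    zero when every output in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four outputs sampled from one prompt
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Outputs that beat the group average receive a positive advantage and the rest a negative one, so no separate value function is needed to provide a baseline.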